17  Data Visualization Principles & Perception

Note: Learning Goals

After this lesson, you should be able to:

  • Explain when, and why, to use a data visualization
  • Describe common features of “good” data visualizations
  • Identify principles of visual perception that can be used to make effective and expressive plots
  • Compare the features and utility of various plot types
  • Choose an appropriate kind of plot based on the data
  • Know where to find more resources on making accessible and equitable data visualizations

Data visualization is the graphical display of abstract information to help us make sense of phenomena and to communicate these findings. It is a powerful tool to help us uncover and share the stories of our data. Visualizations help us retain and analyze all the information in our data, uncover and share our insights, and describe our research in a useful way. If a picture is worth a thousand words, then a good data visualization is worth millions.

But how many of us have ever taken a course explicitly on data visualization? It’s typically not taught in standard data analysis courses, yet it is a mainstay for nearly every sector in today’s data-driven world. Today we’ll dive into the what, how, and why of data visualization and describe some best practices that you can immediately implement into your research workflows. Along the way we’ll also focus on building up our collective data literacy skills, and employ critical approaches to produce science that is more robust, transparent, and equitable.

17.1 A Brief History of Data Visualization

Figure 17.1: (source)

Data visualization is not a modern invention. Quantitative information display has been traced back to prehistory, with the locations of stars mapped in the Lascaux cave drawings. Clay tokens, quipu, and stick charts dating back as far as 5500 BC also illustrate our long history of creating shared representations of data. The oldest known data visualization dates to 1160 BC with the Turin Papyrus Map, which accurately illustrates the distribution of geological resources in a region in Egypt. These earliest forms of data visualization served purposes of navigating culture and living within society, from accounting to agriculture, transportation, religion, and medicine. They were used to help us explore and understand natural phenomena and the workings of the universe.

The French philosopher René Descartes is credited with developing the precursor to today’s modern plot in the 17th century: a two-dimensional coordinate system for displaying values. Later, in the 18th century, William Playfair began creating left-to-right oriented plots, allowing the viewer to explore how values change over time. He is also credited with inventing the bar graph and, unfortunately, the pie chart (we’ll get to why that’s unfortunate later).

Unsurprisingly, the invention of tools like paper and computers shaped our relationship with knowledge and information, playing a strong role in how we collect, analyze, store and visualize data. As we gather more and more complex data, we seek more ways to visualize its meaning and in the 19th and 20th centuries we see an explosion of chart types and techniques for communicating with statistical graphics.

Note: See also

For a more complete history of data visualization, check out A Brief History of Data Visualization by M. Friendly.

In particular, check out the following famous data visualizations.

17.1.1 Famous Data Visualizations

Florence Nightingale, the “mother of nursing,” produced in 1857 a rose diagram depicting seasonal sources of British soldiers’ fatalities in the Crimean War. Out of the 18,000 soldiers who had died, 16,000 had died of disease in a hospital (blue shading) rather than from their wounds (black shading). This image is credited with helping to persuade the British government to improve conditions in military hospitals.

Figure 17.2: The context surrounding this rose diagram is actually quite a bit more complex. For additional background, check out this podcast based on the book The Data Detective by Tim Harford.

John Snow, a London physician, traced the source of an 1854 cholera outbreak in Soho. By examining the locations of reported cholera deaths, Snow demonstrated that the disease was connected to a contaminated well on Broad Street, contributing to growing understanding that cholera was a waterborne disease and not caused by foul ‘miasmas’ in the air. He later used a map in his publication to show the concentration of the cholera cases around the contaminated pump. On this map, the height of the dark bars corresponds to the number of deaths at a given location. While Snow didn’t invent the mapping technique of layering thematic data on top of topographic maps, nor actually compose the map himself (it was created by cartographer Charles Cheffins), this map was so effective that history often calls Snow the “father of epidemiology.” Learn more about the history of the map in this recent post by Kenneth Field.

Figure 17.3

Lastly, no discussion of the history of data visualization is complete without a nod to Charles Minard’s acclaimed depiction of Napoleon Bonaparte’s ill-fated invasion of Russia. Edward Tufte declared in his popular 1983 book The Visual Display of Quantitative Information that Napoleon’s March “may well be the best statistical graphic ever produced.” The thick band denotes the size of the army at each position, beginning at the Polish-Russian border. The dark lower band is tied to temperature and time scales, and shows the path of Napoleon’s retreat from Moscow and shrinking army size during the bitterly cold winter.

Figure 17.4

This image has also been recreated with modern plotting software, including the ggplot2 package in R, which you will learn about in the next lesson.

Figure 17.5: (source: ggplot2: Grammar of Graphics in R by Hadley Wickham)

17.2 What is Data Visualization?

At their core, data visualizations are products that:

  • Represent data.
  • Have a specific purpose.
  • Tell a data-driven story.

There are two main types of data visualizations:

  1. Information visualizations (aka infographics or infoviz) tend to be visually striking, dramatizing a problem with unique and visually appealing imagery that draws the casual viewer in.
  2. Statistical graphics aim to make comparisons, to reveal patterns and discrepancies. We use statistical graphs to communicate our research results, often for viewers who are already immersed or interested in the problem.

While many topics within this reader will apply to infoviz as well, our emphasis is on creating judicious and accurate statistical graphics.

Note: See also

For more info comparing infoviz and statistical graphs, see Gelman and Unwin 2013.

17.2.1 Why Viz?

There are lots of ways to represent our data. In fact, tables are often the most common way to report data, and they are great at conveying exact values. But interpretation of data displayed in a table is largely up to the viewer. It’s hard to perceive the overall summary of the data from a table, unless it’s really simple and, in that case, you often don’t even need a table and can just report those statistics as text.

Data visualization, on the other hand, takes advantage of our ability to process information visually, shifting the balance from cognition toward perception to convey a specific message. Much of the information that’s sent to our brains is visual. An often-repeated claim holds that the human brain processes visual imagery 60,000 times faster than text; that figure is poorly sourced, but the underlying point stands: we take in images quickly. Data visualizations allow us to move from a predominantly thinking perspective to a seeing perspective. Deliberate, conscious reasoning is slower and less efficient than the largely automatic processing of images by our visual system. Thus, visual diagrams are often easier for us to process than pages of words describing our research. Absorbing information quickly allows us to make novel inferences and more productive and informed decisions. Not surprisingly, well-composed data visualizations are among the most effective forms of scientific communication.

Note: See also

For guidance on how to convert a table into a plot, see this paper by Andrew Gelman.

Ultimately, the utility of a data visualization depends on how well it’s composed.

17.2.2 Good Data Visualizations

  • Provide rapid access to data.
  • Faithfully represent the data and tell a story.
  • Are expressive.
  • Are effective.

Helpful data visualizations intuitively, clearly, accurately, and efficiently explain complex ideas. The patterns and relationships presented must be valid, and the visual relevant to the data it presents. A data visualization cannot exist without a narrative, and good data visualizations always include context. Good plots grab our attention and create a positive visual impact. This aids our ability to make connections and recall the features of the data. They can be aesthetically pleasing but that’s not the end goal. Good plots are accessible (not everyone perceives the visual world the same way). They leverage aspects of human perception to allow for intuitive inference of relationships between abstract concepts (our data).

Note: See also

Want to feel inspired? Check out Information is Beautiful and Flowing Data.

Figure 17.6: (source)

17.2.3 Bad Data Visualizations

  • Have too much, or too little, information.
  • Are inconsistent.
  • Ignore limits of human perception.
  • Misrepresent the data.
  • Use inappropriate (or garbage) data.

Have you ever seen a pie chart where the labeled slices add up to something other than 100%? That’s a poorly executed data visualization. Goal: don’t end up on WTF Viz.

Figure 17.7: (source)
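The pie chart problem described above is easy to catch with a quick script before a figure goes out the door. The following is a minimal sketch in Python (standard library only). The function name, the slice values, and the rounding tolerance are all assumptions made for illustration, not part of this lesson.

```python
import math


def shares_sum_to_whole(shares, total=100.0, tolerance=0.5):
    """Return True if the labeled shares add up to the whole.

    The tolerance (an assumed value) allows for rounding in the labels,
    for example three slices labeled 33.3% each.
    """
    return math.isclose(sum(shares), total, abs_tol=tolerance)


# Hypothetical slice labels, in percent, for two made-up pie charts.
consistent = [45.0, 30.0, 25.0]
inconsistent = [60.0, 35.0, 20.0]  # these labels add up to 115

print(shares_sum_to_whole(consistent))    # True
print(shares_sum_to_whole(inconsistent))  # False
```

A check like this only catches arithmetic slips. It says nothing about whether a pie chart was the right choice in the first place.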

17.3 Before You Viz, Make a Plan

Modern software makes it easy to quickly create a plot. But before you fire up your computer and start plotting, stop and think. Write out your visualization plan. This will save you time in the long run, and result in a more robust data visualization.

Ask yourself:

  • Why am I making this visualization? (purpose)
  • Who am I making it for? (audience)
  • How will I use and share it? (medium)
  • What can I use to make it? (tools)
  • What story does it tell? (message)
  • Who does it affect? Who is left out? (critical approach)

Tip

How many plots you need is always the wrong question. You need exactly as many as you need to tell your story.

17.3.1 Purpose

First, identify why you are making a visualization. We use data visualizations in different ways across the iterative steps of the research data pipeline:

Figure 17.8

  • Collection: plots can help us understand who, what, and where the data represent. They can also help us track our progress and project the effort required to complete this phase of the project.
  • Cleaning: plotting is a quick and effective way to spot errors in our data. It allows us to grasp the extent of issues such as outliers and missing data.
  • Exploration: plots are a powerful tool for exploratory data analysis (EDA). Plots help us identify patterns and summarize variables and relationships (see Tukey 1960).
  • Confirmation: plots also help us conduct confirmatory data analysis (CDA). We can plot diagnostics like the model fit, residuals, and model comparisons that confirm whether a model is correct. CDA is an iterative process over the course of research, one reason why we advocate using scripting languages and other reproducible workflows for generating graphics.
  • Validation: plots also help us to debug and validate our code. We can visually inspect the results at each step of the code we are writing and verify whether it satisfies our expectations.
  • Communication: sharing the insights from our data with others is probably the most commonly understood and emphasized purpose of data visualizations. This is also often the hardest type of data visualization to “get right,” because we don’t always remember to design the visual to speak specifically to who we are sharing it with.

17.3.2 Audience

Who are you making the data visualization for? There is no such thing as a “generic” data visualization. Are you making the figure for:

  • Yourself, to help you clean or explore your data?
  • Your immediate colleagues or research team to update them on your research progress?
  • Experts in your field reading your publication or listening to your presentation?
  • A general audience as part of your public outreach?
  • Policy makers who might not know all the details but might be making big decisions based on your results?

Knowing who you’re making the visualization for will help you think through the following steps to create something of value for your intended purpose. It will also help you determine how much effort is needed to compose a plot that achieves your goal.

17.3.3 Medium

There are always constraints when creating a data visualization. It’s best to discover these before you start, rather than after you’ve created a beautiful data visualization that’s completely inappropriate for your intended use.

If you’re creating the visual to accompany a journal article, you probably need to use a static figure and not an interactive or dynamic dashboard. Does your journal allow for color figures? When in doubt, start with greyscale—it’s a lot easier to add color, rather than take it away, as you revise your figures.

If you’re showing the figure during a presentation, you probably want to simplify it: your audience will have 5 seconds max to read, understand, and interpret your visualization. A really complex figure that requires minutes to comprehend will just distract your audience from what you and your data are saying. It might be more effective to compose and display the same plot in different ways to best communicate your points.

For a poster presentation where your audience is expected to spend significant time pondering over your findings, you might want to have one very large, clear figure that disentangles the complexity of your project.

If you’re creating a visual for a website, you might be able to go nuts—bring on the interactivity, the dynamic data display—until you crash the server because it requires too much compute time.

Note: Checklist

17.3.4 Tools

At the UC Davis DataLab, we advocate for the use of open-source software and scripting languages for data-driven research projects, including for generating data visualizations.

Using scripting languages makes it easy for you to reproduce your data visualizations. As you clean and update your data, you can re-create your visuals easily by re-running your code. You can also return to a figure later and know exactly what it represents and how you made it. You don’t have to worry about remembering which buttons you clicked, and in what order, as you would with GUI-based software.
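To make the reproducibility argument concrete, here is a minimal sketch of a scripted figure, written in Python with only the standard library (the same idea applies to R or any other scripting language). The function name, the layout parameters, and the data are hypothetical. The point is that the figure is a pure function of the data, so re-running the script after a data update regenerates the same kind of figure with no clicking involved.

```python
from xml.sax.saxutils import escape


def bar_chart_svg(data, bar_width=40, gap=10, height=120):
    """Build a minimal SVG bar chart from (label, value) pairs.

    Assumes at least one value is positive. Bars share a common baseline,
    so values are encoded as aligned lengths.
    """
    max_value = max(value for _, value in data)
    width = len(data) * (bar_width + gap) + gap
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height + 20}">'
    ]
    for i, (label, value) in enumerate(data):
        bar_height = height * value / max_value
        x = gap + i * (bar_width + gap)
        y = height - bar_height
        parts.append(
            f'<rect x="{x}" y="{y:.1f}" width="{bar_width}" '
            f'height="{bar_height:.1f}" fill="grey"/>'
        )
        parts.append(
            f'<text x="{x + bar_width / 2}" y="{height + 15}" '
            f'text-anchor="middle" font-size="10">{escape(label)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Made-up counts; in practice these would be read from your cleaned data.
counts = [("A", 12), ("B", 30), ("C", 21)]
svg = bar_chart_svg(counts)
print(svg.count("<rect"))  # one bar per data row
```

Writing the returned string to a file with an .svg extension gives a static figure that can be regenerated whenever the data change.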

Using free, open-source software also means that you can easily and freely share your data, code, and output with your collaborators, reducing the equity and reproducibility barriers posed by the use of proprietary software. Open-source software that’s great for plotting—like R—also has amazing user communities and resources to help you learn the code and create your ideal visualization.

Tip

Be practical with yourself: you probably aren’t going to learn a new package or other plotting software overnight. If your conference talk is tomorrow, using familiar software like Excel for plotting can be fine, especially if you know some tricks to clean up and customize the appearance of your plots.

Here’s a non-exhaustive list of open source tools we recommend for data visualization:

17.3.5 Message

Research is storytelling with data. Every data visualization is an important piece of that story. It may help you confirm (or reject) a hypothesis, discover new correlations, or predict the likelihood of a future event.

Creating statistical graphics is like writing a novel—you get to decide who and what will be featured in your data story. And just like one page of a novel, your data visualization alone doesn’t tell the whole story. Every data visualization should contain the details required for explanation, and they require narratives.

Write out captions for each plot before you make it. What does the plot show? After creating the plot, go back and update the caption with the take home points for your viewer. How might others focus on a different message? If you can’t articulate what the plot is about then you probably should rethink what you are choosing to display and how you are showing it.

17.3.6 Critical Approaches

Important

Don’t skip this step. It’s last on this list but is the most important on your journey to making useful data visualizations. Data are information, and information is power. Use this power intentionally and mindfully throughout the process of creating and sharing your visualizations.

As you reflect on your answers to the planning prompts above, critically review the features of your data:

  • What do the variables you’ve selected for your visualization mean? How are they defined? How did those definitions come to be? Why did you select them?
  • Who will your data visualizations affect? What groups are left out? How does this affect the story your data tells? How might someone misrepresent or misunderstand your story? Bring back the bodies.

Conducting these connotative and denotative explorations of your data will not only result in a more robust visualization, but will make you a better researcher and support a more inclusive and equitable society.

Note: See also

To learn more and practice these steps on some case studies, check out our Critical Approach to Data Visualization workshop and Data Feminism research and learning cluster.

17.4 Graphical Elements of a Plot

A data visualization is useful only if it encodes information in a way that our eyes can perceive and our brain can understand. Marks and channels are the building blocks of all data visualizations and are employed to accomplish this encoding.

Marks are the basic geometries, or graphical elements, in a plot that depict our data items or their linkages. Marks indicate “where” something is and include points (0d), lines (1d), areas (2d), and volumes (3d).

Channels are the attributes that control how the marks appear. Channels are used to encode (or indicate) the values or meaning of our data. Channels were first described in the mid-20th century by Jacques Bertin in his book Sémiologie graphique (Semiology of Graphics [1967]), which argues that visual perception operates according to rules that can be followed to express information visually in intuitive, accurate, and efficient ways. He described seven main categories of visual variables (channels): position (location), size, shape, value (lightness), color (hue), orientation, and texture. More recent publications list up to 12 channels useful for encoding meaning in data visualizations (Roth 2017).

By understanding the nature of our data in combination with the principles of visual perception, we can decide which marks and channels to use for a given data visualization.
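One way to internalize the split between marks and channels is to write a plot down as a small data structure: one mark type, plus a mapping from channels to the data variables they encode. The sketch below is in Python and is purely illustrative. The class name, the channel names, and the variables (height, weight, species) are hypothetical and not taken from any plotting library.

```python
from dataclasses import dataclass, field

# Mark types and their dimensionality, as described in the text.
MARKS = {"point": 0, "line": 1, "area": 2, "volume": 3}


@dataclass
class PlotSpec:
    """A plot described as one mark type plus channel-to-variable encodings."""

    mark: str
    encodings: dict = field(default_factory=dict)  # channel -> data variable

    def __post_init__(self):
        if self.mark not in MARKS:
            raise ValueError(f"unknown mark: {self.mark!r}")

    def encode(self, channel, variable):
        """Map one data variable onto one channel; a channel is used only once."""
        if channel in self.encodings:
            raise ValueError(f"channel {channel!r} is already in use")
        self.encodings[channel] = variable
        return self


# A scatterplot: point marks, with three hypothetical variables encoded.
scatter = (
    PlotSpec("point")
    .encode("x position", "height")
    .encode("y position", "weight")
    .encode("color", "species")
)
print(scatter)
```

Grammar-of-graphics tools are organized around a similar separation between the geometry drawn and the mappings from data to visual attributes, although their actual interfaces differ from this sketch.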

Figure 17.9: (source: Visualization Analysis and Design by Tamara Munzner)

17.5 Principles of Visual Perception

Leveraging principles of visual perception (the ability to see and interpret surrounding visual information) will help us identify appropriate plot types and design better, more informative graphics. Humans are wired to look for structure, patterns, and logic. Our brains are amazing—they take ambiguous visual information and transform it into something organized, symmetrical, or familiar so we can understand it. But, we don’t process all visual information equally.

17.5.1 Visual Magic Tricks

Take a look at the following questions and images.

Figure 17.10: Which line is bigger?

They’re the same length, if you pay careful attention to the scales of the axes!

Figure 17.11: Which inner circle is bigger?

The circles are the same size.

Figure 17.12: Do these lines connect?

The lines do NOT connect. Hold up a ruler or straight edge and prove it for yourself.

Figure 17.13: Is the center bar in this image by Dodek a gradient?

Nope! It’s a solid color.

Figure 17.14: What shape(s) do you see in this image?

Did you see a vase or two faces?

These visual “magic tricks” work because they capitalize on innate weaknesses in our visual perception.

17.5.2 Stevens’ Psychophysical Power Law

Research studies by Stanley Smith Stevens and others have shown that we exhibit innate biases in how we perceive magnitude changes in the intensity of various types of stimuli.

For example, we perceive the intensity of an electrical shock to a greater degree than its actual, physical intensity would seem to warrant. We’re also poor at accurately perceiving changes in brightness: we estimate brightness to increase less than it actually does. However, our perception of length is nearly perfect: perceived length increases in proportion to the actual increase. This is especially true if lengths are aligned and on the same scale. Knowing this can help us design more intuitively useful plots.
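The power law is usually written as S = I^n, where I is the physical intensity, S is the perceived sensation, and n is an exponent that depends on the type of stimulus. The short Python sketch below shows what that means when a stimulus doubles. The exponents are approximate, commonly cited values (about 0.5 for brightness, 1.0 for length, and 3.5 for electric shock). Treat them as illustrative assumptions, since reported values vary with experimental conditions.

```python
# Approximate, commonly cited exponents; assumed here for illustration only.
EXPONENTS = {
    "brightness": 0.5,      # compressed: changes feel smaller than they are
    "length": 1.0,          # linear: changes feel as large as they are
    "electric shock": 3.5,  # expanded: changes feel larger than they are
}


def perceived_ratio(physical_ratio, exponent):
    """Perceived change when the physical stimulus changes by physical_ratio."""
    return physical_ratio ** exponent


for stimulus, n in EXPONENTS.items():
    ratio = perceived_ratio(2.0, n)
    print(f"{stimulus}: doubling the stimulus is perceived as x{ratio:.2f}")
```

With these exponents, a doubled length is perceived as roughly doubled, a doubled brightness as only about 1.4 times brighter, and a doubled shock as more than 11 times stronger. This is one reason aligned lengths make a more trustworthy encoding than brightness.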

Figure 17.15: (source: Figure 5.7 in Visualization Analysis and Design by Tamara Munzner)

17.5.3 Perception and Encodings

Based on psychophysics, we can rank encodings to help us identify which ones will more accurately allow us to judge differences in relative magnitudes, which is important when working with ordinal, interval or ratio data.

From most to least accurate by magnitude perception:

  • Position along a common scale. Spatial position is the easiest feature for us to recognize and evaluate, and unsurprisingly is used in the most common plot types: bar charts, scatterplots.
  • Positions along identical but nonaligned scales. Small multiples, grid, lattice, panel, and trellis charts.
  • Length. We can easily recognize proportions and evaluate lengths, especially when they are aligned, such as in bar charts.
  • Direction. We recognize directionality fairly easily. Trend charts utilize this to demonstrate changes over time.
  • Angle, slope. It’s harder to evaluate angles than length or position. Pie charts can be as efficient as stacked bar charts, unless there are more than 3 parts to the whole. But ask yourself: if there are 3 or fewer parts, do you really need a visualization?
  • Area. Determining the relative magnitude of areas is much harder than comparing lengths, so area should be used (as in bubble charts) to indicate relative importance, not absolute magnitude changes.
  • Volume. 3D objects as represented in 2-D space are hard to evaluate. Avoid them. I’m looking at you, exploding 3D pie chart.
  • Curvature. Perceiving changes in the degree of a curve magnifies the difficulties in detecting direction, angle, and non-aligned lengths.
  • Density, color saturation and shading. Color is the least accurate way to convey patterns. Saturation is the intensity of a single hue, and increasing color intensity is intuitively perceived as correlating to an increasing value. But individual hues are hard to compare to one another. Heatmaps along the same color gradient can be a good way to convey an overall picture of change in values over a range. We’ll talk more about color later on.
  • Color hue. For data visualizations, color hue is the most challenging encoding to detect changes in magnitude.

17.5.4 Evaluating Graphics

No matter how clever the choice of the information, and no matter how technologically impressive the encoding, a visualization fails if the decoding fails. (Cleveland 1983)

How do we detect if our encodings have failed? Munzner uses the principles of expressiveness and effectiveness to help us evaluate our data visualizations.

The expressiveness of a visual encoding should “express all of, and only, the attributes of the data.” It is violated when we use encodings that do not match our data type or our visualization goals. When it fails, a chart is not only sub-optimal and confusing, it can be incorrect and misleading. Charts can fail the expressiveness test if their encodings imply ordering when there actually is none, or they mis-order a variable.

The effectiveness of a visual encoding addresses how accurately the interpreter of the chart can decode the encodings within it and derive accurate knowledge. According to Munzner, “the importance of the attribute should match the salience of the channel,” meaning we should use channels at the top of the list to encode the variables that are the most important to communicating our data story.
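The effectiveness principle can be sketched as a simple matching exercise: sort the attributes by how much they matter to the story, then hand out channels in order of perceptual accuracy. The Python sketch below is deliberately naive. The channel names paraphrase the ranked list in the previous section, and the function name, attribute names, and importance scores are hypothetical. Real plots usually use two position channels (x and y) and must also respect the data type of each attribute.

```python
# Channels for ordered data, from most to least accurately perceived,
# paraphrasing the ranked list in the previous section.
CHANNELS_BY_ACCURACY = [
    "position on a common scale",
    "position on nonaligned scales",
    "length",
    "direction",
    "angle or slope",
    "area",
    "volume",
    "curvature",
    "color saturation or shading",
    "color hue",
]


def assign_channels(importance):
    """Give the most important attributes the most accurate channels.

    importance maps attribute name -> score (higher means more important).
    """
    if len(importance) > len(CHANNELS_BY_ACCURACY):
        raise ValueError("more attributes than available channels")
    ranked = sorted(importance, key=importance.get, reverse=True)
    return dict(zip(ranked, CHANNELS_BY_ACCURACY))


# Hypothetical attributes and importance scores for one data story.
story = {"population": 1, "income": 3, "life expectancy": 2}
for attribute, channel in assign_channels(story).items():
    print(f"{attribute} -> {channel}")
```

If a plot ends up needing channels far down the list, that is often a sign it is trying to show too many attributes at once.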

When looking at a plot, can you accurately detect differences in the sizes of the bubbles? Can you discriminate between all of the colors, compare the shades? Can you separate the dimensions of the data?

Taking these principles together, when we want to compare magnitudes of ordered data (numeric, continuous, or ordered qualitative data, like height, weight, number of children in a family, or a rating), we should use encodings at the top of Mackinlay’s list above.

Conversely, for nominal data (categorical or un-ordered qualitative data—like gender), use these identity channels:

  • Shape: glyphs are effective at grouping categorical attributes together. But be mindful that the more shapes you use, the harder it will be for a viewer to remember what corresponds to which specific data attribute.
  • Color: while color can be very effective in data visualizations (see the Gestalt principles below), typically less is more. Apply contrasting colors only to differences in meaning in the data, or to emphasize the main elements. Start with grey, and add color only as necessary. And be mindful when defining your color palette. ColorBrewer and Viz Palette provide palettes that optimize our perceptive abilities and design for accessibility. Do a color check: how will a person with colorblindness perceive your graphics?
  • Texture: similar to shape and color, texture can be useful for differentiating between categories or separate areas. Textures can be particularly effective at replacing colors, such as in black and white figures, and for increasing accessibility by reinforcing a color encoding.

Combining channels can result in integral or separable coding pairs, respectively allowing attributes to be perceived holistically or with separate judgments regarding their graphical dimension.

17.5.5 Gestalt Principles

In addition to decoding specific elements, our brains have an amazing ability to create and perceive structure among visual objects. This is commonly referred to as the Gestalt principles of visual perception. This framework, combined with Stevens’ Law, can help us think through how to use marks and channels together to create expressive and effective data visualizations:

  • Similarity: objects with the same visual properties are assumed to be similar and are grouped together.
    • Example: use design elements such as shape, color, and organization to indicate groupings of the data. In design theory these are called “preattentive features” because we actually see and perceive them before we really think about them. In some experiments it was found to take less than 0.5 seconds for the eye and brain to process a preattentive property of an image.
  • Proximity: objects that are close together are perceived as a group.
    • Example: since physical distance connotes similarity, grouping bars on a chart can indicate similarities among their data. Instead of listing it in a legend, directly label data groupings by adding informative text directly onto the graph.
  • Continuity: elements that are aligned (on the same line, curve, or plane) are perceived to be more closely related to each other than to other elements.
    • Example: it is often easier for us to perceive the groupings if the shapes are curves, rather than lines with sharp edges.
  • Enclosure: objects that appear to have a boundary around them (i.e., are found within the same common or enclosed region) are perceived as being related.
    • Example: Add line boundaries or shades to group objects.
  • Connection: objects that are connected, such as by a line, are perceived as a group.
    • Example: connect different data together to indicate a relationship. This connectedness is highly effective as it often over-rules the other principles for group perception. Every line plot is an example of connectedness.
  • Closure: complex arrangements of visual elements are perceived as a single, recognizable pattern.
    • Example: open structures are often perceived as closed, complete and regular.
  • Figure and Ground: objects are perceived as either standing out prominently in the foreground (or front figure) of an image, or receding into the background.
    • Example: shading or color blocking can be employed to distinguish between the more important figure and less important ground features of an image. Place elements of the most importance in the foreground figure.
  • Focal Point: whatever stands out visually is perceived as the most important. It grabs our attention first and holds it the longest.
    • Example: use design elements selectively to draw attention to the most important features of the data.

Figure 17.18: Gestalt principles for perceptual grouping and figure-ground segregation. (source: Gestalt Principles for Attention and Segmentation in Natural and Artificial Vision Systems by G. Kootstra, N. Bergstrom, D. Kragic (2011).)

17.6 Accessible Data Visualizations

17.6.1 Color

Color can be one of the most challenging, and most important, attributes to apply to a plot. Special care must be taken when applying color to our data visualizations to ensure they are accessible to persons with color blindness. Color blindness prevents viewers from distinguishing between certain colors, their brightness, and/or shades of a color. Because it affects approximately 1 in 12 men (8%) and 1 in 200 women (0.5%) around the world, it is likely that some viewers of your data visualization will perceive its colors differently.
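To see how quickly “likely” becomes “almost certain,” the prevalence figures above can be turned into a rough probability that at least one person in an audience is affected. This Python sketch assumes that viewers are independent of one another and that the quoted rates (8% of men, 0.5% of women) apply to your audience. Both are simplifying assumptions, and the function name is hypothetical.

```python
# Prevalence figures quoted in the text.
RATE_MEN = 0.08
RATE_WOMEN = 0.005


def chance_someone_is_affected(n_men, n_women):
    """Probability that at least one audience member has color blindness.

    Assumes independence: P(at least one) = 1 - P(no one is affected).
    """
    nobody = (1 - RATE_MEN) ** n_men * (1 - RATE_WOMEN) ** n_women
    return 1 - nobody


# Hypothetical audiences of increasing size, half men and half women.
for size in (10, 20, 50, 100):
    p = chance_someone_is_affected(size // 2, size // 2)
    print(f"audience of {size}: {p:.0%}")
```

Under these assumptions, even a seminar room of 20 people (10 men and 10 women) has a better than even chance, about 59%, of including at least one viewer who perceives the colors differently.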

Overall we’re not doing a good job at using color mindfully in our science communication. If you want to use color, the following are some recommendations to keep in mind.

Recommendation 1: Avoid problematic color combinations. The most common types of color blindness make it hard to tell the difference between red and green (deuteranopia and protanopia). Blue-yellow color blindness (tritanopia) is less common. Avoid using red/green, green/brown, green/blue, and blue/gray combinations. Many graphing programs unfortunately use these combinations by default, and you will have to change them manually in your figures.

To demonstrate why these combinations are problematic, here is a color vision test:

Recommendation 2: Use an online tool to help you pick a colorblind friendly palette depending on your data and visualization needs. Examples include:

Figure 17.20

Recommendation 3: Use a colorblindness simulator to check your visualization. Who won’t be able to see the differences you’re trying to display with color? Here are a few simulators:

Recommendation 4: Add textures, symbols, or other channels to reinforce the grouping attributes on your plot.

Figure 17.21: (source)

Recommendation 5: Rethink your plot. You may not actually need color at all to effectively display your data.

Figure 17.22: (source)
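A quick, rough way to test whether a palette survives without color is to compare how light or dark each color is. The Python sketch below computes relative luminance using the standard sRGB formula. That formula, the function names, the example colors, and the 0.1 threshold are assumptions brought in for illustration, not part of this lesson. It is also not a colorblindness simulation, only a check of what remains once hue is removed.

```python
def relative_luminance(hex_color):
    """Relative luminance (0 = black, 1 = white) of an sRGB color like '#ff0000'."""
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5)]
    # Undo the sRGB gamma encoding for each channel.
    linear = [
        c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    red, green, blue = linear
    return 0.2126 * red + 0.7152 * green + 0.0722 * blue


def similar_in_greyscale(color_a, color_b, min_difference=0.1):
    """True if two colors are hard to tell apart by lightness alone.

    The threshold is a hypothetical value chosen for this example.
    """
    difference = abs(relative_luminance(color_a) - relative_luminance(color_b))
    return difference < min_difference


# Pure red and mid grey look very different in color,
# but have nearly the same luminance.
print(similar_in_greyscale("#ff0000", "#808080"))  # True
print(similar_in_greyscale("#ff0000", "#ffffff"))  # False
```

If two categories in a plot fail a check like this, consider reinforcing the distinction with texture, shape, or direct labels rather than relying on hue alone.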

Here are some more resources to help you use color effectively and mindfully in your data visualizations.

Color and design:

Color accessibility in R:

17.6.2 Alternative Text

So far we’ve taken for granted that visualization is an accessible mode of communication, but researchers and audiences alike are not all sighted. RStudio is behind on vision impairment accessibility, but some packages can provide text descriptions and sonification/audification of plots to improve accessibility for non-visual data interaction.

For example, the BrailleR package has a VI function that wraps around ggplot objects and provides a text description as output. This description is a starting point, but it does not summarize the data itself, so it is important to also provide informative figure captions or embedded alternative text so that all viewers are able to interpret the visualization.

Other packages, like the sonification package with its sonify function, can be used to represent data in audio form. With the function, the x-axis can span sound across time, so that the length of time a sound plays follows the data along the x-axis from left to right; the y-axis can be expressed as pitch, so that the pitch of the sound matches the values of the data (a lower value means a lower pitch).
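
To make the mapping concrete, here is a small sketch of the idea, not the sonify function itself: x-values are rescaled to start times and y-values to frequencies. The function name, the five-second duration, the 220 to 880 Hz range, and the linear pitch scale are all assumptions chosen for illustration.

```python
def sonify_plan(x, y, duration=5.0, low_hz=220.0, high_hz=880.0):
    """Map each (x, y) point to a (start_time_seconds, frequency_hz) pair."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and the same length")
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    x_span = (x_max - x_min) or 1.0  # avoid dividing by zero for constant data
    y_span = (y_max - y_min) or 1.0
    plan = []
    for xi, yi in sorted(zip(x, y)):  # play left to right along the x-axis
        start = duration * (xi - x_min) / x_span
        freq = low_hz + (high_hz - low_hz) * (yi - y_min) / y_span
        plan.append((start, freq))
    return plan


# Hypothetical data: a value that rises, then falls.
for start, freq in sonify_plan([0, 1, 2, 3, 4], [10, 20, 40, 20, 10]):
    print(f"t={start:.2f}s  {freq:.0f} Hz")
```

The resulting list could be handed to any audio library to play one tone per point; the lowest y-value gets the lowest pitch and the highest y-value the highest.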

17.7 Designing Statistical Graphics

You are now ready to make your plot! You can combine marks and channels to create nearly any plot type, and there are many established types of statistical graphics that you can choose from to showcase your data. Each type has benefits and drawbacks, depending on how it encodes your data.

Tip

Match the chart type to your data—and what you want it to show—and not the other way around.

Step 1: Identify Your Data Type

Data can be quantitative or qualitative:

  • Quantitative data is either continuous (numerical data like height and weight), or discrete (constrained values, such as the number of children in a family).
  • Qualitative data can be ordered (categories that have a relationship but no meaningful distance between them, such as movie star ratings), or nominal (categories that have no meaningful order, such as gender).
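
The four categories above can be sketched as a rough helper. This is a heuristic under stated assumptions, not a rule: whole-number measurements (such as heights rounded to the nearest centimeter) would be labeled discrete, and the ordered_levels argument is a hypothetical way of telling the function that the categories are ranked.

```python
def describe_variable(values, ordered_levels=None):
    """Label a variable as continuous, discrete, ordered, or nominal.

    ordered_levels is a hypothetical argument: pass the category order
    (for example ["low", "medium", "high"]) when the categories are ranked.
    """
    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if numeric:
        whole = all(float(v).is_integer() for v in values)
        return "quantitative: discrete" if whole else "quantitative: continuous"
    if ordered_levels is not None:
        return "qualitative: ordered"
    return "qualitative: nominal"


star_levels = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]
print(describe_variable([1.72, 1.65, 1.80]))                 # heights in meters
print(describe_variable([0, 2, 1, 3]))                       # children per family
print(describe_variable(["3 stars", "5 stars"], star_levels))  # movie ratings
print(describe_variable(["oak", "pine", "maple"]))           # no natural order
```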

Step 2: Determine Your Functional Approach

Ask yourself:

  • What are the tasks you want the visual to support?
    • Showing how values compare to each other? How the data are distributed? How they are composed? How values relate?
  • What specific visual best supports those tasks?
  • What do you expect people to naturally do in their “visual queries” as they explore the plot?
  • How can you modify the graphical marks and channels to support faster queries?

Step 3: Select a Plot Type

Now that you’ve identified your data types and what you need your visualization to show, explore your different chart type options! Start with this nifty From Data to Viz tool. Select your data type(s) and click through for the pros, cons, and alternate options for a bevy of charts.

Figure 17.23

For example, if you want to enable accurate comparisons of individual quantitative values and their relationships, try a scatterplot or a chart with lines or bars sitting aligned on a single axis.

The following list contains an overview of some of the most common plot types you may encounter:

Single quantitative variables are plotted to show the frequency distribution of the data. While histograms are the most common distribution plots, a single quantitative variable can also be plotted using a rug plot/strip chart, boxplot, or violin plot (described below, in the context where they’re most commonly used).

Two quantitative variables can be plotted using a:

  • Scatterplot: each axis encodes the values of a different quantitative variable, and individual data are represented as points (or dots) on the chart.
  • Line plot: data points are connected by straight lines. Line-scatter plots are common for time series or trend data.

One quantitative and one qualitative variable are suitable for a:

  • Bar chart: bars represent the amount of data in different categories of a variable. One axis encodes the frequencies of the quantitative data, and the other axis the categories of the qualitative data.
  • Boxplot: shows the median, quartiles, and bounds of your data.
  • Violin plot: in essence, a boxplot that also shows the distribution of your quantitative variable.
  • Word cloud: these eye-catching visualizations display a list of words with their font size corresponding to their importance. But they require huge sample sizes and are not very useful, as they often distort reality. For example, long words will look more prominent just because they have more letters and cover more area; our eyes don’t intuitively separate word height from word length.
  • Pie chart: uses relative frequencies to show how large each category is in relation to the whole.
ImportantWarning

Pie charts are grudgingly listed here because you will see them in the wild, but resist the temptation to use them!

Based on what we know about visual perception, pie charts are inherently problematic because of the visual attributes they use to encode values. Pie charts encode data as the area of each slice, as well as the angle it forms at the center of the pie, and we judge both areas and angles less accurately than lengths or positions on a common scale. This makes it difficult to perceive and compare differences.

Over 492 posts on WTF Visualizations are tagged as pie charts! Almost any other chart type is better than a pie chart.
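
The groupings in the list above can be sketched as a small lookup table keyed by how many quantitative and qualitative variables you have. The dictionary and function names are hypothetical, and word clouds and pie charts are left out on purpose, given the cautions above.

```python
# Keys are (number of quantitative variables, number of qualitative variables).
PLOT_OPTIONS = {
    (1, 0): ["histogram", "rug plot / strip chart", "boxplot", "violin plot"],
    (2, 0): ["scatterplot", "line plot"],
    (1, 1): ["bar chart", "boxplot", "violin plot"],
}


def suggest_plots(n_quantitative, n_qualitative):
    """Return candidate plot types, or an empty list if none are recorded."""
    return PLOT_OPTIONS.get((n_quantitative, n_qualitative), [])


print(suggest_plots(2, 0))
print(suggest_plots(1, 1))
```

A table like this only narrows the options; the tasks you identified in Step 2 still decide which candidate fits.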

Other complex plot types you may encounter that layer additional marks and channels on the above chart types are:

  • Lollipop chart: a dot chart where the dots are connected by lines to an axis.
  • Mosaic plot / treemap: these plots display hierarchical or cross-classified data as sets of nested rectangles sized proportionately to their values.
  • Bubble plots: scatterplots where the size of a dot corresponds to a third numerical or ordered categorical value.
  • Radar plots / star chart: line plots where each variable has its own axis and all axes are joined at the center of the figure.
  • Network diagrams: also called graphs, these plots show connections (edges) between entities (nodes).
NoteSee also

To learn more about network diagrams, check out DataLab’s network toolkit and network analysis workshop.

NoteSee also

Google also has an interactive plot gallery. And this Stack Exchange post has even more chart type resources.

But remember, some encodings are more difficult to accurately decode. When in doubt, stick to simple figures with points and lines.
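
Bubble plots are one case where the encoding is easy to get wrong. If the radius is scaled linearly with the value, a value four times larger is drawn with sixteen times the area. A minimal sketch of area-proportional scaling follows; the function name, the maximum radius, and the example values are assumptions for illustration.

```python
import math


def bubble_radii(values, max_radius=20.0):
    """Scale radii so that bubble AREA, not radius, is proportional to value."""
    if any(v < 0 for v in values):
        raise ValueError("Bubble sizes need non-negative values")
    largest = max(values)
    if largest == 0:
        return [0.0 for _ in values]
    # area = pi * r**2, so r must grow with the square root of the value.
    return [max_radius * math.sqrt(v / largest) for v in values]


values = [10, 40, 90]  # hypothetical third variable
radii = bubble_radii(values)
areas = [math.pi * r ** 2 for r in radii]
print([round(r, 2) for r in radii])
print([round(a / areas[0], 2) for a in areas])  # area ratios track the data
```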

NoteWhat About Maps?!

Geospatial data visualization by nature is complex and encodes a lot of attributes. Interested in learning more? Check out DataLab’s Spatial Sciences research and learning cluster and workshops.

Step 4: Iterate

Create your visual, and run through step 2 again keeping in mind the principles of visual perception, effectiveness and efficiency. Does it meet your needs? If not, try a different type. Graphing, like writing, requires continuous editing.

17.8 Tips for Better Plots

Making effective data visualizations takes practice and experience. The more plots you look at, the more you will intuitively recognize what works—and what doesn’t—for data visual storytelling. One takeaway I hope you discover is the need to avoid unnecessary complexities.

Tip

If the “story” is simple, keep it simple. If the “story” is complex, make it look simple.

Below are some tips to help achieve those goals.

17.8.1 Get Rid of Chartjunk

An easy way to instantly improve your plots is to eliminate superfluous material. Extra tick marks and grid lines; unnecessary text and arrows; decimal places beyond the measurement error or the level of difference; cute little butterfly clip art: this chartjunk has no meaning and it clutters up a chart, making it hard for your viewer to see what’s most important, your data. The amount of ink on your figure should directly correspond with the amount of data you present. If it doesn’t, you have a lot of chartjunk. (Evidence #10298 that pie charts are never a good choice.)

Try these de-cluttering steps to improve your charts:

  • Shift from center to left-justified text
  • Retain white space
  • Use clear contrasts
  • Remove chart borders
  • Remove (or strongly mute) gridlines
  • Remove data markers and point labels (unless they are important)
  • Remove unnecessary polygon filling
  • Clean up and rename axis labels to be intuitive
  • Replace the title with something informative
  • Label the data directly using the principle of proximity
  • Leverage consistent color and other aesthetics

Creating visual order and reducing chartjunk will dramatically improve your graphic by helping your data stand out.

17.8.2 Facilitate Comparisons

  • Avoid having the graph elements interfere with the data
  • Juxtapose or superpose plots (using the same scales)
  • Use visually prominent symbols
  • Avoid over-plotting; try jittering, or smoothing
  • Don’t change a scale mid-axis
  • Use only one scale on one axis
  • Use color judiciously
  • Avoid jiggling the baseline
  • Don’t distort the data; take care when selecting the encodings
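
Jittering, mentioned in the list above, adds a small amount of random noise to plotting positions so that points sharing the same value no longer sit on top of each other. A minimal sketch, assuming uniform noise; the function name, the noise amount, and the example responses are illustrative.

```python
import random


def jitter(values, amount=0.2, seed=None):
    """Add uniform noise in [-amount, +amount] to each value, for display only."""
    rng = random.Random(seed)  # a seed makes the jittered plot reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


# Hypothetical survey responses on a 1-5 scale: many points share a position.
responses = [1, 1, 1, 2, 2, 3, 3, 3, 3, 5]
plot_positions = jitter(responses, amount=0.2, seed=42)
for original, shifted in zip(responses, plot_positions):
    print(original, round(shifted, 3))
```

Jitter only the plotted positions and keep the original values for any analysis. The noise should stay small relative to the spacing between real values so that it does not itself distort the data.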

A common mistake is to use more encodings than there are dimensions in the data. If your data only has two dimensions (say, number of students in STEM by gender identity), your figure could reasonably use points, rarely area, and never volume. (I’m looking at you, 3D pie chart.)

17.8.3 Create Information-Rich Plots

Data visualizations cannot exist without text. They require context to infer meaning. Ask yourself:

  • Does the caption describe what has been graphed? Does it draw attention to the important features? Does it describe the conclusions drawn from the graph?
  • Are the legends and labels clear and intuitive?
  • Are important reference lines and points labeled?

17.8.4 Don’t Distort the Data

There’s a bestselling book called “How to Lie with Statistics”. Written in 1954 by Darrell Huff, a journalist (and not a statistician), the book focuses on how decisions we make in selecting the data and analysis method, along with errors in interpretation, can generate incorrect conclusions. Similarly, visualization principles can be misapplied when graphing, such that the takeaway message from a graphic distorts reality. Review your plots to make sure they both tell, and show, the truth.
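
One way to put a number on this kind of distortion is the "lie factor" described by Tufte (1983): the size of the effect shown in the graphic divided by the size of the effect in the data. The sketch below applies it to two bars drawn from a truncated axis. The function names and the example values are assumptions for illustration, and the code assumes the two values differ and the first bar has nonzero height.

```python
def relative_change(first, second):
    """Change from first to second, as a fraction of first."""
    return (second - first) / first


def lie_factor(first, second, baseline=0.0):
    """Tufte's lie factor for two bars drawn from a given axis baseline.

    A value near 1 means the drawn bars change as much as the data do.
    """
    shown = relative_change(first - baseline, second - baseline)
    actual = relative_change(first, second)
    return shown / actual


# Hypothetical values: a 10% increase from 50 to 55.
print(lie_factor(50, 55, baseline=0))    # bars start at zero
print(lie_factor(50, 55, baseline=45))   # axis truncated at 45
```

With the axis starting at 45, the second bar is drawn twice as tall as the first even though the data rose by only 10%, so the graphic exaggerates the change tenfold.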

17.8.5 Practice

Just as an author edits before publishing the novel, and an artist sketches before making the masterpiece, plotting is an iterative process. Proofread for clarity and consistency. Check whether your plots pass the expressiveness and effectiveness tests. Does a viewer draw the same conclusions from the figure that you do?

Here’s a cheat sheet and checklist to help you design and improve your data visualizations. Happy plotting!

Websites:

Articles:

  • Mackinlay, Jock. 1986. Automating the design of graphical presentations of relational information. ACM Transactions on Graphics. https://doi.org/10.1145/22949.22950
  • Cleveland, William S. & Kleiner, Beat. 1975. A Graphical Technique for Enhancing Scatterplots with Moving Statistics. In Proceedings of the Annual Meeting. Atlanta, GA.
  • Fisher, Ronald Aylmer. 1925. Theory of Statistical Estimation. Proceedings of the Cambridge Philosophical Society. 22. 700-725.

Contemporary books and chapters:

  • Cleveland, William. 1994. The elements of graphing data, 2nd edition. Hobart Press.
  • Drucker, J. 2014. Graphesis: Visual Forms of Knowledge Production. Harvard UP. Cambridge, MA.
  • Friendly, M. 2007. A Brief History of Data Visualization. In Handbook of Computational Statistics: Data Visualization. III. Springer-Verlag. Heidelberg. 1-34.
  • Munzner, Tamara. 2014. Visualization analysis and design.
  • Huff, Darrell. 1954. How to Lie with Statistics. W. W. Norton & Company. New York.
  • Tufte, Edward R. 1983. The Visual Display of Quantitative Information. Graphics Press. Cheshire, CT.
  • Wainer, Howard. 2007. Graphic discovery: a trout in the milk and other visual adventures.
  • Wilkinson, Leland. 2005. The Grammar of Graphics, 2nd ed. Springer. New York.
  • Yau, Nathan. Visualize This: The FlowingData Guide to Design, Visualization, and Statistics.

Historical books:

  • Bertin, Jacques. 1983. Semiology of Graphics. University of Wisconsin Press. Madison, WI. (Originally published 1967; trans. W. Berg.)
  • Descartes, René. 1637. La Géométrie. In Discours de la Méthode. Essellier. Paris.
  • Minard, Charles Joseph. 1861. Des Tableaux Graphiques et des Cartes Figuratives. E. Thunot et Cie. Paris.
  • Playfair, William. 1786. Commercial and Political Atlas: Representing, by Copper-Plate Charts, the Progress of the Commerce, Revenues, Expenditure, and Debts of England, during the Whole of the Eighteenth Century. Corry. London.
  • Snow, John. 1855. On the Mode of Communication of Cholera. (n.p.). London.
  • Tukey, John Wilder. 1977. Exploratory Data Analysis. Addison-Wesley. Reading, MA.
  • Tukey, John Wilder. 1960. A survey of sampling from contaminated distributions. In Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling (I. Olkin et al., eds.) 448–485. Stanford Univ. Press.

R graphics references: