3. Exploring Data#
Now that you have a solid foundation in the basic functions and data structures of Python, you can move on to using it for data analysis. In this chapter, you’ll learn how to efficiently explore and summarize data with visualizations and statistics. Along the way, you’ll also learn how to apply functions across entire sets of data in Pandas DataFrames and Series.
Learning Objectives
Describe how Python iterates over data
Write loops to do things repeatedly
Write list comprehensions to do things repeatedly
Use Pandas aggregation methods to explore a data set
Prepare data for visualization
Describe the grammar of graphics
Use the grammar of graphics to produce a plot
Identify where to go to learn more about making effective visualizations
3.1. Setup#
3.1.1. Packages#
As in the last chapter, you will be working with two primary packages: NumPy and Pandas. Later, you will load another set of packages to visualize your data.
import numpy as np
import pandas as pd
3.1.2. Data#
We will continue working with the banknotes data set. Once you’ve imported your packages, load this data in as well.
banknotes = pd.read_csv("data/banknotes.csv")
You’re now ready to go.
3.2. Iterating Over Data#
Before we go into data exploration in full, it’s important to understand how Python/Pandas computes summary statistics about a data set. Section 1.7.2 introduced column-wise operations in Pandas; you will learn more of them below. These operations are a convenient and efficient way to compute multiple results at once, and with only a few lines of code.
Under the hood, Pandas has to iterate over each value in a column to perform operations like .mean or .min. We can do this too, using a for-loop.
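To make the idea concrete, here is a minimal sketch of what computing a mean by iteration could look like. The small Series `values` is made up for illustration and is not part of the banknotes data. Pandas' real implementation is compiled and much faster, so this only illustrates the logic.

```python
import pandas as pd

# Hypothetical values, invented for this illustration
values = pd.Series([100, 50, 20, 10])

# Accumulate a running total and a count, one element per iteration
total = 0
count = 0
for value in values:
    total += value
    count += 1

manual_mean = total / count

# The manual result should match the built-in method
print(manual_mean)
print(values.mean())
```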
3.2.1. For-Loops#
For-loops iterate over some object and compute something for each element. Each one of these computations is one iteration. A for-loop begins with the for keyword, followed by:
A placeholder variable, which will be automatically assigned to an element at the beginning of each iteration
The in keyword
An object with elements
A colon :
Code in the body of the loop must be indented. The convention is 4 spaces per level.
For example, to print out all the column names in banknotes.columns, you can write:
for column in banknotes.columns:
    print(column)
currency_code
country
currency_name
name
gender
bill_count
profession
known_for_being_first
current_bill_value
prop_total_bills
first_appearance_year
death_year
comments
hover_text
has_portrait
id
scaled_bill_value
Within the indented part of a for-loop, you can compute values, check conditions, etc.
for value in banknotes["bill_count"]:
    if value < 1:
        print(value)
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.33
0.33
0.33
0.33
0.33
0.33
0.33
0.33
0.33
0.33
0.33
0.33
0.33
0.33
0.33
0.25
0.25
0.25
0.25
0.33
0.33
0.33
0.5
0.5
0.33
0.5
0.33
0.33
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.33
0.33
0.33
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.33
0.33
0.33
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
Oftentimes you want to save the results of the code you run within a for-loop. The easiest way to do this is by creating an empty list and using append to add values to it.
result = []
for value in banknotes["current_bill_value"]:
    if value % 25 == 0:
        result.append(value)
result
[100,
100,
50,
50,
50,
100,
100,
1000,
500,
200,
100,
50,
50,
50,
100,
200,
100,
50,
200,
200,
100,
50,
100,
5000,
1000,
10000,
20000,
2000,
50000,
20000,
2000,
10000,
100000,
1000,
20000,
5000,
50000,
2000,
10000,
10000,
5000,
20000,
50000,
2000,
1000,
2000,
5000,
1000,
200,
500,
5000,
2000,
1000,
500,
200,
100,
200,
200,
200,
500,
2000,
100,
2000,
100,
100,
500,
50,
50,
200,
50,
100,
500,
100000,
5000,
100000,
10000,
50000,
20000,
50000,
20000,
2000,
1000,
2000,
10000,
200,
100,
50,
2000,
500,
10000,
5000,
1000,
500,
1000,
100,
50,
500,
1000,
10000,
5000,
5000,
500,
200,
100,
50,
1000,
5000,
1000,
50000,
10000,
1000,
200,
50,
500,
2000,
100,
500,
500,
1000,
1000,
1000,
500,
200,
200,
100,
100,
500,
1000,
100,
200,
1000,
50,
100,
100,
50,
100,
200,
50,
500,
100,
500,
200,
50,
1000,
1000,
1000,
50,
100,
5000,
2000,
100,
1000,
500,
200,
50,
200,
500,
100,
50,
1000,
100000,
200,
500,
1000,
5000,
10000,
20000,
50000,
50,
100,
200,
50,
100,
200,
50,
1000,
200,
100,
500,
50,
100,
1000,
100,
500,
200,
50,
2000,
500,
50,
200,
100,
50,
100,
200]
3.2.2. List Comprehensions#
A more efficient and succinct way to perform certain append operations is with a list comprehension. A list comprehension is very similar to a for-loop, but it automatically creates a new list based on what your iterations do. This means you do not need to create an empty list ahead of time.
The syntax for a list comprehension includes the keywords for and in, just like a for-loop. The difference is that in the list comprehension, the repeated code comes before the for keyword rather than after it, and the entire expression is enclosed in square brackets [ ].
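As a small sketch of this correspondence, the following compares a for-loop that appends to a list with the equivalent list comprehension. The list `bill_values` is invented for illustration.

```python
# Hypothetical values, invented for this illustration
bill_values = [100, 50, 20, 10]

# For-loop version: create an empty list, then append in the loop body
doubled_loop = []
for value in bill_values:
    doubled_loop.append(value * 2)

# List comprehension version: the repeated code comes before `for`
doubled_comp = [value * 2 for value in bill_values]

# Both approaches build the same list
print(doubled_loop == doubled_comp)
```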
Here’s a list comprehension that divides each value in the current_bill_value column by 2:
[value / 2 for value in banknotes["current_bill_value"]]
[50.0,
50.0,
25.0,
10.0,
5.0,
25.0,
5.0,
10.0,
5.0,
25.0,
50.0,
50.0,
10.0,
2.5,
10.0,
500.0,
250.0,
100.0,
50.0,
25.0,
5.0,
2.5,
1.0,
25.0,
25.0,
5.0,
50.0,
100.0,
50.0,
25.0,
5.0,
5.0,
10.0,
100.0,
10.0,
100.0,
50.0,
10.0,
2.5,
25.0,
50.0,
5.0,
5.0,
5.0,
5.0,
5.0,
10.0,
2500.0,
500.0,
5000.0,
10000.0,
1000.0,
25000.0,
10000.0,
1000.0,
5000.0,
50000.0,
500.0,
10000.0,
2500.0,
25000.0,
1000.0,
5000.0,
5000.0,
2500.0,
10000.0,
25000.0,
1000.0,
500.0,
1000.0,
2500.0,
500.0,
100.0,
250.0,
2500.0,
1000.0,
500.0,
250.0,
100.0,
50.0,
100.0,
100.0,
100.0,
250.0,
1000.0,
50.0,
1000.0,
50.0,
50.0,
10.0,
250.0,
2.5,
25.0,
10.0,
5.0,
2.5,
5.0,
10.0,
25.0,
2.5,
1.0,
100.0,
0.5,
5.0,
10.0,
25.0,
50.0,
250.0,
50000.0,
2500.0,
50000.0,
5000.0,
25000.0,
10000.0,
25000.0,
10000.0,
1000.0,
500.0,
1000.0,
5000.0,
100.0,
50.0,
25.0,
10.0,
1000.0,
250.0,
5000.0,
2500.0,
500.0,
250.0,
500.0,
50.0,
25.0,
250.0,
500.0,
5000.0,
2500.0,
2500.0,
250.0,
100.0,
10.0,
50.0,
25.0,
500.0,
2500.0,
500.0,
25000.0,
5000.0,
500.0,
100.0,
10.0,
25.0,
250.0,
1000.0,
50.0,
250.0,
250.0,
500.0,
500.0,
500.0,
250.0,
10.0,
100.0,
100.0,
50.0,
50.0,
10.0,
250.0,
500.0,
5.0,
50.0,
2.5,
10.0,
100.0,
500.0,
2.5,
25.0,
50.0,
5.0,
10.0,
50.0,
10.0,
5.0,
5.0,
25.0,
50.0,
100.0,
25.0,
250.0,
50.0,
250.0,
100.0,
10.0,
25.0,
500.0,
500.0,
500.0,
0.5,
2.5,
5.0,
10.0,
25.0,
50.0,
2500.0,
1000.0,
50.0,
500.0,
250.0,
100.0,
25.0,
5.0,
10.0,
100.0,
250.0,
10.0,
50.0,
25.0,
500.0,
50000.0,
100.0,
250.0,
500.0,
2500.0,
5000.0,
10000.0,
25000.0,
5.0,
10.0,
5.0,
2.5,
2.5,
5.0,
10.0,
25.0,
50.0,
100.0,
5.0,
2.5,
25.0,
10.0,
50.0,
100.0,
25.0,
500.0,
10.0,
100.0,
50.0,
250.0,
5.0,
2.5,
1.0,
0.5,
25.0,
1.0,
2.5,
0.5,
10.0,
50.0,
5.0,
500.0,
50.0,
10.0,
250.0,
100.0,
25.0,
1000.0,
250.0,
25.0,
1.0,
100.0,
10.0,
50.0,
5.0,
2.5,
5.0,
10.0,
25.0,
50.0,
100.0]
List comprehensions can optionally include the if keyword and a condition at the end, to filter out some elements of the list:
[year for year in banknotes["first_appearance_year"] if year > 2012]
[2018,
2018,
2018,
2018,
2018,
2018,
2018,
2018,
2019,
2018,
2019,
2018,
2018,
2017,
2018,
2017,
2017,
2015,
2015,
2015,
2015,
2014,
2014,
2014,
2014,
2014,
2014,
2016,
2021,
2020,
2017,
2016,
2016,
2016,
2016,
2016,
2016,
2015,
2017,
2014,
2017,
2013,
2020,
2020,
2021,
2021,
2015,
2016,
2015,
2015,
2015,
2015,
2013,
2020,
2017,
2013,
2013,
2019,
2015,
2018,
2018,
2018,
2018]
This is similar to subsetting in Pandas.
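To illustrate the comparison, here is a sketch that filters the same values with a list comprehension and with a Pandas boolean subset. The Series `years` is a made-up stand-in for the first_appearance_year column.

```python
import pandas as pd

# Hypothetical values, invented for this illustration
years = pd.Series([1996, 2014, 2018, 2009, 2020])

# List comprehension with a condition: returns a plain Python list
recent_list = [year for year in years if year > 2012]

# Pandas subsetting with a boolean condition: returns a Series
recent_series = years[years > 2012]

print(recent_list)
print(recent_series.tolist())
```

The main practical difference is the type of the result: the comprehension gives a list, while subsetting keeps the Pandas index and methods.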
Note that you can assign the results of a list comprehension to a new variable and then perform further computations on them:
recent_years = [year for year in banknotes["first_appearance_year"] if year > 2012]
np.median(recent_years)
np.float64(2017.0)
You can learn more about comprehensions in the official Python documentation.
3.3. Aggregate Functions#
3.3.1. Aggregating a Column#
In Section 1.7.2, you learned how to compute the mean, minimum, and maximum values from a Series. Pandas offers a more generalized way to handle these functions through its .aggregate method. This method aggregates the elements of a Series, reducing the Series to a smaller number of values (usually one value).
For example, to compute the median of all values in first_appearance_year:
banknotes["first_appearance_year"].aggregate('median')
np.float64(1996.0)
The .agg method is an alias for .aggregate. The Pandas documentation advises that you use the alias:
banknotes["first_appearance_year"].agg('median')
np.float64(1996.0)
You can pass functions to .agg in addition to names of functions:
banknotes["first_appearance_year"].agg(np.median)
/tmp/ipykernel_130502/2993909395.py:1: FutureWarning: The provided callable <function median at 0x70e120de1620> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.
banknotes["first_appearance_year"].agg(np.median)
np.float64(1996.0)
The method is particularly powerful for its ability to handle multiple functions at once, using a list. Below, we compute the mean, median, and standard deviation for current_bill_value:
banknotes["current_bill_value"].agg([np.mean, np.median, np.std])
/tmp/ipykernel_130502/3929709255.py:1: FutureWarning: The provided callable <function mean at 0x70e12178f380> is currently using Series.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
banknotes["current_bill_value"].agg([np.mean, np.median, np.std])
/tmp/ipykernel_130502/3929709255.py:1: FutureWarning: The provided callable <function median at 0x70e120de1620> is currently using Series.median. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "median" instead.
banknotes["current_bill_value"].agg([np.mean, np.median, np.std])
/tmp/ipykernel_130502/3929709255.py:1: FutureWarning: The provided callable <function std at 0x70e12178f4c0> is currently using Series.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
banknotes["current_bill_value"].agg([np.mean, np.median, np.std])
mean 4038.956989
median 100.000000
std 14336.386917
Name: current_bill_value, dtype: float64
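The FutureWarning messages above suggest passing function names as strings instead of NumPy functions. A minimal sketch of that form, using a made-up Series in place of current_bill_value:

```python
import pandas as pd

# Hypothetical values, invented for this illustration
bill_values = pd.Series([10, 20, 50, 100, 1000])

# Passing names as strings selects the corresponding Series methods
summary = bill_values.agg(["mean", "median", "std"])
print(summary)
```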
Aggregation methods can also work on multiple columns at once:
banknotes[["current_bill_value", "scaled_bill_value"]].agg(np.mean)
/tmp/ipykernel_130502/1876906337.py:1: FutureWarning: The provided callable <function mean at 0x70e12178f380> is currently using DataFrame.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
banknotes[["current_bill_value", "scaled_bill_value"]].agg(np.mean)
current_bill_value 4038.956989
scaled_bill_value 0.306058
dtype: float64
3.3.2. Aggregating within Groups#
Aggregation is especially useful when combined with grouping. The .groupby method groups rows of a DataFrame using the columns you specify. The grouping columns should generally be categories rather than decimal numbers. For example, to group the banknotes by gender and then count how many entries are in each group:
banknotes.groupby("gender").size()
gender
F 59
M 220
dtype: int64
Use bracket notation to look at a specific column for each group:
banknotes.groupby("gender")["current_bill_value"].mean()
gender
F 2062.745763
M 4568.940909
Name: current_bill_value, dtype: float64
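Grouping also combines with .agg. As a sketch, the following computes several statistics per group on a small invented DataFrame whose column names mirror the banknotes data.

```python
import pandas as pd

# Hypothetical rows, invented for this illustration
notes = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "M"],
    "current_bill_value": [10, 50, 20, 100, 60],
})

# One row per group, one column per aggregation function
by_gender = notes.groupby("gender")["current_bill_value"].agg(["mean", "min", "max"])
print(by_gender)
```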
It’s also possible to group by multiple conditions:
banknotes.groupby(["gender", "profession"]).size()
gender profession
F Activist 4
Head of Gov't 1
Monarch 8
Musician 5
Other 3
Performer 1
Politician 4
Religious figure 2
Revolutionary 9
STEM 2
Visual Artist 6
Writer 14
M Educator 4
Founder 45
Head of Gov't 42
Military 13
Monarch 10
Musician 7
Other 2
Performer 2
Politician 23
Religious figure 1
Revolutionary 19
STEM 14
Visual Artist 7
Writer 31
dtype: int64
By default, the grouping columns are moved to the index of the result. You can prevent this by setting as_index = False in .groupby:
banknotes.groupby(["gender", "profession"], as_index = False).size()
 | gender | profession | size |
---|---|---|---|
0 | F | Activist | 4 |
1 | F | Head of Gov't | 1 |
2 | F | Monarch | 8 |
3 | F | Musician | 5 |
4 | F | Other | 3 |
5 | F | Performer | 1 |
6 | F | Politician | 4 |
7 | F | Religious figure | 2 |
8 | F | Revolutionary | 9 |
9 | F | STEM | 2 |
10 | F | Visual Artist | 6 |
11 | F | Writer | 14 |
12 | M | Educator | 4 |
13 | M | Founder | 45 |
14 | M | Head of Gov't | 42 |
15 | M | Military | 13 |
16 | M | Monarch | 10 |
17 | M | Musician | 7 |
18 | M | Other | 2 |
19 | M | Performer | 2 |
20 | M | Politician | 23 |
21 | M | Religious figure | 1 |
22 | M | Revolutionary | 19 |
23 | M | STEM | 14 |
24 | M | Visual Artist | 7 |
25 | M | Writer | 31 |
Tip
You can also reset the index of a DataFrame with the .reset_index method, so that the current index levels become columns.
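A minimal sketch of .reset_index on a grouped result follows. The small DataFrame is invented for illustration, and the name argument is used here only to label the column that holds the counts.

```python
import pandas as pd

# Hypothetical rows, invented for this illustration
notes = pd.DataFrame({
    "gender": ["F", "F", "M"],
    "profession": ["Writer", "Writer", "Monarch"],
})

# The grouping columns start out in the index of the result
counts = notes.groupby(["gender", "profession"]).size()

# reset_index moves them back into ordinary columns;
# name= labels the column that holds the counts
counts_df = counts.reset_index(name="size")
print(counts_df)
```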
Leaving the grouping columns in the index is often convenient because you can easily access results for the groups you’re interested in:
grouped = banknotes.groupby(["gender", "profession"]).size()
grouped.loc[:, "Visual Artist"]
gender
F 6
M 7
dtype: int64
A few aggregation functions only make sense when used together with groups. One is the .first method, which returns the first element or row. The .first method is especially useful if all the values in a group are the same and you want to reduce the data to one row per group. For instance, the same country appears across multiple rows in our data set. With .first, you can select the corresponding currency code:
banknotes.groupby("country")["currency_code"].first()
country
Argentina ARS
Australia AUD
Bangladesh BDT
Bolivia BOB
Canada CAD
Cape Verde CVE
Chile CLP
China RMB
Colombia COP
Costa Rica CRC
Czech Republic CZK
Dominican Republic DOP
England GBP
Georgia GEL
Iceland ISK
Indonesia IDR
Israel ILS
Jamaica JMD
Japan JPY
Kyrgyzstan KGS
Malawi MWK
Mexico MXN
New Zealand NZD
Nigeria NGN
Papua New Guinea PGK
Peru PEN
Philippines PHP
Serbia RSD
South Africa ZAR
South Korea KRW
Sweden SEK
São Tomé and Príncipe STD
Tunisia TND
Turkey TRY
Ukraine UAH
United States USD
Uruguay UYU
Venezuela VES
Name: currency_code, dtype: object
3.4. Data Visualization in Python#
Image from Jake VanderPlas. See here for a version with links to all of the packages!
Creating aggregated information about a data set is often done with the intent to share your results. A data visualization is an effective medium for displaying results, and there are many ways to create one in Python. In fact, so many visualization packages are available that there is even a website dedicated to helping people decide which to use. This reader focuses on static visualization, where the visualization is a still image. Some popular packages for creating static visualizations are:
matplotlib is the foundation for most other visualization packages. matplotlib is low-level, meaning it’s flexible but even simple plots may take 5 lines of code or more. It’s good to know a little bit about matplotlib, but it probably shouldn’t be your primary visualization package. Familiarity with MATLAB makes it easier to learn matplotlib.
pandas provides built-in plotting functions, which can be convenient but are more limited than what you’ll find in dedicated visualization packages. They’re also inconsistent about the expected format of the data.
plotnine is a copy of the popular R package ggplot2. The package uses the grammar of graphics, a convenient way to describe visualizations in terms of layers. Familiarity with R’s ggplot2 or Julia’s Gadfly.jl package makes it easier to learn plotnine (and vice-versa).
seaborn is designed specifically for making statistical plots. It’s well-documented and stable.
There are also many packages available for making interactive visualizations.
This reader focuses on plotnine, so that the visualization skills you learn here will also be relevant if you end up using R or Julia. plotnine has detailed documentation. It’s also useful to look at the ggplot2 documentation and cheatsheet.
3.5. Preparing to Visualize#
Before building a visualization, you will need to do a few preparatory steps.
3.5.1. Install and Import plotnine#
Note
As of writing, conda installs plotnine 0.9 and matplotlib 3.6.1 by default. This is a problem, because plotnine 0.9 is incompatible with matplotlib >= 3.6.
Incompatibility between different versions of packages is common and exactly what conda was designed to solve. In this case, you can simply make sure that you have a slightly older version of matplotlib by running this command in the Terminal:
conda install -c conda-forge 'matplotlib<3.6'
The plotnine developers have already fixed the problem in the newest version of plotnine, so once it’s available through conda, this workaround will no longer be necessary.
While Matplotlib is included with Anaconda, plotnine is not. You will need to install the plotnine package in order to use it. Section 1.4.2 showed you how to install packages with conda via the Terminal:
conda install -c conda-forge plotnine
In Section 1.4.3, you learned how to import a module in a Python package with the import keyword. Python also provides a from keyword to import specific objects from within a module, so that you can access them without the module name as a prefix. The syntax is:
from MODULE import OBJECT
Replace MODULE with the name of the module and OBJECT with the name of the object that you want to import.
For instance, if you import the DataFrame class from Pandas using:
from pandas import DataFrame
You can then write:
df = DataFrame()
You can also use the from keyword to import all objects in a module with the wildcard character *. Generally you shouldn’t do this, because objects in the module will overwrite objects in your code if they have the same name. However, the plotnine package was designed to be imported this way:
from plotnine import *
3.5.2. Configure Jupyter#
Jupyter notebooks can display most static visualizations and some interactive visualizations. If you’re going to use visualization packages that depend on Matplotlib (such as plotnine), it’s a good idea to set up your notebook by running:
# Initialize matplotlib
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [10, 8]
The last line sets the default size of plots. You can increase the numbers to make plots larger, or decrease them to make plots smaller.
Note
In older versions of Jupyter and IPython, it was also necessary to run the special IPython command %matplotlib inline to set up a notebook for plotting. This is no longer necessary in modern versions, but you may still see people use or mention it online. You can read more about this change in this StackOverflow question.
3.5.3. Data Cleaning#
Finally, we need to do a small amount of data cleaning. The plots below will focus on two variables, death_year and scaled_bill_value. But some rows lack information for these variables, so they need to be removed. Along the way, we will ensure that the variables’ datatypes are set correctly.
Tip
When making potentially destructive changes to a data set, it’s a good idea to reassign the altered data to a new variable.
Death year
no_death = banknotes["death_year"].isin([np.nan, "-"])
to_plot = banknotes[no_death == False].copy()
to_plot["death_year"] = to_plot["death_year"].astype(int)
Scaled bill value
no_scaled = to_plot["scaled_bill_value"].isna()
to_plot = to_plot[no_scaled == False]
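One alternative way to express the same kind of cleaning is sketched below. It is not the approach used above: it relies on pd.to_numeric with errors="coerce" to turn non-numeric entries such as "-" into missing values, then drops incomplete rows with .dropna. The small DataFrame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, invented for this illustration
notes = pd.DataFrame({
    "death_year": ["1955", "-", np.nan, "2001"],
    "scaled_bill_value": [0.5, 0.2, 0.9, np.nan],
})

# Work on a copy so the original data stays untouched
cleaned = notes.copy()

# Entries that cannot be parsed as numbers (like "-") become NaN
cleaned["death_year"] = pd.to_numeric(cleaned["death_year"], errors="coerce")

# Keep only rows where both plotting variables are present
cleaned = cleaned.dropna(subset=["death_year", "scaled_bill_value"])
cleaned["death_year"] = cleaned["death_year"].astype(int)
print(cleaned)
```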
You are now ready to make a plot.
3.6. The Grammar of Graphics#
Recall that plotnine is a clone of ggplot2. The “gg” in ggplot2 stands for grammar of graphics. The idea of a grammar of graphics is that visualizations can be built up in layers. Visualizations that adhere to this grammar must have:
Data
Geometry
Aesthetics
There are also several optional layers. Here are a few:
Layer | Description |
---|---|
scales | Title, label, and axis value settings |
facets | Side-by-side plots |
guides | Axis and legend position settings |
annotations | Shapes that are not mapped to data |
coordinates | Coordinate systems (Cartesian, logarithmic, polar) |
With all this in mind, it’s time to make a plot. But what kind of plot should we make? It depends on what we want to know about the data set. Suppose we want to understand the relationship between a banknote’s value and how long ago the person on the banknote died, as well as whether this is affected by gender. One way to show this is to make a scatter plot.
3.6.1. Layer 1: Data#
The data layer determines the data set used to make the plot. plotnine is designed to work with tidy data. Tidy means:
Each observation has its own row
Each feature has its own column
Each value has its own cell
Tidy data sets are convenient in general. A later lesson will cover how to make an untidy data set tidy. Until then, we’ll take it for granted that the data sets we work with are tidy.
To set up the data layer, call the ggplot function on a DataFrame:
This returns a blank plot. We still need to add a few more layers.
3.6.2. Layer 2: Geometry#
The geometry layer determines the shape or appearance of the visual elements of the plot. In other words, the geometry layer determines what kind of plot to make: one with points, lines, boxes, or something else.
There are many different geometries available in plotnine. The package provides
a function for each geometry, always prefixed with geom_
.
To add a geometry layer to the plot, choose the geom_
function you want and
add it to the plot with the +
operator:
ggplot(to_plot) + geom_point()
---------------------------------------------------------------------------
PlotnineError Traceback (most recent call last)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/IPython/core/formatters.py:925, in IPythonDisplayFormatter.__call__(self, obj)
923 method = get_real_method(obj, self.print_method)
924 if method is not None:
--> 925 method()
926 return True
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:141, in ggplot._ipython_display_(self)
134 def _ipython_display_(self):
135 """
136 Display plot in the output of the cell
137
138 This method will always be called when a ggplot object is the
139 last in the cell.
140 """
--> 141 self._display()
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:175, in ggplot._display(self)
172 save_format = "png"
174 buf = BytesIO()
--> 175 self.save(buf, format=save_format, verbose=False)
176 display_func = get_display_function(format)
177 display_func(buf.getvalue())
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:663, in ggplot.save(self, filename, format, path, width, height, units, dpi, limitsize, verbose, **kwargs)
615 def save(
616 self,
617 filename: Optional[str | Path | BytesIO] = None,
(...)
626 **kwargs: Any,
627 ):
628 """
629 Save a ggplot object as an image file
630
(...)
661 Additional arguments to pass to matplotlib `savefig()`.
662 """
--> 663 sv = self.save_helper(
664 filename=filename,
665 format=format,
666 path=path,
667 width=width,
668 height=height,
669 units=units,
670 dpi=dpi,
671 limitsize=limitsize,
672 verbose=verbose,
673 **kwargs,
674 )
676 with plot_context(self).rc_context:
677 sv.figure.savefig(**sv.kwargs)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:612, in ggplot.save_helper(self, filename, format, path, width, height, units, dpi, limitsize, verbose, **kwargs)
609 if dpi is not None:
610 self.theme = self.theme + theme(dpi=dpi)
--> 612 figure = self.draw(show=False)
613 return mpl_save_view(figure, fig_kwargs)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:272, in ggplot.draw(self, show)
270 self = deepcopy(self)
271 with plot_context(self, show=show):
--> 272 self._build()
274 # setup
275 self.figure, self.axs = self.facet.setup(self)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:381, in ggplot._build(self)
377 layers.map_statistic(self)
379 # Prepare data in geoms
380 # e.g. from y and width to ymin and ymax
--> 381 layers.setup_data()
383 # Apply position adjustments
384 layers.compute_position(layout)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/layer.py:447, in Layers.setup_data(self)
445 def setup_data(self):
446 for l in self:
--> 447 l.setup_data()
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/layer.py:325, in layer.setup_data(self)
321 return
323 data = self.geom.setup_data(data)
--> 325 check_required_aesthetics(
326 self.geom.REQUIRED_AES,
327 set(data.columns) | set(self.geom.aes_params),
328 self.geom.__class__.__name__,
329 )
331 self.data = data
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/_utils/__init__.py:403, in check_required_aesthetics(required, present, name)
401 if missing_aes:
402 msg = "{} requires the following missing aesthetics: {}"
--> 403 raise PlotnineError(msg.format(name, ", ".join(missing_aes)))
PlotnineError: 'geom_point requires the following missing aesthetics: x, y'
This returns an error message that we’re missing aesthetics x and y. We’ll learn more about aesthetics in the next section, but this error message is especially helpful: it tells us exactly what we’re missing. When you use a geometry you’re unfamiliar with, it can be helpful to run the code for just the data and geometry layer like this, to see exactly which aesthetics need to be set.
As we’ll see later, it’s possible to add multiple geometries to a plot.
3.6.3. Layer 3: Aesthetics#
The aesthetic layer determines the relationship between the data and the geometry. Use the aesthetic layer to map features in the data to aesthetics (visual elements) of the geometry.
The aes function creates an aesthetic layer. The syntax is:
aes(AESTHETIC = FEATURE, ...)
The names of the aesthetics depend on the geometry, but some common ones are x, y, color, fill, shape, and size. There is more information about and examples of aesthetic names in the documentation.
For example, we want to put death_year on the x-axis and scaled_bill_value on the y-axis. It’s best to use scaled_bill_value here rather than current_bill_value because different countries use different scales of currency. One United States Dollar is worth approximately one hundred Japanese Yen, for example. Below, we will set the aesthetics for both of these values. Notice, however, that the aesthetic layer is not added to the plot with the + operator. Instead, it is passed as the second argument to the ggplot function:
Per-geometry Aesthetics
When you add the aesthetic layer or pass it to the ggplot function, it applies to the entire plot. You can also set an aesthetic layer individually for each geometry by passing the layer as the first argument in the geom_ function.
Tip
Enclose expressions with () to create multiline code. It would be possible to write out all of the above on one line, but this would come at the expense of readability.
Setting aesthetics per geometry is really only useful when you have multiple geometries. As an example, let’s color-code the points by gender. To do so, we need to convert gender to categorical data, which measures a qualitative category. Wrapping the column name in factor() inside the aesthetic does the conversion:
(ggplot(to_plot) +
    geom_point(aes(x = "death_year", y = "scaled_bill_value", color = "factor(gender)"))
)
Now let’s add labels to each point. To do this, we need to add another geometry:
(ggplot(to_plot,
        aes(x = "death_year", y = "scaled_bill_value", color = "factor(gender)",
            label = "name")) +
    geom_point() +
    geom_text()
)
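Where you put the aesthetics matters. In the version below, color is mapped only inside geom_text, so only the labels are colored by gender, while the points keep the default color:
(ggplot(to_plot,
        aes(x = "death_year", y = "scaled_bill_value", label = "name")) +
    geom_point() +
    geom_text(aes(color = "factor(gender)"))
)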
Constant Aesthetics
If you want to set an aesthetic to a constant value, rather than one that’s data dependent, do so in the geometry layer rather than the aesthetic layer. For instance, suppose you want to use point shape rather than color to indicate gender, and you want to make all of the points blue.
(ggplot(to_plot,
        aes(x = "death_year", y = "scaled_bill_value", shape = "factor(gender)")) +
    geom_point(color = "blue")
)
If you set an aesthetic to a constant value inside of the aesthetic layer, the results you get might not be what you expect:
(ggplot(to_plot,
        aes(x = "death_year", y = "scaled_bill_value", shape = "factor(gender)",
            color = "blue")) +
    geom_point()
)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/mapping/evaluation.py:223, in evaluate(aesthetics, data, env)
222 try:
--> 223 new_val = env.eval(col, inner_namespace=data)
224 except Exception as e:
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/mapping/_env.py:69, in Environment.eval(self, expr, inner_namespace)
68 code = _compile_eval(expr)
---> 69 return eval(
70 code, {}, StackedLookup([inner_namespace] + self.namespaces)
71 )
File <string-expression>:1
NameError: name 'blue' is not defined
The above exception was the direct cause of the following exception:
PlotnineError Traceback (most recent call last)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/IPython/core/formatters.py:925, in IPythonDisplayFormatter.__call__(self, obj)
923 method = get_real_method(obj, self.print_method)
924 if method is not None:
--> 925 method()
926 return True
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:141, in ggplot._ipython_display_(self)
134 def _ipython_display_(self):
135 """
136 Display plot in the output of the cell
137
138 This method will always be called when a ggplot object is the
139 last in the cell.
140 """
--> 141 self._display()
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:175, in ggplot._display(self)
172 save_format = "png"
174 buf = BytesIO()
--> 175 self.save(buf, format=save_format, verbose=False)
176 display_func = get_display_function(format)
177 display_func(buf.getvalue())
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:663, in ggplot.save(self, filename, format, path, width, height, units, dpi, limitsize, verbose, **kwargs)
615 def save(
616 self,
617 filename: Optional[str | Path | BytesIO] = None,
(...)
626 **kwargs: Any,
627 ):
628 """
629 Save a ggplot object as an image file
630
(...)
661 Additional arguments to pass to matplotlib `savefig()`.
662 """
--> 663 sv = self.save_helper(
664 filename=filename,
665 format=format,
666 path=path,
667 width=width,
668 height=height,
669 units=units,
670 dpi=dpi,
671 limitsize=limitsize,
672 verbose=verbose,
673 **kwargs,
674 )
676 with plot_context(self).rc_context:
677 sv.figure.savefig(**sv.kwargs)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:612, in ggplot.save_helper(self, filename, format, path, width, height, units, dpi, limitsize, verbose, **kwargs)
609 if dpi is not None:
610 self.theme = self.theme + theme(dpi=dpi)
--> 612 figure = self.draw(show=False)
613 return mpl_save_view(figure, fig_kwargs)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:272, in ggplot.draw(self, show)
270 self = deepcopy(self)
271 with plot_context(self, show=show):
--> 272 self._build()
274 # setup
275 self.figure, self.axs = self.facet.setup(self)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:362, in ggplot._build(self)
358 layout.setup(layers, self)
360 # Compute aesthetics to produce data with generalised
361 # variable names
--> 362 layers.compute_aesthetics(self)
364 # Transform data using all scales
365 layers.transform(scales)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/layer.py:457, in Layers.compute_aesthetics(self, plot)
455 def compute_aesthetics(self, plot: ggplot):
456 for l in self:
--> 457 l.compute_aesthetics(plot)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/layer.py:260, in layer.compute_aesthetics(self, plot)
253 def compute_aesthetics(self, plot: ggplot):
254 """
255 Return a dataframe where the columns match the aesthetic mappings
256
257 Transformations like 'factor(cyl)' and other
258 expression evaluation are made in here
259 """
--> 260 evaled = evaluate(self.mapping._starting, self.data, plot.environment)
261 evaled_aes = aes(**{str(col): col for col in evaled})
262 plot.scales.add_defaults(evaled, evaled_aes)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/mapping/evaluation.py:226, in evaluate(aesthetics, data, env)
224 except Exception as e:
225 msg = _TPL_EVAL_FAIL.format(ae, col, str(e))
--> 226 raise PlotnineError(msg) from e
228 try:
229 evaled[ae] = new_val
PlotnineError: "Could not evaluate the 'color' mapping: 'blue' (original error: name 'blue' is not defined)"
3.6.4. Layer 4: Scales#
The scales layer controls the title, axis labels, and axis scales of the plot. Most of the functions in the scales layer are prefixed with scale_, but not all of them.
The labs function is especially important, because it’s used to set the title and axis labels. All graphs need a title and axis labels.
3.6.5. Saving Plots#
If you assign a plot to a variable, you can use the save method or the ggsave function to save that plot to a file:
plot = (
    ggplot(to_plot,
           aes(x = "death_year", y = "scaled_bill_value", shape = "factor(gender)")) +
    geom_point() +
    labs(x = "Death Year", y = "Scaled Bill Value",
         title = "Does death year affect bill value?", shape = "Gender")
)
ggsave(plot, "myplot.pdf")
The file format is selected automatically based on the extension. Common formats are PNG and PDF.
3.6.6. Example: Bar Plot#
Now suppose you want to plot the number of banknotes featuring people from each profession in the banknotes data set. A bar plot is an appropriate way to represent this visually.
The geometry for a bar plot is geom_bar. Since bar plots are mainly used to display frequencies, the geom_bar function automatically computes frequencies for the feature mapped to x, written here with the factor() syntax from above.
We can also use a fill color to further break down the bars by gender. Here’s the code to make the bar plot:
(ggplot(to_plot,
        aes(x = "factor(profession)", fill = "factor(gender)")) +
    geom_bar(position = "dodge") +
    theme(axis_text_x = element_text(rotation = 90))
)
The setting position = "dodge" instructs geom_bar to put the bars side-by-side rather than stacking them. Adding theme allows you to change how the axis labels and ticks are formatted; here it rotates the x-axis labels by 90 degrees.
In some cases, you may want to make a bar plot with frequencies you’ve already computed. To prevent geom_bar from computing frequencies automatically, set stat = "identity".
3.6.7. Visualization Design#
Designing high-quality visualizations goes beyond just mastering which Python functions to call. You also need to think carefully about what kind of data you have and what message you want to convey. This section provides a few guidelines.
The first step in data visualization is choosing an appropriate kind of plot. Here are some suggestions (not rules):
Feature 1 | Feature 2 | Plot
---|---|---
categorical | | bar, dot
categorical | categorical | bar, dot, mosaic
numerical | | box, density, histogram
numerical | categorical | box, density, ridge
numerical | numerical | line, scatter, smooth scatter
If you want to add a:
3rd numerical feature, use it to change point/line size
3rd categorical feature, use it to change point/line style
4th categorical feature, use side-by-side plots
Once you’ve selected a plot, here are some rules you should almost always follow:
Always add a title and axis labels. These should be descriptive, not variable names!
Specify units after the axis label if the axis has units. For instance, “Height (ft)”
Don’t forget that many people are colorblind! Also, plots are often printed in black and white. Use point and line styles to distinguish groups; color is optional
Add a legend whenever you’ve used more than one point or line style
Always write a few sentences explaining what the plot shows. Don’t describe the plot, because the reader can just look at it. Instead, explain what they can learn from the plot and point out important details that may be overlooked
For side-by-side plots, use the same axis scales for both plots so that comparing them is not deceptive
Visualization design is a deep topic, and whole books have been written about it. One resource where you can learn more is DataLab’s Principles of Data Visualization Workshop Reader.
3.7. Exercises#
3.7.1. Exercise#
Compute the number of banknotes that feature a person who died before 1990.
Of those people, how many were activists?
3.7.2. Exercise#
Compute the range of first_appearance_year for each country. Hint: this would be a good place to try out a multi-function aggregation…
How many unique values are there among the first and last values of first_appearance_year?
3.7.3. Exercise#
Compute the set of banknotes that feature people who died in this century.
Use plotnine’s geom_segment function to create a plot which shows the timespan between death year and first appearance as a horizontal segment for each banknote. Put the name of each person on the y-axis. Color code the segments by gender. Hint: you can make the plot more visually appealing if you first sort by death year. You can use the .sort_values method to sort a DataFrame on a column, or set of columns. Be aware that the default parameter for one of the arguments is probably not what you’re expecting.