3. Exploring Data

Now that you have a solid foundation in the basic functions and data structures of Python, you can move on to using it for data analysis. In this chapter, you’ll learn how to efficiently explore and summarize data with visualizations and statistics. Along the way, you’ll also learn how to apply functions across entire sets of data in Pandas DataFrames and Series.

Learning Objectives

Describe how Python iterates over data
Write loops to do things repeatedly
Write list comprehensions to do things repeatedly
Use Pandas aggregation methods to explore a data set
Prepare data for visualization
Describe the grammar of graphics
Use the grammar of graphics to produce a plot
Identify where to go to learn more about making effective visualizations

3.1. Setup

3.1.1. Packages

As in the last chapter, you will be working with two primary packages: NumPy and Pandas. Later, you will load another set of packages to visualize your data.

import numpy as np
import pandas as pd

3.1.2. Data

We will continue working with the banknotes data set. Once you’ve imported your packages, load this data in as well.

banknotes = pd.read_csv("data/banknotes.csv")

You’re now ready to go.

3.2. Iterating Over Data

Before we go into data exploration in full, it’s important to understand how Python/Pandas computes summary statistics about a data set. Section 1.7.2 introduced column-wise operations in Pandas; you will learn more of them below. These operations are a convenient and efficient way to compute multiple results at once, and with only a few lines of code.

Under the hood, Pandas has to iterate over each value in a column to perform operations like .mean or .min. We can do this too, using a for-loop.

3.2.1. For-Loops

For-loops iterate over some object and compute something for each element. Each one of these computations is one iteration. A for-loop begins with the for keyword, followed by:

A placeholder variable, which will be automatically assigned to an element at the beginning of each iteration
The in keyword
An object with elements
A colon :

Code in the body of the loop must be indented. The convention is to indent by 4 spaces.

For example, to print out all the column names in banknotes.columns, you can write:

for column in banknotes.columns:
    print(column)

currency_code
country
currency_name
name
gender
bill_count
profession
known_for_being_first
current_bill_value
prop_total_bills
first_appearance_year
death_year
comments
hover_text
has_portrait
id
scaled_bill_value

Within the indented part of a for-loop, you can compute values, check conditions, etc.

for value in banknotes["bill_count"]:
    if value < 1:
        print(value)

Often you will want to save the results of the code that runs within a for-loop. The easiest way to do this is to create an empty list and use the append method to add values to it.

result = []
for value in banknotes["current_bill_value"]:
    if value % 25 == 0:
        result.append(value)

result
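
Conceptually, a summary method such as .mean or .min follows the same pattern: visit each value once and keep track of a result along the way. The sketch below illustrates the idea with a for-loop over a short, made-up list (not the banknotes data). It is only a teaching example; Pandas itself relies on faster, optimized code.

```python
# Made-up values, standing in for a column of data
values = [100, 50, 20, 10, 20]

total = 0
smallest = values[0]
for value in values:
    total = total + value   # keep a running sum
    if value < smallest:    # remember the smallest value seen so far
        smallest = value

mean = total / len(values)
print(mean)      # 40.0
print(smallest)  # 10
```

The built-in functions sum and min give the same answers with far less code, which is the point of using prewritten aggregation functions.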

3.2.2. List Comprehensions

A more efficient and succinct way to perform certain append operations is with a list comprehension. A list comprehension is very similar to a for-loop, but it automatically creates a new list based on what your iterations do. This means you do not need to create an empty list ahead of time.

The syntax for a list comprehension includes the keywords for and in, just like a for-loop. The difference is that in the list comprehension, the repeated code comes before the for keyword rather than after it, and the entire expression is enclosed in square brackets [ ].
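
To make the comparison concrete, here is a small sketch that writes the same computation both ways. The list prices and its values are made up for illustration.

```python
prices = [10, 20, 50]  # made-up values for illustration

# For-loop version: create an empty list, then append to it
doubled_loop = []
for price in prices:
    doubled_loop.append(price * 2)

# List comprehension version: the repeated code (price * 2)
# comes before the for keyword, all inside square brackets
doubled_comp = [price * 2 for price in prices]

print(doubled_loop)  # [20, 40, 100]
print(doubled_comp)  # [20, 40, 100]
```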

Here’s a list comprehension that divides each value in the current_bill_value column by 2:

[value / 2 for value in banknotes["current_bill_value"]]

[50.0,
 ...]

List comprehensions can optionally include the if keyword and a condition at the end, to filter out some elements of the list:

[year for year in banknotes["first_appearance_year"] if year > 2012]

This is similar to subsetting in Pandas.
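
As a rough illustration of that similarity, the sketch below filters a small, made-up Series of years both ways. The variable names are chosen only for this example.

```python
import pandas as pd

# Made-up years, standing in for first_appearance_year
years = pd.Series([1996, 2015, 2009, 2018])

# Filter with a list comprehension (the result is a plain Python list)
recent_list = [year for year in years if year > 2012]

# Filter by subsetting with a Boolean condition (the result is a Series)
recent_series = years[years > 2012]

print(recent_list)
print(recent_series.tolist())
```

Both approaches keep the same values. The main difference is the type of the result: the comprehension returns a list, while subsetting returns a Series that keeps its original index.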

Note that you can assign the results of a list comprehension to a new variable and then perform further computations on them:

recent_years = [year for year in banknotes["first_appearance_year"] if year > 2012]
np.median(recent_years)

np.float64(2017.0)

You can learn more about comprehensions in the official Python documentation.

3.3. Aggregate Functions

3.3.1. Aggregating a Column

In Section 1.7.2, you learned how to compute the mean, minimum, and maximum values from a Series. Pandas offers a more generalized way to handle these functions through its .aggregate method. This method aggregates the elements of a Series, reducing the Series to a smaller number of values (usually one value).

For example, to compute the median of all values in first_appearance_year:

banknotes["first_appearance_year"].aggregate('median')

np.float64(1996.0)

The .agg method is an alias for .aggregate. The Pandas documentation advises that you use the alias:

banknotes["first_appearance_year"].agg('median')

np.float64(1996.0)

You can pass functions to .agg in addition to names of functions:

banknotes["first_appearance_year"].agg(np.median)

np.float64(1996.0)

Recent versions of Pandas raise a FutureWarning when you pass a NumPy function this way. For now, Pandas substitutes its own Series.median method, but a future version will call the function you provide directly. To keep the current behavior, pass the string "median" instead.

The method is particularly powerful for its ability to handle multiple functions at once, using a list. Below, we compute the mean, median, and standard deviation for current_bill_value:

banknotes["current_bill_value"].agg(["mean", "median", "std"])

mean       4038.956989
median      100.000000
std       14336.386917
Name: current_bill_value, dtype: float64

Aggregation methods can also work on multiple columns at once:

banknotes[["current_bill_value", "scaled_bill_value"]].agg("mean")

current_bill_value    4038.956989
scaled_bill_value        0.306058
dtype: float64

3.3.2. Aggregating within Groups

Aggregation is especially useful when combined with grouping. The .groupby method groups rows of a DataFrame using the columns you specify. The grouping columns should generally be categories rather than decimal numbers. For example, to group the banknotes by gender and then count how many entries are in each group:

banknotes.groupby("gender").size()

gender
F     59
M    220
dtype: int64

Use bracket notation to look at a specific column for each group:

banknotes.groupby("gender")["current_bill_value"].mean()

gender
F    2062.745763
M    4568.940909
Name: current_bill_value, dtype: float64
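
Grouping also combines with the .agg method from the previous section. The following is a minimal sketch on a made-up DataFrame (the values are invented, and only the column names mirror the banknotes data) that computes two statistics per group in one call.

```python
import pandas as pd

# Made-up data that reuses two column names from the banknotes data set
toy = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "current_bill_value": [10, 20, 50, 100],
})

# One row per group, one column per aggregation function
summary = toy.groupby("gender")["current_bill_value"].agg(["mean", "max"])
print(summary)
```

The result is a DataFrame with the groups in the index and one column for each function name in the list.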

It’s also possible to group by multiple conditions:

banknotes.groupby(["gender", "profession"]).size()

gender  profession      
F       Activist             4
        Head of Gov't        1
        Monarch              8
        Musician             5
        Other                3
        Performer            1
        Politician           4
        Religious figure     2
        Revolutionary        9
        STEM                 2
        Visual Artist        6
        Writer              14
M       Educator             4
        Founder             45
        Head of Gov't       42
        Military            13
        Monarch             10
        Musician             7
        Other                2
        Performer            2
        Politician          23
        Religious figure     1
        Revolutionary       19
        STEM                14
        Visual Artist        7
        Writer              31
dtype: int64

By default, the grouping columns are moved to the index of the result. You can prevent this by setting as_index = False in .groupby:

banknotes.groupby(["gender", "profession"], as_index = False).size()

	gender	profession	size
0	F	Activist	4
1	F	Head of Gov't	1
2	F	Monarch	8
3	F	Musician	5
4	F	Other	3
5	F	Performer	1
6	F	Politician	4
7	F	Religious figure	2
8	F	Revolutionary	9
9	F	STEM	2
10	F	Visual Artist	6
11	F	Writer	14
12	M	Educator	4
13	M	Founder	45
14	M	Head of Gov't	42
15	M	Military	13
16	M	Monarch	10
17	M	Musician	7
18	M	Other	2
19	M	Performer	2
20	M	Politician	23
21	M	Religious figure	1
22	M	Revolutionary	19
23	M	STEM	14
24	M	Visual Artist	7
25	M	Writer	31

Tip

You can also use the .reset_index method to reset the index of a DataFrame, so that the current index levels become columns.
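
Here is a short sketch of .reset_index applied to grouped counts. The data are made up, and the name argument is used only to label the column of counts.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "gender": ["F", "F", "M"],
    "profession": ["Writer", "Writer", "Monarch"],
})

# The grouping columns end up in the index of the result
counts = toy.groupby(["gender", "profession"]).size()

# reset_index moves both index levels back into columns;
# name= labels the column that holds the counts
flat = counts.reset_index(name="size")
print(flat)
```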

Leaving the grouping columns in the index is often convenient because you can easily access results for the groups you’re interested in:

grouped = banknotes.groupby(["gender", "profession"]).size()

grouped.loc[:, "Visual Artist"]

gender
F    6
M    7
dtype: int64
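
The same .loc syntax supports a few other access patterns when the index has two levels. The sketch below uses a small made-up DataFrame to show three of them.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "M"],
    "profession": ["Writer", "Monarch", "Writer", "Writer", "Monarch"],
})
grouped = toy.groupby(["gender", "profession"]).size()

print(grouped.loc["F"])              # every profession for one gender
print(grouped.loc[("M", "Writer")])  # one specific group
print(grouped.loc[:, "Writer"])      # one profession, every gender
```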

A few aggregation functions only make sense when used together with groups. One is the .first method, which returns the first element or row. The .first method is especially useful if all the values in a group are the same and you want to reduce the data to one row per group. For instance, the same country appears across multiple rows in our data set. With .first, you can select the corresponding currency code:

banknotes.groupby("country")["currency_code"].first()

country
Argentina                 ARS
Australia                 AUD
Bangladesh                BDT
Bolivia                   BOB
Canada                    CAD
Cape Verde                CVE
Chile                     CLP
China                     RMB
Colombia                  COP
Costa Rica                CRC
Czech Republic            CZK
Dominican Republic        DOP
England                   GBP
Georgia                   GEL
Iceland                   ISK
Indonesia                 IDR
Israel                    ILS
Jamaica                   JMD
Japan                     JPY
Kyrgyzstan                KGS
Malawi                    MWK
Mexico                    MXN
New Zealand               NZD
Nigeria                   NGN
Papua New Guinea          PGK
Peru                      PEN
Philippines               PHP
Serbia                    RSD
South Africa              ZAR
South Korea               KRW
Sweden                    SEK
São Tomé and Príncipe     STD
Tunisia                   TND
Turkey                    TRY
Ukraine                   UAH
United States             USD
Uruguay                   UYU
Venezuela                VES​
Name: currency_code, dtype: object

3.4. Data Visualization in Python

A network of Python visualization packages.

Image from Jake VanderPlas. See here for a version with links to all of the packages!

Creating aggregated information about a data set is often done with the intent to share your results. A data visualization is an effective medium for displaying results, and there are many ways to create one in Python. In fact, so many visualization packages are available that there is even a website dedicated to helping people decide which to use. This reader focuses on static visualization, where the visualization is a still image. Some popular packages for creating static visualizations are:

matplotlib is the foundation for most other visualization packages. matplotlib is low-level, meaning it’s flexible but even simple plots may take 5 lines of code or more. It’s good to know a little bit about matplotlib, but it probably shouldn’t be your primary visualization package. Familiarity with MATLAB makes it easier to learn matplotlib.
pandas provides built-in plotting functions, which can be convenient but are more limited than what you’ll find in dedicated visualization packages. They’re also inconsistent about the expected format of the data.
plotnine is a copy of the popular R package ggplot2. The package uses the grammar of graphics, a convenient way to describe visualizations in terms of layers. Familiarity with R’s ggplot2 or Julia’s Gadfly.jl package makes it easier to learn plotnine (and vice-versa).
seaborn is designed specifically for making statistical plots. It’s well-documented and stable.

There are also many packages available for making interactive visualizations.

This reader focuses on plotnine, so that the visualization skills you learn here will also be relevant if you end up using R or Julia. plotnine has detailed documentation. It’s also useful to look at the ggplot2 documentation and cheatsheet.

3.5. Installing Packages

While Matplotlib is included with Anaconda, plotnine is not. You will need to install the plotnine package in order to use it.

You can use conda, a standalone program included with Anaconda, to install packages. To get started, you need to open a terminal, a text interface for running programs on your computer. In JupyterLab, you can open a terminal with the menu option File -> New -> Terminal.

Caution

A terminal looks a lot like a Python console, but doesn’t accept Python code as input! To learn more about terminals than we cover here, see DataLab’s Introduction to the Unix Command Line workshop reader.

The command to install a package called PACKAGE is:

conda install -c conda-forge PACKAGE

So you can install plotnine with the command:

conda install -c conda-forge plotnine

Note

Conda downloads packages from online package repositories. The default repository is maintained by Anaconda, but the packages there tend to be slightly out of date. The community maintains another repository, called conda-forge, that’s updated more frequently and has a wider variety of packages.

The -c conda-forge tells conda to use the conda-forge package repository.

You can learn more about conda from the official website and DataLab’s Intermediate Python workshop reader.

After installing plotnine, close the terminal and go back to your Python console or notebook.
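
If you want to confirm that the installation worked before going further, one option is to ask Python whether it can locate the package. This is a minimal sketch that uses the standard library's importlib module; it checks only that the package can be found, not that it works correctly.

```python
import importlib.util

# find_spec returns None when Python cannot locate the package
spec = importlib.util.find_spec("plotnine")

if spec is None:
    print("plotnine is not installed in this environment")
else:
    print("plotnine is installed")
```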

3.6. Preparing to Visualize

Before building a visualization, you will need to do a few preparatory steps.

3.6.1. Import plotnine

In Section 1.4.2, you learned how to import a module in a Python package with the import keyword. Python also provides a from keyword to import specific objects from within a module, so that you can access them without the module name as a prefix. The syntax is:

from MODULE import OBJECT

Replace MODULE with the name of the module and OBJECT with the name of the object that you want to import.

For instance, if you import the DataFrame class from Pandas using:

from pandas import DataFrame

You can then write:

df = DataFrame()

You can also use the from keyword to import all objects in a module with the wildcard character *. Generally you shouldn’t do this, because objects in a module will overwrite objects in your code if they have the same name. However, the plotnine package was designed to be imported this way:

from plotnine import *

3.6.2. Configure Jupyter

Jupyter notebooks can display most static visualizations and some interactive visualizations. If you’re going to use visualization packages that depend on Matplotlib (such as plotnine), it’s a good idea to set up your notebook by running:

# Initialize matplotlib
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = [10, 8]

The last line sets the default size of plots (width and height, in inches). You can increase the numbers to make plots larger, or decrease them to make plots smaller.

Note

In older versions of Jupyter and IPython, it was also necessary to run the special IPython command %matplotlib inline to set up a notebook for plotting. This is no longer necessary in modern versions, but you may still see people use or mention it online. You can read more about this change in this StackOverflow question.

3.6.3. Data Cleaning

Finally, we need to do a small amount of data cleaning. The plots below will focus on two variables, death_year and scaled_bill_value. But some rows lack information for these variables, so they need to be removed. Along the way, we will ensure that the variables’ data types are set correctly.

Tip

When making potentially destructive changes to a data set, it’s a good idea to reassign the altered data to a new variable.

Death year

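# Flag rows where the death year is missing (NaN) or recorded as "-"
no_death = banknotes["death_year"].isin([np.nan, "-"])
# Keep the other rows in a new, independent DataFrame
to_plot = banknotes[~no_death].copy()
# Store the years as integers
to_plot["death_year"] = to_plot["death_year"].astype(int)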

Scaled bill value

# Drop rows that have no scaled bill value
no_scaled = to_plot["scaled_bill_value"].isna()
to_plot = to_plot[~no_scaled]
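
To see what these steps do, it can help to run them on a few rows you can check by eye. The sketch below applies the same steps to a made-up DataFrame. The values are invented for illustration, and the ~ operator negates a Boolean Series, which has the same effect as comparing it with False.

```python
import numpy as np
import pandas as pd

# Made-up rows; "-" and NaN both mean the death year is unknown
toy = pd.DataFrame({
    "death_year": ["1952", "-", np.nan, "2001", "1980"],
    "scaled_bill_value": [0.5, 0.1, 0.2, np.nan, 1.0],
})

# Remove rows without a death year, then convert the rest to integers
no_death = toy["death_year"].isin([np.nan, "-"])
cleaned = toy[~no_death].copy()
cleaned["death_year"] = cleaned["death_year"].astype(int)

# Remove rows without a scaled bill value
cleaned = cleaned[~cleaned["scaled_bill_value"].isna()]
print(cleaned)
```

Only the rows that have both a death year and a scaled bill value remain.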

You are now ready to make a plot.

3.7. The Grammar of Graphics

Recall that plotnine is a clone of ggplot2. The “gg” in ggplot2 stands for grammar of graphics. The idea of a grammar of graphics is that visualizations can be built up in layers. Visualizations that adhere to this grammar must have:

Data
Geometry
Aesthetics

There are also several optional layers. Here are a few:

Layer	Description
scales	Title, label, and axis value settings
facets	Side-by-side plots
guides	Axis and legend position settings
annotations	Shapes that are not mapped to data
coordinates	Coordinate systems (Cartesian, logarithmic, polar)

With all this in mind, it’s time to make a plot. But what kind of plot should we make? It depends on what we want to know about the data set. Suppose we want to understand the relationship between a banknote’s value and how long ago the person on the banknote died, as well as whether this is affected by gender. One way to show this is to make a scatter plot.

3.7.1. Layer 1: Data

The data layer determines the data set used to make the plot. plotnine is designed to work with tidy data. Tidy means:

Each observation has its own row
Each feature has its own column
Each value has its own cell

Tidy data sets are convenient in general. A later lesson will cover how to make an untidy data set tidy. Until then, we’ll take it for granted that the data sets we work with are tidy.

To set up the data layer, call the ggplot function on a DataFrame:

ggplot(to_plot)

[Figure: blank plot]

This returns a blank plot. We still need to add a few more layers.

3.7.2. Layer 2: Geometry

The geometry layer determines the shape or appearance of the visual elements of the plot. In other words, the geometry layer determines what kind of plot to make: one with points, lines, boxes, or something else.

There are many different geometries available in plotnine. The package provides a function for each geometry, always prefixed with geom_.

To add a geometry layer to the plot, choose the geom_ function you want and add it to the plot with the + operator:

ggplot(to_plot) + geom_point()

PlotnineError: 'geom_point requires the following missing aesthetics: y, x'

This returns an error message that we’re missing aesthetics x and y. We’ll learn more about aesthetics in the next section, but this error message is especially helpful: it tells us exactly what we’re missing. When you use a geometry you’re unfamiliar with, it can be helpful to run the code for just the data and geometry layer like this, to see exactly which aesthetics need to be set.

As we’ll see later, it’s possible to add multiple geometries to a plot.

3.7.3. Layer 3: Aesthetics

The aesthetic layer determines the relationship between the data and the geometry. Use the aesthetic layer to map features in the data to aesthetics (visual elements) of the geometry.

The aes function creates an aesthetic layer. The syntax is:

aes(AESTHETIC = FEATURE, ...)

The names of the aesthetics depend on the geometry, but some common ones are x, y, color, fill, shape, and size. The documentation has more information about aesthetic names, along with examples.

For example, we want to put death_year on the x-axis and scaled_bill_value on the y-axis. It’s best to use scaled_bill_value here rather than current_bill_value because different countries use different scales of currency. One United States Dollar is worth approximately one hundred Japanese Yen, for example. Below, we will set the aesthetics for both of these values. Notice, however, that the aesthetic layer is not added to the plot with the + operator. Instead, it is passed as the second argument to the ggplot function:

ggplot(
    to_plot,
    aes(x = "death_year", y = "scaled_bill_value")
) + geom_point()

[Figure: scatter plot of scaled_bill_value against death_year]

Per-geometry Aesthetics

When you add the aesthetic layer or pass it to the ggplot function, it applies to the entire plot. You can also set an aesthetic layer individually for each geometry by passing the layer as the first argument in the geom_ function:

(ggplot(to_plot) +
    geom_point(aes(x = "death_year", y = "scaled_bill_value"))
)

Tip

Enclose expressions with () to create multiline code. It would be possible to write out all of the above on one line, but this would come at the expense of readability.

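Setting aesthetics per geometry is really only useful when you have multiple geometries. As an example, let’s color-code the points by gender. To do so, we need to convert gender to categorical data, that is, data that record a qualitative category rather than a quantity. Wrapping the column name in factor() inside the aesthetic layer does this conversion: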

(ggplot(to_plot) +
    geom_point(aes(x = "death_year", y = "scaled_bill_value", color = "factor(gender)"))
)

../_images/3b07296ee00261a2afef33e1eabca94d709e6e54a4adbc649f550a635877f450.png

Now let’s add labels to each point. To do this, we need to add another geometry:

(ggplot(to_plot,
    aes(x = "death_year", y = "scaled_bill_value", color = "factor(gender)",
        label = "name")) +
    geom_point() + 
    geom_text()
)

../_images/57d5276b527a53a89eece48acdd642b985ed3c3351b105a28d6a14738895ac36.png

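Where you put the aesthetics matters. In the plot above, color is set in the ggplot function, so it applies to both the points and the labels. If you pass color to geom_text instead, only the labels are color-coded: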

(ggplot(to_plot,
    aes(x = "death_year", y = "scaled_bill_value", label = "name")) + 
    geom_point() + 
    geom_text(aes(color = "factor(gender)"))
)

../_images/40573cf9298787d4963c022388377bed3ce7f0244b04e4b07df32225f2328c98.png

Constant Aesthetics

If you want to set an aesthetic to a constant value, rather than one that’s data-dependent, do so in the geometry layer rather than the aesthetic layer. For instance, suppose you want to use point shape rather than color to indicate gender, and you want to make all of the points blue.

(ggplot(to_plot,
    aes(x = "death_year", y = "scaled_bill_value", shape = "factor(gender)")) +
    geom_point(color = "blue")
)

../_images/c8e2d91120bd6adcab68566cbb53cc618d63e850840ff59aa56b2f3414ab8040.png

If you set an aesthetic to a constant value inside of the aesthetic layer, the results you get might not be what you expect:

(ggplot(to_plot,
    aes(x = "death_year", y = "scaled_bill_value", shape = "factor(gender)",
        color = "blue")) +
    geom_point()
)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File ~/Teaching/workshop_python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/mapping/evaluation.py:223, in evaluate(aesthetics, data, env)
    222 try:
--> 223     new_val = env.eval(col, inner_namespace=data)
    224 except Exception as e:

File ~/Teaching/workshop_python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/mapping/_env.py:69, in Environment.eval(self, expr, inner_namespace)
     68 code = _compile_eval(expr)
---> 69 return eval(
     70     code, {}, StackedLookup([inner_namespace] + self.namespaces)
     71 )

File <string-expression>:1

NameError: name 'blue' is not defined

The above exception was the direct cause of the following exception:

PlotnineError                             Traceback (most recent call last)
File ~/Teaching/workshop_python_basics/.pixi/envs/default/lib/python3.12/site-packages/IPython/core/formatters.py:925, in IPythonDisplayFormatter.__call__(self, obj)
    923 method = get_real_method(obj, self.print_method)
    924 if method is not None:
--> 925     method()
    926     return True

File ~/Teaching/workshop_python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:141, in ggplot._ipython_display_(self)
    134 def _ipython_display_(self):
    135     """
    136     Display plot in the output of the cell
    137 
    138     This method will always be called when a ggplot object is the
    139     last in the cell.
    140     """
--> 141     self._display()

File ~/Teaching/workshop_python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:175, in ggplot._display(self)
    172     save_format = "png"
    174 buf = BytesIO()
--> 175 self.save(buf, format=save_format, verbose=False)
    176 display_func = get_display_function(format)
    177 display_func(buf.getvalue())

File ~/Teaching/workshop_python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:663, in ggplot.save(self, filename, format, path, width, height, units, dpi, limitsize, verbose, **kwargs)
    615 def save(
    616     self,
    617     filename: Optional[str | Path | BytesIO] = None,
   (...)
    626     **kwargs: Any,
    627 ):
    628     """
    629     Save a ggplot object as an image file
    630 
   (...)
    661         Additional arguments to pass to matplotlib `savefig()`.
    662     """
--> 663     sv = self.save_helper(
    664         filename=filename,
    665         format=format,
    666         path=path,
    667         width=width,
    668         height=height,
    669         units=units,
    670         dpi=dpi,
    671         limitsize=limitsize,
    672         verbose=verbose,
    673         **kwargs,
    674     )
    676     with plot_context(self).rc_context:
    677         sv.figure.savefig(**sv.kwargs)

File ~/Teaching/workshop_python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:612, in ggplot.save_helper(self, filename, format, path, width, height, units, dpi, limitsize, verbose, **kwargs)
    609 if dpi is not None:
    610     self.theme = self.theme + theme(dpi=dpi)
--> 612 figure = self.draw(show=False)
    613 return mpl_save_view(figure, fig_kwargs)

File ~/Teaching/workshop_python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:272, in ggplot.draw(self, show)
    270 self = deepcopy(self)
    271 with plot_context(self, show=show):
--> 272     self._build()
    274     # setup
    275     self.figure, self.axs = self.facet.setup(self)

File ~/Teaching/workshop_python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/ggplot.py:362, in ggplot._build(self)
    358 layout.setup(layers, self)
    360 # Compute aesthetics to produce data with generalised
    361 # variable names
--> 362 layers.compute_aesthetics(self)
    364 # Transform data using all scales
    365 layers.transform(scales)

File ~/Teaching/workshop_python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/layer.py:457, in Layers.compute_aesthetics(self, plot)
    455 def compute_aesthetics(self, plot: ggplot):
    456     for l in self:
--> 457         l.compute_aesthetics(plot)

File ~/Teaching/workshop_python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/layer.py:260, in layer.compute_aesthetics(self, plot)
    253 def compute_aesthetics(self, plot: ggplot):
    254     """
    255     Return a dataframe where the columns match the aesthetic mappings
    256 
    257     Transformations like 'factor(cyl)' and other
    258     expression evaluation are  made in here
    259     """
--> 260     evaled = evaluate(self.mapping._starting, self.data, plot.environment)
    261     evaled_aes = aes(**{str(col): col for col in evaled})
    262     plot.scales.add_defaults(evaled, evaled_aes)

File ~/Teaching/workshop_python_basics/.pixi/envs/default/lib/python3.12/site-packages/plotnine/mapping/evaluation.py:226, in evaluate(aesthetics, data, env)
    224 except Exception as e:
    225     msg = _TPL_EVAL_FAIL.format(ae, col, str(e))
--> 226     raise PlotnineError(msg) from e
    228 try:
    229     evaled[ae] = new_val

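PlotnineError: "Could not evaluate the 'color' mapping: 'blue' (original error: name 'blue' is not defined)"

The error occurs because plotnine evaluates each string in the aesthetic layer as an expression on the data. There is no column or variable named blue, so the color mapping cannot be evaluated.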

3.7.4. Layer 4: Scales#

The scales layer controls the title, axis labels, and axis scales of the plot. Most of the functions in the scales layer are prefixed with scale_, but not all of them.

The labs function is especially important, because it’s used to set the title and axis labels. All graphs need a title and axis labels.

(ggplot(to_plot,
    aes(x = "death_year", y = "scaled_bill_value", shape = "factor(gender)")) + 
    geom_point() +
    labs(x = "Death Year", y = "Scaled Bill Value",
         title = "Does death year affect bill value?", shape = "Gender")
)

../_images/9638a2510f593dba39ee1f31137e7815d4ab8c7721d25afc3ec0b99c8fc83f3e.png

3.7.5. Saving Plots#

If you assign a plot to a variable, you can use the save method or the ggsave function to save that plot to a file:

plot = (
    ggplot(to_plot,
    aes(x = "death_year", y = "scaled_bill_value", shape = "factor(gender)")) +
    geom_point() +
    labs(x = "Death Year", y = "Scaled Bill Value", 
         title = "Does death year affect bill value?", shape = "Gender")
)

ggsave(plot, "myplot.pdf")

The file format is selected automatically based on the extension. Common formats are PNG and PDF.

3.7.6. Example: Bar Plot#

Now suppose you want to plot the number of banknotes with people from each profession in the banknotes data set. A bar plot is an appropriate way to represent this visually.

The geometry for a bar plot is geom_bar. Since bar plots are mainly used to display frequencies, the geom_bar function by default counts the number of rows in each category of the x aesthetic. Here we supply the categories with the factor() syntax from above.

We can also use a fill color to further break down the bars by gender. Here’s the code to make the bar plot:

(ggplot(to_plot,
    aes(x = "factor(profession)", fill = "factor(gender)")) +
    geom_bar(position = "dodge") + 
    theme(axis_text_x=element_text(rotation = 90))
)

../_images/6b63f3014e7b3bbd8d6b6c56f41bf93057aa92704a0599ffe6423034c7b4e790.png

The setting position = "dodge" instructs geom_bar to put the bars side-by-side rather than stacking them. Adding theme allows you to change how the axis labels and ticks are formatted. Here, it rotates the x-axis labels by 90 degrees.

In some cases, you may want to make a bar plot with frequencies you’ve already computed. To prevent geom_bar from computing frequencies automatically, set stat = "identity".

3.7.7. Visualization Design#

Designing high-quality visualizations goes beyond just mastering which Python functions to call. You also need to think carefully about what kind of data you have and what message you want to convey. This section provides a few guidelines.

The first step in data visualization is choosing an appropriate kind of plot. Here are some suggestions (not rules):

Feature 1	Feature 2	Plot
categorical		bar, dot
categorical	categorical	bar, dot, mosaic
numerical		box, density, histogram
numerical	categorical	box, density, ridge
numerical	numerical	line, scatter, smooth scatter

If you want to add a:

3rd numerical feature, use it to change point/line size
3rd categorical feature, use it to change point/line style
4th categorical feature, use side-by-side plots

Once you’ve selected a plot, here are some rules you should almost always follow:

Always add a title and axis labels. These should be descriptive, not variable names!
Specify units after the axis label if the axis has units. For instance, “Height (ft)”
Don’t forget that many people are colorblind! Also, plots are often printed in black and white. Use point and line styles to distinguish groups; color is optional
Add a legend whenever you’ve used more than one point or line style
Always write a few sentences explaining what the plot shows. Don’t describe the plot, because the reader can just look at it. Instead, explain what they can learn from the plot and point out important details that may be overlooked
For side-by-side plots, use the same axis scales for both plots so that comparing them is not deceptive

Visualization design is a deep topic, and whole books have been written about it. One resource where you can learn more is DataLab’s Principles of Data Visualization Workshop Reader.

3.8. Exercises#

3.8.1. Exercise#

Compute the number of banknotes that feature a person who died before 1990.
Of those people, how many were activists?

3.8.2. Exercise#

Compute the range of first_appearance_year for each country.
- Hint: this would be a good place to try out a multi-function aggregation…
How many unique values are there among the first and last values of first_appearance_year?

3.8.3. Exercise#

Compute the set of banknotes that feature a person who died in this century.
Use plotnine’s geom_segment function to create a plot which shows the timespan between death year and first appearance as a horizontal segment for each banknote. Put the name of each person on the y-axis. Color-code the segments by gender.
- Hint: you can make the plot more visually appealing if you first sort by death year. You can use the .sort_values method to sort a DataFrame on a column, or set of columns. Be aware that the default value for one of the parameters is probably not what you’re expecting.