2. Data Structures

The previous chapter introduced Python, providing enough background to do simple computations on data sets. This chapter focuses on the foundational knowledge and skills you’ll need to use Python effectively in the long term. Specifically, it begins with a deep dive into data structures and data types in Python and Pandas. Then, it explains how to use this knowledge during data analysis.

Learning Objectives

  • Create Pandas Series, NumPy arrays, lists, and tuples

  • Check the type and class of an object

  • Convert an object into a different type

  • Describe and differentiate None, NA, and NaN

  • Index sequences with empty, integer, string, and logical arguments

  • Negate or combine conditions with logic operations

  • Subset Series objects and DataFrames

  • Find and remove missing values in a DataFrame and Series

2.1. Setup

2.1.1. Packages

We will be working with two packages in this chapter, NumPy and Pandas. Start by using what you learned in Section 1.4.2 to load these packages with their conventional aliases:

import numpy as np
import pandas as pd

2.1.2. Data

Section 1 described how to use the Pandas package to load a tabular dataset into a DataFrame. As an example, you saw how to load the banknotes dataset. You’ll need that dataset for the examples in this chapter as well, so load a fresh copy of it:

banknotes = pd.read_csv("data/banknotes.csv")
banknotes.head()
currency_code country currency_name name gender bill_count profession known_for_being_first current_bill_value prop_total_bills first_appearance_year death_year comments hover_text has_portrait id scaled_bill_value
0 ARS Argentina Argentinian Peso Eva Perón F 1.0 Activist No 100 NaN 2012 1952 NaN NaN True ARS_Evita 1.000000
1 ARS Argentina Argentinian Peso Julio Argentino Roca M 1.0 Head of Gov't No 100 NaN 1988 1914 NaN NaN True ARS_Argentino 1.000000
2 ARS Argentina Argentinian Peso Domingo Faustino Sarmiento M 1.0 Head of Gov't No 50 NaN 1999 1888 NaN NaN True ARS_Domingo 0.444444
3 ARS Argentina Argentinian Peso Juan Manuel de Rosas M 1.0 Politician No 20 NaN 1992 1877 NaN NaN True ARS_Rosas 0.111111
4 ARS Argentina Argentinian Peso Manuel Belgrano M 1.0 Founder Yes 10 NaN 1970 1820 Came up with the first Argentine flag. Designed first Argentine flag True ARS_Belgrano 0.000000

Now you’re ready for the chapter.

2.2. Containers for Data

A data structure is a collection of data organized in a particular way. In Python, data structures are also called containers, because they contain data. Containers make working with lots of data manageable and efficient. DataFrames, introduced in the previous chapter, are an example of a two-dimensional data structure. In this section, you’ll learn about several one-dimensional data structures that are fundamental to programming in Python.

2.2.1. Pandas Series

Recall that you can select a single column from a DataFrame with the square brackets [ ]. Select the current_bill_value column from the banknotes dataset:

banknotes["current_bill_value"]
0      100
1      100
2       50
3       20
4       10
      ... 
274     10
275     20
276     50
277    100
278    200
Name: current_bill_value, Length: 279, dtype: int64

Notice that Python prints the column differently from the banknotes DataFrame. For instance, the length of the column is displayed rather than the number of rows. This is a visual clue: the column is not a DataFrame. Instead, it’s a Pandas Series, a one-dimensional container for values. Every column in a DataFrame is a Series.

Series and DataFrames are the fundamental data structures for data analysis in Pandas, and they have many common features. As you’ll learn in this chapter, both allow you to:

  • summarize data

  • handle missing data

  • reshape and transform data

  • subset and filter data

  • merge and combine data

You’ll be working with the current_bill_value column for the next few examples, so go ahead and assign it to a variable:

bill_value = banknotes["current_bill_value"]

The values in a Series (and many other kinds of data structures) are called elements, and the length of a Series is the number of elements it contains. Series are ordered, which means the elements have specific positions. The 1st element in bill_value is 100, the 2nd element is again 100, the 3rd element is 50, the 4th is 20, and so on.

You can get the length of a Series and many other types of objects with Python’s built-in len function:

len(bill_value)
279

A Series can also contain metadata, extra information about its elements. Metadata can usually be accessed through attributes (see Section 1.2.5). Here are a few examples of metadata this Series contains (more on this later):

bill_value.name
'current_bill_value'
bill_value.shape
(279,)

Finally, notice that the elements of bill_value are all integers. For any given Series, the elements will usually all be the same qualitative type of data (integers, decimal numbers, strings, and so on). In other words, the elements are usually homogeneous. There are some exceptions, and you’ll learn more about element types in Section 2.3.

2.2.2. NumPy Arrays

Under the hood, Pandas Series are based on another data structure, the NumPy array (or ndarray). You can think of a NumPy array as a stripped-down Series: an ordered, one-dimensional container for values without the extra metadata and functionality.

Tip

Series tend to be a good choice for data analysis, while arrays tend to be a good choice for sophisticated mathematical computations (such as simulations).

Most examples in this reader use Series and DataFrames, but the text points out wherever it’s important to use a NumPy array instead.

You can convert a Series to an array with the .to_numpy method:

bill_value.to_numpy()
array([   100,    100,     50,     20,     10,     50,     10,     20,
           10,     50,    100,    100,     20,      5,     20,   1000,
          500,    200,    100,     50,     10,      5,      2,     50,
           50,     10,    100,    200,    100,     50,     10,     10,
           20,    200,     20,    200,    100,     20,      5,     50,
          100,     10,     10,     10,     10,     10,     20,   5000,
         1000,  10000,  20000,   2000,  50000,  20000,   2000,  10000,
       100000,   1000,  20000,   5000,  50000,   2000,  10000,  10000,
         5000,  20000,  50000,   2000,   1000,   2000,   5000,   1000,
          200,    500,   5000,   2000,   1000,    500,    200,    100,
          200,    200,    200,    500,   2000,    100,   2000,    100,
          100,     20,    500,      5,     50,     20,     10,      5,
           10,     20,     50,      5,      2,    200,      1,     10,
           20,     50,    100,    500, 100000,   5000, 100000,  10000,
        50000,  20000,  50000,  20000,   2000,   1000,   2000,  10000,
          200,    100,     50,     20,   2000,    500,  10000,   5000,
         1000,    500,   1000,    100,     50,    500,   1000,  10000,
         5000,   5000,    500,    200,     20,    100,     50,   1000,
         5000,   1000,  50000,  10000,   1000,    200,     20,     50,
          500,   2000,    100,    500,    500,   1000,   1000,   1000,
          500,     20,    200,    200,    100,    100,     20,    500,
         1000,     10,    100,      5,     20,    200,   1000,      5,
           50,    100,     10,     20,    100,     20,     10,     10,
           50,    100,    200,     50,    500,    100,    500,    200,
           20,     50,   1000,   1000,   1000,      1,      5,     10,
           20,     50,    100,   5000,   2000,    100,   1000,    500,
          200,     50,     10,     20,    200,    500,     20,    100,
           50,   1000, 100000,    200,    500,   1000,   5000,  10000,
        20000,  50000,     10,     20,     10,      5,      5,     10,
           20,     50,    100,    200,     10,      5,     50,     20,
          100,    200,     50,   1000,     20,    200,    100,    500,
           10,      5,      2,      1,     50,      2,      5,      1,
           20,    100,     10,   1000,    100,     20,    500,    200,
           50,   2000,    500,     50,      2,    200,     20,    100,
           10,      5,     10,     20,     50,    100,    200])

Conversely, you can convert an array into a Series with the pd.Series function. You’ll see some examples of this function later on.
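
As a minimal sketch of the round trip, the example below wraps a small made-up array in a Series and then converts it back. The values and the name example_value are placeholders for illustration, not part of the banknotes data.

```python
import numpy as np
import pandas as pd

# A small made-up array standing in for real data.
values = np.array([100, 50, 20])

# Wrap the array in a Series; Pandas adds a default 0, 1, 2, ... index.
as_series = pd.Series(values, name="example_value")

# Convert back to a plain NumPy array.
as_array = as_series.to_numpy()

print(type(as_series))
print(type(as_array))
```

The Series carries a name and an index, while the array returned by .to_numpy holds only the values.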

2.2.3. Lists

Series and arrays are designed for data analysis and mathematical computations, respectively. In contrast, a list is a general-purpose one-dimensional container. Lists are built into Python, so they’re probably the most common kind of container, and you don’t need to load any modules in order to use them.

You can make a list by enclosing comma-separated values in square brackets [], like this:

x = [1, 2, 3]
x
[1, 2, 3]

Like a Series, a list is ordered, so it has a first element, second element, and so on up to the length of the list. You can get the length of a list with the len function:

len(x)
3

Lists can be empty:

[]
[]

You can convert many types of objects into lists with the list function:

list(bill_value)
[100,
 100,
 50,
 20,
 10,
 50,
 10,
 20,
 10,
 50,
 100,
 100,
 20,
 5,
 20,
 1000,
 500,
 200,
 100,
 50,
 10,
 5,
 2,
 50,
 50,
 10,
 100,
 200,
 100,
 50,
 10,
 10,
 20,
 200,
 20,
 200,
 100,
 20,
 5,
 50,
 100,
 10,
 10,
 10,
 10,
 10,
 20,
 5000,
 1000,
 10000,
 20000,
 2000,
 50000,
 20000,
 2000,
 10000,
 100000,
 1000,
 20000,
 5000,
 50000,
 2000,
 10000,
 10000,
 5000,
 20000,
 50000,
 2000,
 1000,
 2000,
 5000,
 1000,
 200,
 500,
 5000,
 2000,
 1000,
 500,
 200,
 100,
 200,
 200,
 200,
 500,
 2000,
 100,
 2000,
 100,
 100,
 20,
 500,
 5,
 50,
 20,
 10,
 5,
 10,
 20,
 50,
 5,
 2,
 200,
 1,
 10,
 20,
 50,
 100,
 500,
 100000,
 5000,
 100000,
 10000,
 50000,
 20000,
 50000,
 20000,
 2000,
 1000,
 2000,
 10000,
 200,
 100,
 50,
 20,
 2000,
 500,
 10000,
 5000,
 1000,
 500,
 1000,
 100,
 50,
 500,
 1000,
 10000,
 5000,
 5000,
 500,
 200,
 20,
 100,
 50,
 1000,
 5000,
 1000,
 50000,
 10000,
 1000,
 200,
 20,
 50,
 500,
 2000,
 100,
 500,
 500,
 1000,
 1000,
 1000,
 500,
 20,
 200,
 200,
 100,
 100,
 20,
 500,
 1000,
 10,
 100,
 5,
 20,
 200,
 1000,
 5,
 50,
 100,
 10,
 20,
 100,
 20,
 10,
 10,
 50,
 100,
 200,
 50,
 500,
 100,
 500,
 200,
 20,
 50,
 1000,
 1000,
 1000,
 1,
 5,
 10,
 20,
 50,
 100,
 5000,
 2000,
 100,
 1000,
 500,
 200,
 50,
 10,
 20,
 200,
 500,
 20,
 100,
 50,
 1000,
 100000,
 200,
 500,
 1000,
 5000,
 10000,
 20000,
 50000,
 10,
 20,
 10,
 5,
 5,
 10,
 20,
 50,
 100,
 200,
 10,
 5,
 50,
 20,
 100,
 200,
 50,
 1000,
 20,
 200,
 100,
 500,
 10,
 5,
 2,
 1,
 50,
 2,
 5,
 1,
 20,
 100,
 10,
 1000,
 100,
 20,
 500,
 200,
 50,
 2000,
 500,
 50,
 2,
 200,
 20,
 100,
 10,
 5,
 10,
 20,
 50,
 100,
 200]

Unlike a Series, the elements of a list can be qualitatively different. There is no expectation that they will be homogeneous. For instance, this list contains a number, string, and another list (with one element):

li = [8, "hello", [4.2]]
li
[8, 'hello', [4.2]]

2.2.4. Indexing

So far you’ve learned two ways to use square brackets []:

  1. To select columns from a DataFrame, as in banknotes["country"]

  2. To create lists, as in ["a", "b", 1]

The first case is an example of indexing, which means getting or setting elements of a container. The square brackets [] are Python’s indexing operator.

You can use indexing to get an element of a list based on the element’s position. Python uses zero-based indexing, which means the positions of elements are counted starting from 0 rather than 1. So the first element of a list is at position 0, the second is at position 1, and so on.

Note

Many programming languages use zero-based indexing. It may seem strange at first, but it makes some kinds of computations simpler by eliminating the need to add or subtract 1.

The indexing operator requires at least one argument, called the index, which goes inside of the square brackets []. The index says which elements you want to get. For DataFrames, you used column names as the index. For a list, you can use a position. So the code to get the first element of the list li is:

li[0]
8

Likewise, to get the third element:

li[2]
[4.2]

The same idea extends to containers stored inside of other containers. For example, to get the value stored in the list inside of li:

li[2][0]
4.2

You can set an element of a list by assigning a value at that index. So the code to change the first element of li to the string “hi” is:

li[0] = "hi"
li
['hi', 'hello', [4.2]]

2.2.5. References

Assigning elements of a container is not without complication. Suppose you assign a list to a variable x and then create a new variable, y, from x. If you change an element of y, it will also change x:

x = [1, 2]
y = x
y[0] = 10
x
[10, 2]

This happens because of how Python handles containers. When you create a container, Python stores it in your computer’s memory. If you then assign the container to a variable, the variable points, or refers, to the location of the container in memory. If you create a second variable from the first, both will refer to the same location. As a result, operations on one variable will affect the value of the other, because there’s really only one container in memory and both variables refer to it.

The example above uses lists, but other containers such as Series and DataFrames behave the same way. The variable bill_value is just a reference to a column in the banknotes DataFrame.

If you want to assign an independent copy of a container to a variable rather than a reference, you need to use a function or method to explicitly make a copy. Many containers have a .copy method that makes a copy:

x = [1, 2]
y = x.copy()
y[0] = 10
x
[1, 2]
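
If you are unsure whether two variables refer to the same container, one way to check is Python's built-in is operator, which tests identity rather than equality. The sketch below is illustrative; the variable names same and copied are made up for this example.

```python
x = [1, 2]
same = x            # a second name for the same list
copied = x.copy()   # a new, independent list with equal contents

print(same is x)     # True: both names refer to one list in memory
print(copied is x)   # False: the copy is a separate list
print(copied == x)   # True: the contents are still equal
```

Note that the .copy method of a list makes a shallow copy: the new list is independent, but any containers nested inside it are still shared with the original.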

2.2.6. Tuples

References can be confusing, and if you know that the elements of a container shouldn’t change, one way to prevent problems is to use a tuple. Like a list, a tuple is a one-dimensional container. The key difference is that tuples are immutable: once you create a tuple, you cannot add, remove, or replace its elements.

You can make a tuple by enclosing comma-separated values in parentheses (), like this:

(1, 2)
(1, 2)

You can also convert another container into a tuple with the tuple function:

x = [1, 2]
y = x
x = tuple(x)
y[0] = 10
x
(1, 2)
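
To see immutability in action, the short sketch below tries to replace an element of a tuple and catches the resulting error. The variable names point and message are made up for this example.

```python
point = (1, 2)

try:
    point[0] = 10  # tuples do not support item assignment
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

print(point)  # the tuple is unchanged
```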

2.3. Data Types

Data can be categorized into different types based on sets of shared characteristics. For instance, statisticians tend to think about whether data are numeric or categorical:

  • numeric

    • continuous (real or complex numbers)

    • discrete (integers)

  • categorical

    • nominal (categories with no ordering)

    • ordinal (categories with some ordering)

Of course, other types of data, like graphs (networks) and natural language (books, speech, and so on), are also possible. Categorizing data this way is useful for reasoning about which methods to apply to which data.

Python and most other programming languages also categorize data by type. To check the type of an object in Python, use the built-in type function. Recall you used this function to check the type of the banknotes DataFrame in Section 1.7:

type(banknotes)
pandas.core.frame.DataFrame

Take a look at the types of a few other objects:

type(bill_value)
pandas.core.series.Series
type(bill_value[0])
numpy.int64
type("hi")
str
type(x)
tuple

Note

In Python 3, class is just another word for type. The type function returns the class of an object. Python also provides a class keyword to create your own classes. Creating classes is beyond the scope of this reader, but is explained in detail in most Python programming textbooks.

For Pandas Series and DataFrames, the type function returns the type of container, but doesn’t return any information about the types of the elements. The same is true for NumPy arrays.

Section 1.7.1 described one way to print the types of the elements in a Pandas object: by calling the .info method. In the printout, the element types are listed in the Dtype column:

banknotes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279 entries, 0 to 278
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   currency_code          279 non-null    object 
 1   country                279 non-null    object 
 2   currency_name          279 non-null    object 
 3   name                   279 non-null    object 
 4   gender                 279 non-null    object 
 5   bill_count             279 non-null    float64
 6   profession             279 non-null    object 
 7   known_for_being_first  279 non-null    object 
 8   current_bill_value     279 non-null    int64  
 9   prop_total_bills       59 non-null     float64
 10  first_appearance_year  279 non-null    int64  
 11  death_year             272 non-null    object 
 12  comments               119 non-null    object 
 13  hover_text             89 non-null     object 
 14  has_portrait           279 non-null    bool   
 15  id                     279 non-null    object 
 16  scaled_bill_value      278 non-null    float64
dtypes: bool(1), float64(3), int64(2), object(11)
memory usage: 35.3+ KB

The column label Dtype is short for “data type”. You can also access the element types for a DataFrame with the .dtypes attribute:

banknotes.dtypes
currency_code             object
country                   object
currency_name             object
name                      object
gender                    object
bill_count               float64
profession                object
known_for_being_first     object
current_bill_value         int64
prop_total_bills         float64
first_appearance_year      int64
death_year                object
comments                  object
hover_text                object
has_portrait                bool
id                        object
scaled_bill_value        float64
dtype: object

For a Series or NumPy array, you can instead use .dtype to get the element type:

bill_value.dtype
dtype('int64')

Some of the element types listed for the banknotes DataFrame are built into Python, while others are provided by Pandas and NumPy. At the expense of being more complicated, the Pandas/NumPy types tend to be more specific and consistent. They provide programmers with greater control over how data are stored in memory, which makes it possible to write more efficient code. For computations that generate or process a large amount of data, as is often the case in research computing, efficiency is a major concern.

Here’s a non-exhaustive table of data types that you’ll often encounter in data analysis:

Built-in   Pandas/NumPy            Example        Description
--------   ---------------------   ------------   ---------------
bool                               True, False    Boolean values
int        int32, int64            -8, 0, 42      Whole numbers
float      float32, float64        -2.1, 0.5      Decimal numbers
complex    complex64, complex128   3j, 1-2j       Complex numbers
str                                "hi", "2.1"    Text strings
datetime   datetime64                             Dates and times

For most of the built-in types, you can explicitly construct an object with that type by calling the function with the same name as the type. For instance, here’s a way to construct an integer (type int):

n = int(4)
type(n)
int

This example is a bit silly, since you could just write 4 instead of int(4) and you’d still get an integer:

n = 4
type(n)
int

That said, suppose you want to construct an integer from a decimal number:

n = int(4.67)
type(n)
int
n
4

Calling int forces the value to be an integer, and the digits after the decimal point are discarded. The value is truncated, not rounded, which is why the result is 4 rather than 5.

Decimal numbers like 4.67 are better represented by a floating point number, or float. Use this when you need decimal precision of any kind:

n = 4.67
type(n)
float
n
4.67

Notice that the Pandas/NumPy types have the same names as the built-in types, but with a number appended to the end. Section 2.3.3 explains what those numbers mean.

2.3.1. Strings & The object Dtype

Strings (type str), which were introduced in Section 1.3, are a bit more complicated than Boolean values and numbers because they have many attributes and methods associated with them.

Recall that you can use double " or single ' quotes to construct a string:

"Hello, world!"
'Hello, world!'

In Pandas and NumPy, strings are usually associated with the object data type (printed as object or O). For example, look at the name column in the banknotes data:

banknotes["name"]
0                       Eva Perón
1            Julio Argentino Roca
2      Domingo Faustino Sarmiento
3            Juan Manuel de Rosas
4                 Manuel Belgrano
                  ...            
274                Nelson Mandela
275                Nelson Mandela
276                Nelson Mandela
277                Nelson Mandela
278                Nelson Mandela
Name: name, Length: 279, dtype: object

The object data type is provided as a catch-all for non-numeric data types. For example, if you create a Series from several different types of data, Pandas will choose object as the element type:

mixed = pd.Series(["hi", 1, True])
mixed
0      hi
1       1
2    True
dtype: object

The individual elements of an object Series retain their original data types:

type(mixed[0])
str
type(mixed[2])
bool

So one way to think about the object data type is as an invisible wrapper around each element’s original type. The Series can claim all of its elements are generic “objects”, but when you access an element the wrapper is peeled off and you get the original type.

You’re most likely to encounter the object type when working with Series or arrays of strings. In that case, you can generally assume all of the elements are type str. If you’re ever unsure of the type of an element, you can always use type to check.
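
To check every element at once rather than one at a time, one option is to apply type across the whole Series with the .map method. This is a minimal sketch using the same mixed Series as above; the variable name element_types is made up for this example.

```python
import pandas as pd

mixed = pd.Series(["hi", 1, True])

# Apply the built-in type function to each element.
element_types = mixed.map(type)
print(element_types)
```

The result is itself a Series, with one type per element of the original.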

Note

NumPy’s string types are fixed-width, meaning every element reserves as much memory as the longest string in the array, which makes them a poor fit for text of varying lengths. Since Pandas is based on NumPy, until recently Pandas didn’t have a dedicated string type either. Instead, Pandas stores strings as ordinary Python str objects and reports object as the element type.

As of Pandas 1.0, the developers have added an experimental string type so that users can distinguish Series of strings from Series of mixed types. Hopefully in the future the string type will become the main way to handle strings rather than an experimental feature.
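
As a hedged illustration, the sketch below opts in to the dedicated string type by passing "string" to .astype. The two names are taken from the banknotes data, but the Series itself is built by hand for this example, and the dtypes printed may differ depending on your Pandas version.

```python
import pandas as pd

names = pd.Series(["Eva Perón", "Manuel Belgrano"])

# Opt in to the dedicated string type.
as_string = names.astype("string")

print(names.dtype)
print(as_string.dtype)

# String methods work the same way on either kind of Series.
print(as_string.str.upper())
```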

2.3.2. Coercion & Conversion

Although bool, int, and float are different types, in most situations Python will automatically convert between them as needed. For example, you can multiply a floating point number by an integer and then add a Boolean value:

n = 3.1 * 2 + True
n
7.2

First, the integer 2 is converted to a floating point number and multiplied by 3.1, yielding 6.2. Then the Boolean True is converted to a floating point number and added to 6.2. In Python and most other programming languages, False corresponds to 0 and True corresponds to 1. Thus the result is 7.2, a floating point number:

type(n)
float

This automatic conversion of types is known as implicit coercion. Conversion always proceeds from less general to more general types, so that no information is lost.

Implicit coercion usually only applies to numeric types (including Boolean values). Mixing other types will usually cause an error. For instance, you can’t add a number to a string:

"hi" + 1
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[44], line 1
----> 1 "hi" + 1

TypeError: can only concatenate str (not "int") to str

Implicit coercion also works for numeric Pandas/NumPy types. For example, you can multiply every element of bill_value by 1.5:

bill_value * 1.5
0      150.0
1      150.0
2       75.0
3       30.0
4       15.0
       ...  
274     15.0
275     30.0
276     75.0
277    150.0
278    300.0
Name: current_bill_value, Length: 279, dtype: float64

Notice that the dtype has changed from int64 to float64.

Type conversion is when you explicitly convert an object from one type to another. You already saw examples of this with the int function in Section 2.3. Here are a few more:

bool(0)
False
str(105)
'105'

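Python can even convert strings into numbers. Strings can be converted into Boolean values as well, but the result depends only on whether the string is empty, not on what it says: any non-empty string converts to True: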

float("7.3")
7.3
bool("True")
True
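
The Boolean conversion deserves a closer look, because bool does not read the text of a string. The sketch below shows the pitfall, along with one possible way to parse the text instead. The helper parse_bool is hypothetical and written only for this example; it is not a built-in function.

```python
# Any non-empty string is truthy, regardless of its text.
print(bool("False"))  # True
print(bool(""))       # False: only the empty string converts to False


def parse_bool(text):
    """Hypothetical helper: True only if the text spells 'true'."""
    return text.strip().lower() == "true"


print(parse_bool("False"))  # False
print(parse_bool("True"))   # True
```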

Note however that such operations have to be logically sound. This will not work:

int("Hello world!")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[50], line 1
----> 1 int("Hello world!")

ValueError: invalid literal for int() with base 10: 'Hello world!'

For a Pandas Series or NumPy array, you can use the .astype method to convert the elements to a specific type. Pass the name of the target type to the method as a string. For example, here’s how to convert the bill_value elements to float64:

bill_value.astype("float64")
0      100.0
1      100.0
2       50.0
3       20.0
4       10.0
       ...  
274     10.0
275     20.0
276     50.0
277    100.0
278    200.0
Name: current_bill_value, Length: 279, dtype: float64
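
The .astype method raises an error if any element cannot be converted. When a column of text contains a few entries that are not numbers, one alternative is the pd.to_numeric function with errors="coerce", which turns those entries into missing values. The sketch below uses made-up values rather than the banknotes data.

```python
import pandas as pd

# Made-up values: two years stored as text, plus one entry that is not a number.
years = pd.Series(["1952", "1914", "unknown"])

# errors="coerce" replaces entries that cannot be parsed with NaN
# instead of raising an error.
numeric_years = pd.to_numeric(years, errors="coerce")
print(numeric_years)
```

Because NaN is a floating point value, the resulting Series has a float64 element type rather than int64.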

2.3.3. Bit Sizes & Memory

Recall the table of types from the beginning of Section 2.3. The names of the Pandas/NumPy types in the table all end in numbers such as 32 or 64. These numbers indicate the bit size, the number of bits of memory used to store a value of that type.

For example, a single value with type int64 uses 64 bits of memory. So the int64 Series bill_value uses about 64 bits per element, for a total of:

64 * len(bill_value) # bits
17856
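
You can compare an estimate like this against what NumPy reports through the .nbytes attribute of an array, which counts bytes (8 bits each) used by the elements alone. The sketch below uses a made-up Series of 279 int64 values as a stand-in for bill_value.

```python
import numpy as np
import pandas as pd

# Stand-in for bill_value: 279 made-up int64 values.
values = pd.Series(np.arange(279, dtype="int64"))

estimated_bits = 64 * len(values)
reported_bytes = values.to_numpy().nbytes  # bytes used by the elements alone

print(estimated_bits)      # 17856
print(reported_bytes * 8)  # 17856, since there are 8 bits per byte
```

The estimate covers only the elements. A Series also stores an index and other metadata, so its total memory use is somewhat higher.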

In contrast, Python’s built-in data types don’t specify how much memory they use:

type(3)
int

In fact, the amount of memory is not fixed: it grows with the size of the number being stored and can vary depending on your computer’s hardware and operating system!

Why is bit size important? Your computer has a limited amount of memory, so it’s a good habit to only use what you need. The tradeoff is that using more memory allows you to use larger or more precise numbers.

For instance, a 64-bit integer can hold values between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807, whereas a 16-bit integer can only hold values between -32,768 and 32,767. Understanding this matters when you’re working with large numbers. Without that knowledge, you might assign 32,768 to an int16 variable and find that you’ve caused an overflow error.
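
You don't need to memorize these limits, because NumPy can report them through the np.iinfo function. The sketch below also shows one way an overflow can appear in practice: in array arithmetic, a result that does not fit wraps around to the other end of the range without raising an error. The variable names are made up for this example.

```python
import numpy as np

print(np.iinfo(np.int16).min, np.iinfo(np.int16).max)  # -32768 32767
print(np.iinfo(np.int64).max)                          # 9223372036854775807

# Adding 1 to the largest int16 value in an array wraps around silently.
largest = np.array([32767], dtype=np.int16)
one = np.array([1], dtype=np.int16)
wrapped = largest + one
print(wrapped)  # [-32768]
```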

The same holds for instances where you need a certain amount of precision in your data. For example, Python and NumPy can approximate irrational numbers, like pi. Ultimately, your computer has to represent such numbers with a finite number of digits, so the number of digits a type can hold will affect what pi “means” in your code:

np.pi
3.141592653589793
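
To see how bit size affects precision, the sketch below stores pi as a 64-bit and as a 32-bit floating point number and measures the gap between them. The variable names are made up for this example.

```python
import numpy as np

pi64 = np.float64(np.pi)
pi32 = np.float32(np.pi)

print(pi64)
print(pi32)

# The 32-bit value keeps fewer digits, so the two approximations differ.
difference = abs(float(pi64) - float(pi32))
print(difference)
```

The difference is tiny, but it can accumulate over many computations, which is one reason float64 is the usual default.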

Tip

You can also use bit sizes to estimate the amount of memory your data will require, as we did for the bill_value object. When a computation runs out of memory, an estimate of how much memory is necessary can help you understand whether to get better hardware or to change your computing strategy.

2.4. Indexing in Pandas

If you want to inspect the elements of a Series more closely, you can use indexing. Conceptually, indexing a Series is very similar to indexing a list or tuple, but Pandas offers additional ways to select and subset data via indexes.

2.4.1. What’s an Index?

A Pandas index is more than just a positional location. Indexes serve three important roles:

  1. As metadata to provide additional context about a data set

  2. As a way to explicitly and automatically align data

  3. As a convenience for getting and setting subsets of data
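
The second role, alignment, is easiest to see with a small example. In the sketch below, two made-up Series share the same labels in a different order, and Pandas matches the elements by label rather than by position when adding them.

```python
import pandas as pd

# Two made-up Series with the same labels in a different order.
a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "x"])

# Addition matches elements by label, not by position.
total = a + b
print(total)
```

The element labeled x is 1 + 20 and the element labeled y is 2 + 10, even though the values sit at different positions in the two Series.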

The index of a Series is available via the .index attribute:

bill_value.index
RangeIndex(start=0, stop=279, step=1)

Index labels can be numbers, strings, dates, or other values. Pandas provides subclasses of Index for specific purposes, which you can read more about in the Pandas documentation.

Indexes, like tuples, are immutable: the labels in an index cannot be changed in place. That said, names can be changed. Every index has a name, and so does every Series. For example, the code below renames the bill_value Series:

bill_value.name = "bill_value"
bill_value
0      100
1      100
2       50
3       20
4       10
      ... 
274     10
275     20
276     50
277    100
278    200
Name: bill_value, Length: 279, dtype: int64

The above also applies to DataFrames:

banknotes.index
RangeIndex(start=0, stop=279, step=1)

Oftentimes, an index is a range of numbers, but this can be changed. The code below uses the .set_index method to change the index of banknotes to country:

banknotes.set_index("country", inplace = True)
banknotes.index
Index(['Argentina', 'Argentina', 'Argentina', 'Argentina', 'Argentina',
       'Australia', 'Australia', 'Australia', 'Australia', 'Australia',
       ...
       'Venezuela', 'Venezuela', 'Venezuela', 'Venezuela', 'Venezuela',
       'South Africa', 'South Africa', 'South Africa', 'South Africa',
       'South Africa'],
      dtype='object', name='country', length=279)

The inplace argument instructs Pandas to modify banknotes directly rather than return a new DataFrame, so that we don’t have to assign the result back to banknotes.
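
For comparison, here is a minimal sketch of the same operation without inplace, using a tiny made-up table in place of banknotes. It also shows the .reset_index method, which undoes the change by moving the index back into an ordinary column.

```python
import pandas as pd

# A tiny made-up table standing in for banknotes.
df = pd.DataFrame({
    "country": ["Peru", "Serbia"],
    "name": ["Pedro Paulet", "Nikola Tesla"],
})

# Without inplace, set_index returns a new DataFrame and leaves df untouched.
by_country = df.set_index("country")
print(df.index)
print(by_country.index)

# reset_index moves the index back into an ordinary column.
restored = by_country.reset_index()
print(restored.columns.tolist())
```

Assigning the result to a new variable keeps the original DataFrame available, which can make it easier to retrace your steps.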

2.4.2. Indexing by Position

Changing the index affects how you select elements. There are three main methods for accessing specific values in Pandas:

  1. By integer position

  2. By label/name

  3. By a condition

To access elements in a Series by integer position, use .iloc:

bill_value.iloc[5]
np.int64(50)

To get several elements at once, pass a list of positions to .iloc:

bill_value.iloc[[5, 15, 25, 35]]
5       50
15    1000
25      10
35     200
Name: bill_value, dtype: int64

Use a slice to select a range of elements. The syntax for a slice is start:stop:step, where every part is optional and the second colon can be omitted along with step. The element at position stop is not included in the result. This syntax also applies to lists. For example:

bill_value.iloc[0:5]
0    100
1    100
2     50
3     20
4     10
Name: bill_value, dtype: int64
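
Since slices work the same way on lists, a short list makes it easy to check your understanding of start, stop, and step. The list letters below is made up for this example.

```python
letters = ["a", "b", "c", "d", "e", "f"]

print(letters[1:4])  # ['b', 'c', 'd']: position 4 is not included
print(letters[:3])   # ['a', 'b', 'c']: start defaults to the beginning
print(letters[::2])  # ['a', 'c', 'e']: every second element
```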

Below, we use a slice to get every twentieth element in the Series:

bill_value.iloc[::20]
0        100
20        10
40       100
60     50000
80       200
100        2
120      200
140       20
160      500
180      100
200       20
220      500
240      100
260      100
Name: bill_value, dtype: int64

Slices also accept negative values, which count back from the end of a sequence. For instance:

bill_value.iloc[-5:]
274     10
275     20
276     50
277    100
278    200
Name: bill_value, dtype: int64

The result is the same as if you had used the .tail method:

bill_value.tail()
274     10
275     20
276     50
277    100
278    200
Name: bill_value, dtype: int64

2.4.3. Indexing by Label

Use .loc to index a Series or DataFrame by label:

banknotes.loc["Peru"]
currency_code currency_name name gender bill_count profession known_for_being_first current_bill_value prop_total_bills first_appearance_year death_year comments hover_text has_portrait id scaled_bill_value
country
Peru PEN Sol Jorge Basadre Grohmann M 1.0 Politician No 100 NaN 1991 1980 NaN NaN False PEN_Basadre 0.473684
Peru PEN Sol Raúl Porras Barrenechea M 1.0 Politician No 20 NaN 1991 1960 NaN NaN False PEN_Raul 0.052632
Peru PEN Sol María Isabel Granda y Larco F 1.0 Musician No 10 NaN 2021 1983 NaN NaN False PEN_Granda 0.000000
Peru PEN Sol José Abelardo Quiñones Gonzales M 1.0 Military No 10 NaN 1991 1941 NaN NaN False PEN_Abelardo 0.000000
Peru PEN Sol Abraham Valdelomar Pinto M 1.0 Writer No 50 NaN 1991 1919 NaN NaN False PEN_Pinto 0.210526
Peru PEN Sol Pedro Paulet M 1.0 STEM Yes 100 NaN 2021 1945 Alleged first person to build a liquid-propell... Alleged first person to build a liquid-propell... False PEN_Paulet 0.473684
Peru PEN Sol Santa Rosa de Lima F 1.0 Religious figure Yes 200 NaN 1995 1617 First catholic saint of the Americas First Catholic saint of the Americas False PEN_Santa 1.000000

You can select specific columns as well:

banknotes.loc["Peru", "name"]
country
Peru             Jorge Basadre Grohmann
Peru            Raúl Porras Barrenechea
Peru        María Isabel Granda y Larco
Peru    José Abelardo Quiñones Gonzales
Peru           Abraham Valdelomar Pinto
Peru                       Pedro Paulet
Peru                 Santa Rosa de Lima
Name: name, dtype: object

Just as with .iloc, it’s possible to pass sequences into .loc:

banknotes.loc[["Peru", "Serbia", "Ukraine"]]
currency_code currency_name name gender bill_count profession known_for_being_first current_bill_value prop_total_bills first_appearance_year death_year comments hover_text has_portrait id scaled_bill_value
country
Peru PEN Sol Jorge Basadre Grohmann M 1.0 Politician No 100 NaN 1991 1980 NaN NaN False PEN_Basadre 0.473684
Peru PEN Sol Raúl Porras Barrenechea M 1.0 Politician No 20 NaN 1991 1960 NaN NaN False PEN_Raul 0.052632
Peru PEN Sol María Isabel Granda y Larco F 1.0 Musician No 10 NaN 2021 1983 NaN NaN False PEN_Granda 0.000000
Peru PEN Sol José Abelardo Quiñones Gonzales M 1.0 Military No 10 NaN 1991 1941 NaN NaN False PEN_Abelardo 0.000000
Peru PEN Sol Abraham Valdelomar Pinto M 1.0 Writer No 50 NaN 1991 1919 NaN NaN False PEN_Pinto 0.210526
Peru PEN Sol Pedro Paulet M 1.0 STEM Yes 100 NaN 2021 1945 Alleged first person to build a liquid-propell... Alleged first person to build a liquid-propell... False PEN_Paulet 0.473684
Peru PEN Sol Santa Rosa de Lima F 1.0 Religious figure Yes 200 NaN 1995 1617 First catholic saint of the Americas First Catholic saint of the Americas False PEN_Santa 1.000000
Serbia RSD Serbian dinar Slobodan Jovanovic M 1.0 Head of Gov't No 5000 NaN 2003 1958 writer, politician, diplomat and prime ministe... NaN True RSD_Slobodan 1.000000
Serbia RSD Serbian dinar Milutin Milanković M 1.0 STEM No 2000 NaN 2011 1958 NaN NaN True RSD_Milutin 0.398798
Serbia RSD Serbian dinar Nikola Tesla M 1.0 STEM No 100 NaN 2003 1943 NaN NaN True RSD_Tesla 0.018036
Serbia RSD Serbian dinar Dorde Vajfert M 1.0 Other No 1000 NaN 2003 1937 was governer of the national Bank of Serbia. e... NaN True RSD_Dorde 0.198397
Serbia RSD Serbian dinar Jovan Cvijic M 1.0 STEM No 500 NaN 2004 1927 NaN NaN True RSD_Cvijic 0.098196
Serbia RSD Serbian dinar Nadežda Petrović F 1.0 Visual Artist No 200 NaN 2005 1915 NaN NaN True RSD_Petrovic 0.038076
Serbia RSD Serbian dinar Stevan Stevanovic Mokranjac M 1.0 Musician Yes 50 NaN 2005 1914 Composer. was part of Serbia's first string qu... Member of Serbia's first string quartet True RSD_Stevanovic 0.008016
Serbia RSD Serbian dinar Vuk Stefanovic Karadžic M 1.0 Writer Yes 10 NaN 2006 1864 wrote the 1st dictionary in the reformed Serbi... Wrote the first dictionary in the reformed Ser... True RSD_Vuk 0.000000
Serbia RSD Serbian dinar Petar Petrovic Njegoš M 1.0 Writer No 20 NaN 2006 1851 wrote some of the most imporant Serbian litera... NaN True RSD_Njegos 0.002004
Ukraine UAH hryvna Mykhailo Hrusheskyi M 1.0 Politician No 50 NaN 1992 1934 also an author, president of Central Rada (Sov... NaN False UAH_Mykhailo 0.049049
Ukraine UAH hryvna Volodymyr Vernadskyi M 1.0 STEM No 1000 NaN 2019 1945 NaN NaN False UAH_Vernadskyi 1.000000
Ukraine UAH hryvna Ivan Franko M 1.0 Writer Yes 20 NaN 1992 1916 1st author of detective novels and modern poet... First author of detective novels and modern po... False UAH_Franko 0.019019
Ukraine UAH hryvna Lesya Ukrainka F 1.0 Writer No 200 NaN 2001 1913 NaN NaN False UAH_Ukrainka 0.199199
Ukraine UAH hryvna Taras Shevchenko M 1.0 Writer No 100 NaN 1992 1841 NaN NaN False UAH_Taras 0.099099
Ukraine UAH hryvna Hryhoriy Skovoroda M 1.0 Writer No 500 NaN 2006 1794 NaN NaN False UAH_Skovoroda 0.499499
Ukraine UAH hryvna Ivan Mazepa M 1.0 Head of Gov't No 10 NaN 1992 1709 elected "Hetman of Zaporizhian Host" . also se... NaN False UAH_Mazepa 0.009009
Ukraine UAH hryvna Bogdan Khmelnitsky M 1.0 Head of Gov't No 5 NaN 1992 1657 1st "hetman of ukraine", also military leader NaN False UAH_Bogdon 0.004004
Ukraine UAH hryvna Yaroslav the Wise M 1.0 Monarch Yes 2 NaN 1992 1054 1st christian prince of Kiev First Christian Prince of Kiev False UAH_Yaroslav 0.001001
Ukraine UAH hryvna Volodymyr the Great M 1.0 Monarch No 1 NaN 1992 1015 NaN NaN False UAH_Great 0.000000

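This can be a very powerful operation, but it’s easy to get mixed up when labels are integers, as with the bill_value data. A .loc slice selects by label and includes its endpoint, whereas an .iloc slice selects by position and excludes its endpoint.

For example, this returns six elements: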

bill_value.loc[0:5]
0    100
1    100
2     50
3     20
4     10
5     50
Name: bill_value, dtype: int64

It is NOT the same as this, which returns only five elements because .iloc excludes the endpoint:

bill_value.iloc[0:5]
0    100
1    100
2     50
3     20
4     10
Name: bill_value, dtype: int64
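The difference is easier to see when the integer labels do not match the positions. The sketch below assumes a small made-up Series, s, with labels 2, 4, 6, and 8:

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from its positions.
s = pd.Series([100, 50, 20, 10], index=[2, 4, 6, 8])

# .loc slices by label and includes the endpoint: labels 2, 4, and 6.
by_label = s.loc[2:6]

# .iloc slices by position and excludes the endpoint. The Series has only
# four elements, so positions 2 through 5 yield just positions 2 and 3.
by_position = s.iloc[2:6]

print(by_label.tolist())
print(by_position.tolist())
```

The same slice, 2:6, selects different elements depending on whether it is interpreted as labels or as positions.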

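Recall that bracket notation selects columns in DataFrames. With a Series, the same notation indexes by label for single values and lists, much like .loc. Integer slices are an exception: they are interpreted by position, like .iloc, so the slice below returns five elements rather than six: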

bill_value[0:5]
0    100
1    100
2     50
3     20
4     10
Name: bill_value, dtype: int64

Finally, .iloc and .loc can be used in tandem by applying one to the result of the other. This is called chaining. Below, we use the country-indexed banknotes DataFrame to select all rows labeled “Peru”. Then, we select the second row from this subset.

banknotes.loc["Peru"].iloc[1]
currency_code                                PEN
currency_name                                Sol
name                     Raúl Porras Barrenechea
gender                                         M
bill_count                                   1.0
profession                            Politician
known_for_being_first                         No
current_bill_value                            20
prop_total_bills                             NaN
first_appearance_year                       1991
death_year                                  1960
comments                                     NaN
hover_text                                   NaN
has_portrait                               False
id                                      PEN_Raul
scaled_bill_value                       0.052632
Name: Peru, dtype: object
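Chaining works the same way on any DataFrame with repeated index labels. The following sketch uses a made-up three-row DataFrame (the names and values are placeholders) and also shows that .loc returns a single row as a Series when the label is unique:

```python
import pandas as pd

# Hypothetical toy DataFrame with a repeated index label.
df = pd.DataFrame(
    {"name": ["A", "B", "C"], "value": [100, 20, 5]},
    index=["Peru", "Peru", "Serbia"],
)

# A repeated label gives a DataFrame containing every matching row.
peru = df.loc["Peru"]

# Chaining .iloc picks the second of those rows by position.
second_peru = df.loc["Peru"].iloc[1]

# A unique label gives a single row, returned as a Series.
serbia = df.loc["Serbia"]

print(second_peru["name"], second_peru["value"])
```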

2.4.4. Indexing by a Condition#

The last way to index in Pandas is by condition. To do this, write a condition that evaluates to a Boolean Series or array, then use it as the index. Pandas keeps only the elements where the condition is True. This is arguably the most flexible method of indexing in Pandas.

For example, suppose you want to find bill values that are divisible by 25. You can use the modulo operator % to get the remainder when one positive integer is divided by another. So the condition to test for divisibility by 25 is:

bill_value % 25 == 0
0       True
1       True
2       True
3      False
4      False
       ...  
274    False
275    False
276     True
277     True
278     True
Name: bill_value, Length: 279, dtype: bool

The result is a Boolean Series with as many elements as bill_value. You can use this condition in .loc to get only the elements where the result was True:

bill_value.loc[bill_value % 25 == 0]
0      100
1      100
2       50
5       50
9       50
      ... 
269    200
271    100
276     50
277    100
278    200
Name: bill_value, Length: 194, dtype: int64

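You can also use square brackets [] without .loc to index by condition. The condition can be any expression that produces a Boolean Series. This one keeps the values greater than 105: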

bill_value[bill_value - 100 > 5]
15     1000
16      500
17      200
27      200
33      200
       ... 
263     200
265    2000
266     500
269     200
278     200
Name: bill_value, Length: 130, dtype: int64
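To make the mechanics concrete, here is a minimal sketch on a made-up five-element Series (values and mask are illustrative names). Because True counts as 1, summing the Boolean Series gives the number of matches:

```python
import pandas as pd

# Hypothetical toy data, not the banknotes dataset.
values = pd.Series([100, 20, 50, 10, 200])

# The condition is evaluated element by element.
mask = values % 25 == 0

# Summing a Boolean Series counts the True values.
n_matches = int(mask.sum())

# Brackets and .loc select the same elements for a Boolean mask.
with_brackets = values[mask]
with_loc = values.loc[mask]

print(mask.tolist())
print(n_matches)
print(with_brackets.tolist())
```

The selected elements keep their original index labels, which is why the output in the examples above skips some labels.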

With a DataFrame, indexing by condition gives you a subset of the rows:

banknotes[banknotes["currency_code"] == "MWK"]
currency_code currency_name name gender bill_count profession known_for_being_first current_bill_value prop_total_bills first_appearance_year death_year comments hover_text has_portrait id scaled_bill_value
country
Malawi MWK Kwacha Dr. Hastings Kamuzu Banda M 1.0 Head of Gov't Yes 1000 NaN 1971 1997 1st President and 1st Prime Minister of Malawi First President and Prime Minister of Malawi False MWK_Hastings 0.494949
Malawi MWK Kwacha Rose Lomathinda Chibambo F 1.0 Politician Yes 200 NaN 2012 2016 1st female minister in the independent Malawi ... First female minister in the independent Malaw... False MWK_Rose 0.090909
Malawi MWK Kwacha Inkosi ya Makhosi M'mbelwa II M 1.0 Monarch No 20 NaN 2012 1959 cannot find info about this person NaN False MWK_Mmbelwa 0.000000
Malawi MWK Kwacha Inkosi Ya Mokhosi Gomani II M 1.0 Monarch No 50 NaN 2012 1954 NaN NaN False MWK_Gomani 0.015152
Malawi MWK Kwacha Reverend John Chilembwe M 1.0 Revolutionary No 500 NaN 1997 1915 Known for organizing an uprising against the c... NaN False MWK_Chilembwe 0.242424
Malawi MWK Kwacha Reverend John Chilembwe M 1.0 Revolutionary No 2000 NaN 1997 1915 Known for organizing an uprising against the c... NaN False MWK_Chilembwe 1.000000
Malawi MWK Kwacha James Federick Sangala M 1.0 Politician No 100 NaN 2012 1974 NaN NaN False MWK_Sangala 0.040404

If you also want to select specific columns, use .loc with the condition as the row argument and the column name as the column argument:

banknotes.loc[banknotes["current_bill_value"] == 10.0, "currency_name"]
country
Argentina          Argentinian Peso
Australia         Australian Dollar
Australia         Australian Dollar
Bangladesh                     Taka
Bolivia                   Boliviano
Bolivia                   Boliviano
Bolivia                   Boliviano
Canada              Canadian Dollar
Canada              Canadian Dollar
Canada              Canadian Dollar
Canada              Canadian Dollar
Canada              Canadian Dollar
England                       pound
England                       pound
Georgia                        lari
Nigeria                       naira
New Zealand      New Zealand dollar
Peru                            Sol
Peru                            Sol
China                      Renminbi
Serbia                Serbian dinar
Tunisia              Tunisian dinar
Tunisia              Tunisian dinar
Turkey                         Lira
Turkey                         Lira
Ukraine                      hryvna
United States             US dollar
Venezuela        Venezuelan bolivar
South Africa                   rand
Name: currency_name, dtype: object

The column argument to .loc can also be a list of column names. Alternatively, you can select the columns with brackets first and then filter the rows:

cols = ["currency_code", "currency_name"]
banknotes[cols].loc[banknotes["current_bill_value"] == 10.0]
currency_code currency_name
country
Argentina ARS Argentinian Peso
Australia AUD Australian Dollar
Australia AUD Australian Dollar
Bangladesh BDT Taka
Bolivia BOB Boliviano
Bolivia BOB Boliviano
Bolivia BOB Boliviano
Canada CAD Canadian Dollar
Canada CAD Canadian Dollar
Canada CAD Canadian Dollar
Canada CAD Canadian Dollar
Canada CAD Canadian Dollar
England GBP pound
England GBP pound
Georgia GEL lari
Nigeria NGN naira
New Zealand NZD New Zealand dollar
Peru PEN Sol
Peru PEN Sol
China RMB Renminbi
Serbia RSD Serbian dinar
Tunisia TND Tunisian dinar
Tunisia TND Tunisian dinar
Turkey TRY Lira
Turkey TRY Lira
Ukraine UAH hryvna
United States USD US dollar
Venezuela VES Venezuelan bolivar
South Africa ZAR rand
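The two approaches give the same result. The sketch below checks this on a small made-up DataFrame, notes, whose rows are placeholders rather than the real banknotes data:

```python
import pandas as pd

# Hypothetical toy DataFrame modeled loosely on banknotes.
notes = pd.DataFrame(
    {
        "currency_code": ["ARS", "PEN", "PEN", "USD"],
        "currency_name": ["Argentinian Peso", "Sol", "Sol", "US dollar"],
        "current_bill_value": [10, 10, 100, 5],
    },
    index=["Argentina", "Peru", "Peru", "United States"],
)

is_ten = notes["current_bill_value"] == 10
cols = ["currency_code", "currency_name"]

# One step: rows and columns in a single .loc call.
one_step = notes.loc[is_ten, cols]

# Two steps: select the columns first, then filter the rows.
two_step = notes[cols].loc[is_ten]

print(one_step.equals(two_step))
```

The single .loc call is usually the easier form to read, since it states the row condition and the columns in one place.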

2.5. Special Values#

You may have noticed that some of the data in banknotes is missing. This is common, and it’s important to understand how to handle missing or invalid values.

Values can be missing or incomplete for many reasons, so Pandas provides a lot of flexibility for detecting and handling them.

Pandas generally treats these special values as missing and represents them with NumPy’s nan, a special floating-point value that displays as NaN. This erases some of the nuance between different kinds of missing data, but was seemingly done for computational performance reasons. For example, the comments and hover_text fields are missing in this row:

banknotes.iloc[-25]
currency_code                        USD
currency_name                  US dollar
name                     Abraham Lincoln
gender                                 M
bill_count                           1.0
profession                 Head of Gov't
known_for_being_first                 No
current_bill_value                     5
prop_total_bills                    0.06
first_appearance_year               1914
death_year                          1865
comments                             NaN
hover_text                           NaN
has_portrait                        True
id                           USD_Lincoln
scaled_bill_value               0.040404
Name: United States, dtype: object

2.5.1. Types of Values Considered Missing by Pandas#

In addition to np.nan (which displays as NaN), Pandas interprets several other values as missing. These include Python’s None object, as well as Pandas’ experimental NA value, pd.NA.

Python’s None represents the absence of a value. It often appears as the return value of a function that doesn’t explicitly return anything, as a placeholder for something that hasn’t been defined yet, or as a signal that something wasn’t found.

When creating a Series, we can pass this value:

pd.Series([None, "one", "two"])
0    None
1     one
2     two
dtype: object

Be aware that None is a Python object, so in the above example the data type of the Series became object. If we specify a numeric data type explicitly, Pandas converts None to NaN:

pd.Series([1.5, 2.0, 3, None], dtype="float")
0    1.5
1    2.0
2    3.0
3    NaN
dtype: float64
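The following sketch checks how Pandas treats each of the three values, using the top-level pd.isna function. It also shows one quirk worth knowing: NaN is not equal to itself, so comparing with == is not a reliable way to find missing values.

```python
import numpy as np
import pandas as pd

# pd.isna reports all three special values as missing.
candidates = [None, np.nan, pd.NA]
flags = [bool(pd.isna(x)) for x in candidates]

# NaN is never equal to anything, including itself.
nan_equals_itself = np.nan == np.nan

# In a numeric Series, None is stored as NaN.
s = pd.Series([1.5, None, np.nan])

print(flags)
print(nan_equals_itself)
print(s.isna().tolist())
```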

2.5.2. Reading in Missing Values from a CSV file#

An obvious source of missing or incomplete values is the data itself. When the data was collected, there may have been reasons to code some values as missing. For example, in a collection of survey responses, some questions may not have applied to every respondent.

Another example is a measurement that was not taken on some of the samples. There are no universal rules for how missing values are represented in a data set. However, there are several conventions, and Pandas is aware of many of them.

When reading data from a CSV file, Pandas automatically detects missing values. By default, it converts any empty cell, as well as strings such as ‘NA’, ‘nan’, ‘null’, ‘N/A’, and other variants, to NaN. A full list can be found in the Pandas documentation.
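As a small illustration, the sketch below reads a made-up CSV from a string instead of a file. The column names and the "missing" marker are hypothetical. The na_values parameter of pd.read_csv adds extra strings to the default list of missing-value markers:

```python
import io

import pandas as pd

# Hypothetical CSV text. "missing" is a made-up, data set specific marker.
csv_text = "name,score,city\nAda,90,NA\nBo,,Lima\nCy,missing,null\n"

# Defaults: the empty cell, "NA", and "null" become NaN, but "missing" does not.
default = pd.read_csv(io.StringIO(csv_text))

# na_values tells Pandas to treat "missing" as a missing value too.
custom = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])

print(default["score"].isna().tolist())
print(custom["score"].isna().tolist())
```

With the custom marker recognized, the score column contains only numbers and missing values, so Pandas can read it as a numeric column rather than as text.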

2.5.3. Detecting Missing Values#

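To detect missing values, Pandas provides two complementary methods: .isna and .notna.

For a quick summary, the .count method on a DataFrame returns the number of non-missing values in each column. Any column with a count below the number of rows (279 here) contains missing values: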

banknotes.count()
currency_code            279
currency_name            279
name                     279
gender                   279
bill_count               279
profession               279
known_for_being_first    279
current_bill_value       279
prop_total_bills          59
first_appearance_year    279
death_year               272
comments                 119
hover_text                89
has_portrait             279
id                       279
scaled_bill_value        278
dtype: int64

Take the hover_text column as an example. Only 89 of its 279 values are present:

ht = banknotes["hover_text"]
ht.count()
np.int64(89)

The return of .isna is a Boolean Series indicating which of the values are considered missing:

ht.isna()
country
Argentina        True
Argentina        True
Argentina        True
Argentina        True
Argentina       False
                ...  
South Africa    False
South Africa    False
South Africa    False
South Africa    False
South Africa    False
Name: hover_text, Length: 279, dtype: bool

The reverse, a Boolean Series indicating which values are not considered missing, is returned by .notna:

ht.notna()
country
Argentina       False
Argentina       False
Argentina       False
Argentina       False
Argentina        True
                ...  
South Africa     True
South Africa     True
South Africa     True
South Africa     True
South Africa     True
Name: hover_text, Length: 279, dtype: bool
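Because True counts as 1, summing the result of .isna gives the number of missing values. The sketch below uses a made-up DataFrame (column names and contents are illustrative) to show how this relates to .count:

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with a few missing values.
df = pd.DataFrame(
    {
        "comments": ["first", np.nan, None, "last"],
        "value": [1.0, 2.0, np.nan, 4.0],
    }
)

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Number of non-missing values in each column.
present_per_column = df.count()

# .notna is the element-wise negation of .isna.
negated = ~df["comments"].isna()

print(missing_per_column.tolist())
print(present_per_column.tolist())
```

For every column, the two counts add up to the number of rows.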

2.5.4. Replacing Missing Values#

The Boolean Series returned by .isna and .notna can be used to subset with .loc. For example, to keep only the values that aren’t missing:

ht.loc[ht.notna()]
country
Argentina                           Designed first Argentine flag
Australia       First Australian Aboriginal writer to be publi...
Australia       First person appointed Dame Commander of the B...
Australia       Founded Royal Flying Doctor Service, the world...
Australia       First Australian woman to serve as a member of...
                                      ...                        
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
Name: hover_text, Length: 89, dtype: object

Pandas also provides a shortcut with the .dropna method:

ht.dropna()
country
Argentina                           Designed first Argentine flag
Australia       First Australian Aboriginal writer to be publi...
Australia       First person appointed Dame Commander of the B...
Australia       Founded Royal Flying Doctor Service, the world...
Australia       First Australian woman to serve as a member of...
                                      ...                        
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
Name: hover_text, Length: 89, dtype: object

Another strategy is to fill the missing values with a placeholder, using the .fillna method. The method returns a new Series, so assign the result back to ht:

ht = ht.fillna(-1)
ht
country
Argentina                                                      -1
Argentina                                                      -1
Argentina                                                      -1
Argentina                                                      -1
Argentina                           Designed first Argentine flag
                                      ...                        
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
Name: hover_text, Length: 279, dtype: object

Additionally, the data set may have its own indicator for missing values, e.g., an empty string or 0. We can convert those to missing values using the .replace method. Here, the -1 placeholder stands in for such an indicator:

ht.replace(-1, np.nan)
country
Argentina                                                     NaN
Argentina                                                     NaN
Argentina                                                     NaN
Argentina                                                     NaN
Argentina                           Designed first Argentine flag
                                      ...                        
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
Name: hover_text, Length: 279, dtype: object
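The methods in this section all return new objects and leave the original unchanged unless you assign the result back. A minimal sketch, assuming a made-up Series of text values in which the empty string is the data set’s own missing-value indicator:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one true missing value and one empty-string marker.
s = pd.Series(["a", np.nan, "b", ""])

# Fill the true missing value with a placeholder string.
filled = s.fillna("unknown")

# Convert the empty-string marker into a proper missing value.
cleaned = s.replace("", np.nan)

# Drop every missing value from the cleaned Series.
dropped = cleaned.dropna()

print(filled.tolist())
print(cleaned.isna().tolist())
print(dropped.tolist())
```

A placeholder of the same type as the rest of the column, such as a string in a text column, keeps the column’s values consistent.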

2.6. Exercises#

2.6.1. Exercise#

Python’s range function offers another way to create a sequence of numbers. Read the help file for this function.

  1. Create an example range. How does this differ from a list?

  2. Describe the three arguments that you can use in range. Give examples of each.

  3. Convert one of those ranges to a list and print it to screen. What changes in the way Python represents this sequence?

2.6.2. Exercise#

Return to the discussion in Section 2.3.2.

  1. Why does "3" + 4 raise an error?

  2. Why does True - 1 return 0?

  3. Why does int(4.6) < 4.6 return True?

2.6.3. Exercise#

Use a search engine or consult Stack Overflow to figure out how to subset a DataFrame with multiple conditions.

  1. Create a new DataFrame from banknotes with the following conditions: current bill value is less than or equal to 20; gender is female; contains the columns country, name, comments, has_portrait

  2. Use a Pandas function to count the number of entries that have portraits. How many are there?

  3. Return the last available comment. What does it say?