2. Data Structures

The previous chapter introduced Python, providing enough background to do simple computations on data sets. This chapter focuses on the foundational knowledge and skills you’ll need to use Python effectively in the long term. Specifically, it begins with a deep dive into data structures and data types in Python and Pandas. Then, it explains how to use this knowledge during data analysis.

Learning Objectives

Create Pandas Series, NumPy arrays, lists, and tuples
Check the type and class of an object
Convert an object into a different type
Describe and differentiate None, NA, and NaN
Index sequences with empty, integer, string, and logical arguments
Negate or combine conditions with logic operations
Subset Series objects and DataFrames
Find and remove missing values in a DataFrame and Series

2.1. Setup

2.1.1. Packages

We will be working with two packages in this chapter, NumPy and Pandas. Start by using what you learned in Section 1.4.2 to load these packages with their conventional aliases:

import numpy as np
import pandas as pd

2.1.2. Data

Section 1 described how to use the Pandas package to load a tabular dataset into a DataFrame. As an example, you saw how to load the banknotes dataset. You’ll need that dataset for the examples in this chapter as well, so load a fresh copy of it:

banknotes = pd.read_csv("data/banknotes.csv")
banknotes.head()

	currency_code	country	currency_name	name	gender	bill_count	profession	known_for_being_first	current_bill_value	prop_total_bills	first_appearance_year	death_year	comments	hover_text	has_portrait	id	scaled_bill_value
0	ARS	Argentina	Argentinian Peso	Eva Perón	F	1.0	Activist	No	100	NaN	2012	1952	NaN	NaN	True	ARS_Evita	1.000000
1	ARS	Argentina	Argentinian Peso	Julio Argentino Roca	M	1.0	Head of Gov't	No	100	NaN	1988	1914	NaN	NaN	True	ARS_Argentino	1.000000
2	ARS	Argentina	Argentinian Peso	Domingo Faustino Sarmiento	M	1.0	Head of Gov't	No	50	NaN	1999	1888	NaN	NaN	True	ARS_Domingo	0.444444
3	ARS	Argentina	Argentinian Peso	Juan Manuel de Rosas	M	1.0	Politician	No	20	NaN	1992	1877	NaN	NaN	True	ARS_Rosas	0.111111
4	ARS	Argentina	Argentinian Peso	Manuel Belgrano	M	1.0	Founder	Yes	10	NaN	1970	1820	Came up with the first Argentine flag.	Designed first Argentine flag	True	ARS_Belgrano	0.000000

Now you’re ready for the chapter.

2.2. Containers for Data

A data structure is a collection of data organized in a particular way. In Python, data structures are also called containers, because they contain data. Containers make working with lots of data manageable and efficient. DataFrames, introduced in the previous chapter, are an example of a two-dimensional data structure. In this section, you’ll learn about several one-dimensional data structures that are fundamental to programming in Python.

2.2.1. Pandas Series

Recall that you can select a single column from a DataFrame with the square brackets [ ]. Select the current_bill_value column from the banknotes dataset:

banknotes["current_bill_value"]

    100
    100
     50
     20
     10
      ... 
   10
   20
   50
  100
  200
Name: current_bill_value, Length: 279, dtype: int64

Notice that Python prints the column differently from the banknotes DataFrame. For instance, the length of the column is displayed rather than the number of rows. This is a visual clue: the column is not a DataFrame. Instead, it’s a Pandas Series, a one-dimensional container for values. Every column in a DataFrame is a Series.

Series and DataFrames are the fundamental data structures for data analysis in Pandas, and they have many common features. As you’ll learn in this chapter, both allow you to:

summarize data
handle missing data
reshape and transform data
subset and filter data
merge and combine data

You’ll be working with the current_bill_value column for the next few examples, so go ahead and assign it to a variable:

bill_value = banknotes["current_bill_value"]

The values in a Series (and many other kinds of data structures) are called elements, and the length of a Series is the number of elements it contains. Series are ordered, which means the elements have specific positions. The 1st element in bill_value is 100, the 2nd element is again 100, the 3rd element is 50, the 4th is 20, and so on.

You can get the length of a Series and many other types of objects with Python’s built-in len function:

len(bill_value)

A Series can also contain metadata, extra information about its elements. Metadata can usually be accessed through attributes (see Section 1.2.5). Here are a few examples of metadata this Series contains (more on this later):

bill_value.name

'current_bill_value'

bill_value.shape

(279,)

Finally, notice that the elements of bill_value are all integers. For any given Series, the elements will usually all be the same qualitative type of data (integers, decimal numbers, strings, and so on). In other words, the elements are usually homogeneous. There are some exceptions, and you’ll learn more about element types in Section 2.3.
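To experiment with these ideas without the full dataset, here is a minimal sketch that builds a small Series by hand. The values and the names `sample` and `sample_value` are made up for illustration and are not part of the banknotes data.

```python
import pandas as pd

# A small, hand-made Series (hypothetical values, not the banknotes data)
sample = pd.Series([100, 100, 50, 20, 10], name="sample_value")

n_elements = len(sample)   # number of elements in the Series
shape = sample.shape       # a one-element tuple, since a Series is one-dimensional
label = sample.name        # metadata: the name of the Series

print(n_elements)   # 5
print(shape)        # (5,)
print(label)        # sample_value
```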

2.2.2. NumPy Arrays

Under the hood, Pandas Series are based on another data structure, the NumPy array (or ndarray). You can think of a NumPy array as a stripped-down Series: an ordered, one-dimensional container for values without the extra metadata and functionality.

Tip

Series tend to be a good choice for data analysis, while arrays tend to be a good choice for sophisticated mathematical computations (such as simulations).

Most examples in this reader use Series and DataFrames, but it will be pointed out anywhere it’s important to use a NumPy array.

You can convert a Series to an array with the .to_numpy method:

bill_value.to_numpy()

array([   100,    100,     50,     20,     10,     50,     10,     20,
           10,     50,    100,    100,     20,      5,     20,   1000,
          500,    200,    100,     50,     10,      5,      2,     50,
           50,     10,    100,    200,    100,     50,     10,     10,
           20,    200,     20,    200,    100,     20,      5,     50,
          100,     10,     10,     10,     10,     10,     20,   5000,
         1000,  10000,  20000,   2000,  50000,  20000,   2000,  10000,
       100000,   1000,  20000,   5000,  50000,   2000,  10000,  10000,
         5000,  20000,  50000,   2000,   1000,   2000,   5000,   1000,
          200,    500,   5000,   2000,   1000,    500,    200,    100,
          200,    200,    200,    500,   2000,    100,   2000,    100,
          100,     20,    500,      5,     50,     20,     10,      5,
           10,     20,     50,      5,      2,    200,      1,     10,
           20,     50,    100,    500, 100000,   5000, 100000,  10000,
        50000,  20000,  50000,  20000,   2000,   1000,   2000,  10000,
          200,    100,     50,     20,   2000,    500,  10000,   5000,
         1000,    500,   1000,    100,     50,    500,   1000,  10000,
         5000,   5000,    500,    200,     20,    100,     50,   1000,
         5000,   1000,  50000,  10000,   1000,    200,     20,     50,
          500,   2000,    100,    500,    500,   1000,   1000,   1000,
          500,     20,    200,    200,    100,    100,     20,    500,
         1000,     10,    100,      5,     20,    200,   1000,      5,
           50,    100,     10,     20,    100,     20,     10,     10,
           50,    100,    200,     50,    500,    100,    500,    200,
           20,     50,   1000,   1000,   1000,      1,      5,     10,
           20,     50,    100,   5000,   2000,    100,   1000,    500,
          200,     50,     10,     20,    200,    500,     20,    100,
           50,   1000, 100000,    200,    500,   1000,   5000,  10000,
        20000,  50000,     10,     20,     10,      5,      5,     10,
           20,     50,    100,    200,     10,      5,     50,     20,
          100,    200,     50,   1000,     20,    200,    100,    500,
           10,      5,      2,      1,     50,      2,      5,      1,
           20,    100,     10,   1000,    100,     20,    500,    200,
           50,   2000,    500,     50,      2,    200,     20,    100,
           10,      5,     10,     20,     50,    100,    200])

Conversely, you can convert an array into a Series with the pd.Series function. You’ll see some examples of this function later on.
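As a quick illustration of the round trip, the sketch below converts a small array to a Series and back. The values and the names `demo_array`, `demo_series`, and `round_trip` are assumptions chosen for this example.

```python
import numpy as np
import pandas as pd

demo_array = np.array([5, 10, 20])               # hypothetical values
demo_series = pd.Series(demo_array, name="demo") # array -> Series
round_trip = demo_series.to_numpy()              # Series -> array

print(type(demo_series))
print(type(round_trip))
print(round_trip.tolist())   # [5, 10, 20]
```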

2.2.3. Lists

Series and arrays are designed for data analysis and mathematical computations, respectively. In contrast, a list is a general-purpose one-dimensional container. Lists are built into Python, so they’re probably the most common kind of container, and you don’t need to load any modules in order to use them.

You can make a list by enclosing comma-separated values in square brackets [], like this:

x = [1, 2, 3]
x

[1, 2, 3]

Like a Series, a list is ordered, so it has a first element, second element, and so on up to the length of the list. You can get the length of a list with the len function:

len(x)

Lists can be empty:

[]

[]

You can convert many types of objects into lists with the list function:

list(bill_value)

Unlike a Series, the elements of a list can be qualitatively different. There is no expectation that they will be homogeneous. For instance, this list contains a number, string, and another list (with one element):

li = [8, "hello", [4.2]]
li

[8, 'hello', [4.2]]

2.2.4. Indexing

So far you’ve learned two ways to use square brackets []:

To select columns from a DataFrame, as in banknotes["country"]
To create lists, as in ["a", "b", 1]

The first case is an example of indexing, which means getting or setting elements of a container. The square brackets [] are Python’s indexing operator.

You can use indexing to get an element of a list based on the element’s position. Python uses zero-based indexing, which means the positions of elements are counted starting from 0 rather than 1. So the first element of a list is at position 0, the second is at position 1, and so on.

Note

Many programming languages use zero-based indexing. It may seem strange at first, but it makes some kinds of computations simpler by eliminating the need to add or subtract 1.

The indexing operator requires at least one argument, called the index, which goes inside of the square brackets []. The index says which elements you want to get. For DataFrames, you used column names as the index. For a list, you can use a position. So the code to get the first element of the list li is:

li[0]

Likewise, to get the third element:

li[2]

[4.2]

The same idea extends to containers stored inside of other containers. For example, to get the value stored in the list inside of li:

li[2][0]

4.2

You can set an element of a list by assigning a value at that index. So the code to change the first element of li to the string “hi” is:

li[0] = "hi"
li

['hi', 'hello', [4.2]]
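The sketch below collects the indexing operations from this section in one place and also shows what happens when a position is past the end of a list. The list `demo` and its contents are hypothetical.

```python
demo = ["a", "b", ["c", "d"]]   # hypothetical list for illustration

first = demo[0]        # position 0 is the first element
nested = demo[2][1]    # index the inner list, then take its second element

demo[1] = "B"          # set an element by assigning at its position

try:
    demo[3]            # the valid positions are 0, 1, and 2
except IndexError as err:
    message = str(err)

print(first)     # a
print(nested)    # d
print(demo)      # ['a', 'B', ['c', 'd']]
print(message)   # list index out of range
```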

2.2.5. References

Assigning elements of a container is not without complication. Suppose you assign a list to a variable x and then create a new variable, y, from x. If you change an element of y, it will also change x:

x = [1, 2]
y = x
y[0] = 10
x

[10, 2]

This happens because of how Python handles containers. When you create a container, Python stores it in your computer’s memory. If you then assign the container to a variable, the variable points, or refers, to the location of the container in memory. If you create a second variable from the first, both will refer to the same location. As a result, operations on one variable will affect the value of the other, because there’s really only one container in memory and both variables refer to it.

The example above uses lists, but other containers such as Series and DataFrames behave the same way. The variable bill_value is just a reference to a column in the banknotes DataFrame.

If you want to assign an independent copy of a container to a variable rather than a reference, you need to use a function or method to explicitly make a copy. Many containers have a .copy method that makes a copy:

x = [1, 2]
y = x.copy()
y[0] = 10
x

[1, 2]
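One caveat is worth a short sketch: for a list, the .copy method makes a shallow copy, so containers nested inside the list are still shared. The names below are hypothetical, and `copy.deepcopy` from the standard library is shown as one way to get a fully independent copy.

```python
import copy

outer = [1, [2, 3]]            # hypothetical nested list
alias = outer                  # a second reference to the same list
shallow = outer.copy()         # new outer list, but the inner list is shared
deep = copy.deepcopy(outer)    # fully independent copy, made before any changes

shallow[0] = 10                # replaces an element of the copy only
shallow[1][0] = 99             # changes the inner list that outer also holds

print(alias is outer)   # True
print(outer)            # [1, [99, 3]]
print(shallow)          # [10, [99, 3]]
print(deep)             # [1, [2, 3]]
```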

2.2.6. Tuples

References can be confusing, and if you know that the elements of a container shouldn’t change, one way to prevent problems is to use a tuple. Like a list, a tuple is a one-dimensional container. The key difference is that tuples are immutable: once you create a tuple, you cannot add, remove, or replace its elements.

You can make a tuple by enclosing comma-separated values in parentheses (), like this:

(1, 2)

(1, 2)

You can also convert another container into a tuple with the tuple function:

x = [1, 2]
y = x
x = tuple(x)
y[0] = 10
x

(1, 2)
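To see immutability in action, the sketch below tries to assign to an element of a tuple and catches the resulting error. It also shows that a one-element tuple needs a trailing comma, which explains the shape (279,) printed earlier. The names `point` and `single` are hypothetical.

```python
point = (1, 2)   # hypothetical tuple

try:
    point[0] = 10   # tuples do not support item assignment
except TypeError as err:
    print("TypeError:", err)

print(point)   # (1, 2)

# A tuple with a single element is written with a trailing comma
single = (1,)
print(type(single).__name__, len(single))   # tuple 1
```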

2.3. Data Types

Data can be categorized into different types based on sets of shared characteristics. For instance, statisticians tend to think about whether data are numeric or categorical:

numeric
- continuous (real or complex numbers)
- discrete (integers)
categorical
- nominal (categories with no ordering)
- ordinal (categories with some ordering)

Of course, other types of data, like graphs (networks) and natural language (books, speech, and so on), are also possible. Categorizing data this way is useful for reasoning about which methods to apply to which data.

Python and most other programming languages also categorize data by type. To check the type of an object in Python, use the built-in type function. Recall you used this function to check the type of the banknotes DataFrame in Section 1.7:

type(banknotes)

pandas.core.frame.DataFrame

Take a look at the types of a few other objects:

type(bill_value)

pandas.core.series.Series

type(bill_value[0])

numpy.int64

type("hi")

str

type(x)

tuple

Note

In Python 3, class is just another word for type. The type function returns the class of an object. Python also provides a class keyword to create your own classes. Creating classes is beyond the scope of this reader, but is explained in detail in most Python programming textbooks.

For Pandas Series and DataFrames, the type function returns the type of container, but doesn’t return any information about the types of the elements. The same is true for NumPy arrays.

Section 1.7.1 described one way to print the types of the elements in a Pandas object: by calling the .info method. In the printout, the element types are listed in the Dtype column:

banknotes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279 entries, 0 to 278
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   currency_code          279 non-null    object 
 1   country                279 non-null    object 
 2   currency_name          279 non-null    object 
 3   name                   279 non-null    object 
 4   gender                 279 non-null    object 
 5   bill_count             279 non-null    float64
 6   profession             279 non-null    object 
 7   known_for_being_first  279 non-null    object 
 8   current_bill_value     279 non-null    int64  
 9   prop_total_bills       59 non-null     float64
 10  first_appearance_year  279 non-null    int64  
 11  death_year             272 non-null    object 
 12  comments               119 non-null    object 
 13  hover_text             89 non-null     object 
 14  has_portrait           279 non-null    bool   
 15  id                     279 non-null    object 
 16  scaled_bill_value      278 non-null    float64
dtypes: bool(1), float64(3), int64(2), object(11)
memory usage: 35.3+ KB

The column label Dtype is short for “data type”. You can also access the element types for a DataFrame with the .dtypes attribute:

banknotes.dtypes

currency_code             object
country                   object
currency_name             object
name                      object
gender                    object
bill_count               float64
profession                object
known_for_being_first     object
current_bill_value         int64
prop_total_bills         float64
first_appearance_year      int64
death_year                object
comments                  object
hover_text                object
has_portrait                bool
id                        object
scaled_bill_value        float64
dtype: object

For a Series or NumPy array, you can instead use .dtype to get the element type:

bill_value.dtype

dtype('int64')

Some of the element types listed for the banknotes DataFrame are built into Python, while others are provided by Pandas and NumPy. At the expense of being more complicated, the Pandas/NumPy types tend to be more specific and consistent. They provide programmers with greater control over how data are stored in memory, which makes it possible to write more efficient code. For computations that generate or process a large amount of data, as is often the case in research computing, efficiency is a major concern.

Here’s a non-exhaustive table of data types that you’ll often encounter in data analysis:

Built-in	Pandas/NumPy	Example	Description
`bool`		`True`, `False`	Boolean values
`int`	`int32`, `int64`	`-8`, `0`, `42`	Whole numbers
`float`	`float32`, `float64`	`-2.1`, `0.5`	Decimal numbers
`complex`	`complex64`, `complex128`	`3j`, `1-2j`	Complex numbers
`str`		`"hi"`, `"2.1"`	Text strings
`datetime`	`datetime64`		Dates and times

For most of the built-in types, you can explicitly construct an object with that type by calling the function with the same name as the type. For instance, here’s a way to construct an integer (type int):

n = int(4)
type(n)

int

This example is a bit silly, since you could just write 4 instead of int(4) and you’d still get an integer:

n = 4
type(n)

int

That said, suppose you want to construct an integer from a decimal number:

n = int(4.67)
type(n)

int

Calling int forces the value to be an integer. The digits after the decimal point are removed rather than rounded, so n is 4, not 5.
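The difference between truncating and rounding is easiest to see side by side. This is a small sketch with made-up variable names, using only built-in functions and the standard math module.

```python
import math

truncated_pos = int(4.67)      # 4: the fractional part is dropped
truncated_neg = int(-4.67)     # -4: truncation moves toward zero
rounded = round(4.67)          # 5: rounds to the nearest integer
floored = math.floor(-4.67)    # -5: floor always rounds down

print(truncated_pos, truncated_neg, rounded, floored)   # 4 -4 5 -5
```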

Decimal numbers like 4.67 are better represented by a floating point number, or float. Use this when you need decimal precision of any kind:

n = 4.67
type(n)

float

4.67

Notice that the Pandas/NumPy types have the same names as the built-in types, but with a number appended to the end. Section 2.3.3 explains what those numbers mean.

2.3.1. Strings & The `object` Dtype

Strings (type str), which were introduced in Section 1.3, are a bit more complicated than Boolean values and numbers because they have many attributes and methods associated with them.

Recall that you can use double " or single ' quotes to construct a string:

"Hello, world!"

'Hello, world!'

In Pandas and NumPy, strings are usually associated with the object data type (printed as object or O). For example, look at the name column in the banknotes data:

banknotes["name"]

                     Eva Perón
          Julio Argentino Roca
    Domingo Faustino Sarmiento
          Juan Manuel de Rosas
               Manuel Belgrano
                  ...            
              Nelson Mandela
              Nelson Mandela
              Nelson Mandela
              Nelson Mandela
              Nelson Mandela
Name: name, Length: 279, dtype: object

The object data type is provided as a catch-all for non-numeric data types. For example, if you create a Series from several different types of data, Pandas will choose object as the element type:

mixed = pd.Series(["hi", 1, True])
mixed

    hi
     1
  True
dtype: object

The individual elements of an object Series retain their original data types:

type(mixed[0])

str

type(mixed[2])

bool

So one way to think about the object data type is as an invisible wrapper around each element’s original type. The Series can claim all of its elements are generic “objects”, but when you access an element the wrapper is peeled off and you get the original type.
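If you want to check every element at once rather than one at a time, one possible approach is to loop over the Series and collect the type names. The Series `mixed_demo` below is a hypothetical stand-in with the same contents as the example above.

```python
import pandas as pd

mixed_demo = pd.Series(["hi", 1, True])   # hypothetical mixed-type Series

# Collect the name of each element's original type
element_types = [type(value).__name__ for value in mixed_demo]

print(mixed_demo.dtype)   # object
print(element_types)      # ['str', 'int', 'bool']
```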

You’re most likely to encounter the object type when working with Series or arrays of strings. In that case, you can generally assume all of the elements are type str. If you’re ever unsure of the type of an element, you can always use type to check.

Note

NumPy doesn’t have a dedicated string type because the way strings are stored in memory is very different from the way numbers are stored. Since Pandas is based on NumPy, until recently Pandas didn’t have a dedicated string type either. So both use object as the element type for Series and arrays of strings.

As of Pandas 1.0, the developers have added an experimental string type so that users can distinguish Series of strings from Series of mixed types. Hopefully in the future the string type will become the main way to handle strings rather than an experimental feature.

2.3.2. Coercion & Conversion

Although bool, int, and float are different types, in most situations Python will automatically convert between them as needed. For example, you can multiply a floating point number by an integer and then add a Boolean value:

n = 3.1 * 2 + True
n

7.2

First, the integer 2 is converted to a floating point number and multiplied by 3.1, yielding 6.2. Then the Boolean True is converted to a floating point number and added to 6.2. In Python and most other programming languages, False corresponds to 0 and True corresponds to 1. Thus the result is 7.2, a floating point number:

type(n)

float

This automatic conversion of types is known as implicit coercion. Conversion proceeds from less general to more general types (bool to int to float), so that information is generally not lost.
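The direction of coercion can be checked by mixing two types at a time and inspecting the type of the result. This is a minimal sketch with hypothetical variable names.

```python
int_result = True + 1        # bool is promoted to int
float_result = 1 + 2.0       # int is promoted to float
complex_result = 2.0 + 3j    # float is promoted to complex

# Print each result next to the name of its type
for result in (int_result, float_result, complex_result):
    print(result, type(result).__name__)
```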

Implicit coercion usually only applies to numeric types (including Boolean values). Mixing other types will usually cause an error. For instance, you can’t add a number to a string:

"hi" + 1

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[44], line 1
----> 1 "hi" + 1

TypeError: can only concatenate str (not "int") to str

Implicit coercion also works for numeric Pandas/NumPy types. For example, you can multiply bill_value by 1.5 to get one and a half times each current value:

bill_value * 1.5

    150.0
    150.0
     75.0
     30.0
     15.0
       ...  
   15.0
   30.0
   75.0
  150.0
  300.0
Name: current_bill_value, Length: 279, dtype: float64

Notice that the dtype has changed from int64 to float64.

Type conversion is when you explicitly convert an object from one type to another. You already saw an example of this with the int function in Section 2.3. Here are a few more:

bool(0)

False

str(105)

'105'

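Python can even convert strings into numbers and Boolean values. Be careful with bool, though: it returns True for any non-empty string, so it does not interpret the text of the string: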

float("7.3")

7.3

bool("True")

True
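The sketch below shows how bool treats a few strings. It also includes `parse_bool`, a hypothetical helper (not part of Python or Pandas) that illustrates one way to interpret the text itself.

```python
print(bool("True"))    # True
print(bool("False"))   # True, because the string is not empty
print(bool(""))        # False, because the string is empty


def parse_bool(text):
    """Interpret the strings 'true' and 'false' (any case) as Booleans.

    Hypothetical helper for illustration only.
    """
    cleaned = text.strip().lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    raise ValueError(f"not a Boolean string: {text!r}")


print(parse_bool("False"))   # False
```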

Note however that such operations have to be logically sound. This will not work:

int("Hello world!")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[50], line 1
----> 1 int("Hello world!")

ValueError: invalid literal for int() with base 10: 'Hello world!'

For a Pandas Series or NumPy array, you can use the .astype method to convert the elements to a specific type. Pass the name of the target type to the method as a string. For example, here’s how to convert the bill_value elements to float64:

bill_value.astype("float64")

    100.0
    100.0
     50.0
     20.0
     10.0
       ...  
   10.0
   20.0
   50.0
  100.0
  200.0
Name: current_bill_value, Length: 279, dtype: float64
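The .astype method can also convert strings that look like numbers. A minimal sketch, assuming a small made-up Series named `codes`; note that .astype returns a new Series and leaves the original unchanged.

```python
import pandas as pd

codes = pd.Series(["1", "2", "3"])    # hypothetical Series of numeric strings

as_int = codes.astype("int64")        # strings -> 64-bit integers
as_float = as_int.astype("float64")   # integers -> 64-bit floats

print(as_int.dtype)       # int64
print(as_float.dtype)     # float64
print(as_int.tolist())    # [1, 2, 3]
```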

2.3.3. Bit Sizes & Memory

Recall the table of types from the beginning of this Section (Section 2.3). The names of the Pandas/NumPy types in the table all end in numbers such as 32 or 64. These numbers indicate the bit size, the number of bits of memory used to store a value of that type.

For example, a single value with type int64 uses 64 bits of memory. So the int64 Series bill_value uses about 64 bits per element, for a total of:

64 * len(bill_value) # bits
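NumPy can report these sizes directly through the itemsize and nbytes attributes of an array. The sketch below uses a made-up array of zeros with the same length as bill_value (279 elements) as a stand-in. A Series also stores an index, which is why the figure for bill_value is only approximate.

```python
import numpy as np

zeros = np.zeros(279, dtype="int64")    # hypothetical stand-in for bill_value

bits_per_element = zeros.itemsize * 8   # itemsize is measured in bytes
total_bits = bits_per_element * len(zeros)
total_bytes = zeros.nbytes              # the same total, in bytes

print(bits_per_element, total_bits, total_bytes)   # 64 17856 2232
```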

In contrast, Python’s built-in data types don’t specify how much memory they use:

type(3)

int

In fact, the amount of memory can vary depending on your computer’s hardware and operating system!

Why is bit size important? Your computer has a limited amount of memory, so it’s a good habit to only use what you need. The tradeoff is that using more memory allows you to use larger or more precise numbers.

For instance, a 64-bit integer can hold values between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807, whereas a 16-bit integer can only hold values between -32,768 and 32,767. Understanding this matters when you’re working with large numbers. Without that knowledge, you might try to store 32,768 as an int16 and cause an overflow: depending on the operation, the value either silently wraps around to a negative number or triggers an error.
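You can look up the limits of an integer type with NumPy's iinfo function. The sketch below also shows one way an overflow can occur: arithmetic on an int16 array wraps around silently. The array `values` is hypothetical.

```python
import numpy as np

limits = np.iinfo(np.int16)
print(limits.min, limits.max)   # -32768 32767

# Adding 1 to the largest int16 value in an array wraps around
values = np.array([32767], dtype=np.int16)   # hypothetical array
wrapped = values + np.int16(1)

print(wrapped.dtype)     # int16
print(wrapped.tolist())  # [-32768]
```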

The same holds for instances where you need a certain amount of precision in your data. For example, Python and NumPy can work with irrational numbers, like pi, but only as approximations. Your computer has to represent such numbers with a finite number of digits, so the precision of the type you choose will affect what pi “means” in your code:

np.pi

3.141592653589793
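To see how bit size affects precision, the sketch below stores pi as a 32-bit and a 64-bit floating point number and compares them. The variable names are hypothetical.

```python
import numpy as np

pi64 = np.float64(np.pi)   # 64-bit approximation of pi
pi32 = np.float32(np.pi)   # 32-bit approximation, with fewer significant digits

# Convert both to built-in floats to measure the gap between them
difference = abs(float(pi32) - float(pi64))

print(pi64)
print(pi32)
print(difference)   # small, but not zero
```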

Tip

You can also use bit sizes to estimate the amount of memory your data will require, as we did for the bill_value object. When a computation runs out of memory, an estimate of how much memory is necessary can help you understand whether to get better hardware or to change your computing strategy.

2.4. Indexing in Pandas

If you want to inspect the elements of a Series more closely, you can use indexing. Conceptually, indexing a Series is very similar to indexing a list or tuple, but Pandas offers additional ways to select and subset data via indexes.

2.4.1. What’s an Index?

A Pandas index is more than just a positional location. Indexes serve three important roles:

As metadata to provide additional context about a data set
As a way to explicitly and automatically align data
As a convenience for getting and setting subsets of data

The index of a Series is available via the .index attribute:

bill_value.index

RangeIndex(start=0, stop=279, step=1)

Index labels can be numbers, strings, dates, or other values. Pandas provides subclasses of Index for specific purposes, which you can read more about in the Pandas documentation.

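Indexes, like tuples, are immutable. The labels in an index cannot be changed in place. That said, names are metadata that can be changed. Every index has a name, and so does every Series. For example, the code below renames the bill_value Series: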

bill_value.name = "bill_value"
bill_value

    100
    100
     50
     20
     10
      ... 
   10
   20
   50
  100
  200
Name: bill_value, Length: 279, dtype: int64

The above also applies to DataFrames:

banknotes.index

RangeIndex(start=0, stop=279, step=1)

Oftentimes, an index is a range of numbers, but this can be changed. The code below uses the .set_index method to change the index of banknotes to country:

banknotes.set_index("country", inplace = True)
banknotes.index

Index(['Argentina', 'Argentina', 'Argentina', 'Argentina', 'Argentina',
       'Australia', 'Australia', 'Australia', 'Australia', 'Australia',
       ...
       'Venezuela', 'Venezuela', 'Venezuela', 'Venezuela', 'Venezuela',
       'South Africa', 'South Africa', 'South Africa', 'South Africa',
       'South Africa'],
      dtype='object', name='country', length=279)

The inplace argument instructs Pandas to modify banknotes directly instead of returning a new DataFrame, so that we don’t have to assign the result back to banknotes.
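For comparison, here is a sketch of the same operation without inplace, using a small made-up DataFrame named `demo_notes`. Calling .set_index this way returns a new DataFrame and leaves the original untouched, and .reset_index moves the index back into an ordinary column.

```python
import pandas as pd

# Hypothetical data for illustration
demo_notes = pd.DataFrame({
    "country": ["Peru", "Peru", "Serbia"],
    "value": [10, 20, 100],
})

indexed = demo_notes.set_index("country")   # returns a new DataFrame
restored = indexed.reset_index()            # index becomes a column again

print(indexed.index.name)         # country
print(list(demo_notes.columns))   # ['country', 'value']
print(list(restored.columns))     # ['country', 'value']
```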

2.4.2. Indexing by Position

Changing the index affects how you select elements. There are three main methods for accessing specific values in Pandas:

By integer position
By label/name
By a condition

To access elements in a Series by integer position, use .iloc:

bill_value.iloc[5]

np.int64(50)

You can also pass a list of positions to .iloc:

bill_value.iloc[[5, 15, 25, 35]]

     50
  1000
    10
   200
Name: bill_value, dtype: int64

Use a slice to select a range of elements. The syntax for a slice is start:stop:step, where each part is optional and the second colon can be omitted when there is no step. The element at position stop is not included. This syntax also applies to lists. For example:

bill_value.iloc[0:5]

  100
  100
   50
   20
   10
Name: bill_value, dtype: int64

Below, we use a slice to get every twentieth element in the Series:

bill_value.iloc[::20]

      100
      10
     100
   50000
     200
      2
    200
     20
    500
    100
     20
    500
    100
    100
Name: bill_value, dtype: int64

Slices also accept negative values, which count back from the end of a sequence. For instance:

bill_value.iloc[-5:]

   10
   20
   50
  100
  200
Name: bill_value, dtype: int64

The result is the same as if you had used the .tail method:

bill_value.tail()

   10
   20
   50
  100
  200
Name: bill_value, dtype: int64
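Since the slice syntax is the same for lists, you can practice it without Pandas. The sketch below uses a made-up list named `letters`.

```python
letters = ["a", "b", "c", "d", "e", "f", "g"]   # hypothetical list

first_three = letters[0:3]     # positions 0, 1, and 2 (stop is excluded)
every_other = letters[::2]     # every second element, starting at position 0
last_two = letters[-2:]        # the final two elements
backwards = letters[::-1]      # a negative step walks from the end to the start

print(first_three)   # ['a', 'b', 'c']
print(every_other)   # ['a', 'c', 'e', 'g']
print(last_two)      # ['f', 'g']
print(backwards)     # ['g', 'f', 'e', 'd', 'c', 'b', 'a']
```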

2.4.3. Indexing by Label

Use .loc to index a Series or DataFrame by label:

banknotes.loc["Peru"]

	currency_code	currency_name	name	gender	bill_count	profession	known_for_being_first	current_bill_value	prop_total_bills	first_appearance_year	death_year	comments	hover_text	has_portrait	id	scaled_bill_value
country
Peru	PEN	Sol	Jorge Basadre Grohmann	M	1.0	Politician	No	100	NaN	1991	1980	NaN	NaN	False	PEN_Basadre	0.473684
Peru	PEN	Sol	Raúl Porras Barrenechea	M	1.0	Politician	No	20	NaN	1991	1960	NaN	NaN	False	PEN_Raul	0.052632
Peru	PEN	Sol	María Isabel Granda y Larco	F	1.0	Musician	No	10	NaN	2021	1983	NaN	NaN	False	PEN_Granda	0.000000
Peru	PEN	Sol	José Abelardo Quiñones Gonzales	M	1.0	Military	No	10	NaN	1991	1941	NaN	NaN	False	PEN_Abelardo	0.000000
Peru	PEN	Sol	Abraham Valdelomar Pinto	M	1.0	Writer	No	50	NaN	1991	1919	NaN	NaN	False	PEN_Pinto	0.210526
Peru	PEN	Sol	Pedro Paulet	M	1.0	STEM	Yes	100	NaN	2021	1945	Alleged first person to build a liquid-propell...	Alleged first person to build a liquid-propell...	False	PEN_Paulet	0.473684
Peru	PEN	Sol	Santa Rosa de Lima	F	1.0	Religious figure	Yes	200	NaN	1995	1617	First catholic saint of the Americas	First Catholic saint of the Americas	False	PEN_Santa	1.000000

You can select specific columns as well:

banknotes.loc["Peru", "name"]

country
Peru             Jorge Basadre Grohmann
Peru            Raúl Porras Barrenechea
Peru        María Isabel Granda y Larco
Peru    José Abelardo Quiñones Gonzales
Peru           Abraham Valdelomar Pinto
Peru                       Pedro Paulet
Peru                 Santa Rosa de Lima
Name: name, dtype: object
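The same two patterns can be tried on a tiny example. The DataFrame `by_country` below is made up for illustration, with placeholder names in place of real people. When a label appears more than once in the index, .loc returns every matching row.

```python
import pandas as pd

# Hypothetical DataFrame indexed by country
by_country = pd.DataFrame(
    {"name": ["Person A", "Person B", "Person C"], "value": [10, 20, 100]},
    index=pd.Index(["Peru", "Peru", "Serbia"], name="country"),
)

peru_rows = by_country.loc["Peru"]            # all rows labeled "Peru" (a DataFrame)
peru_names = by_country.loc["Peru", "name"]   # one column of those rows (a Series)

print(peru_rows.shape)       # (2, 2)
print(peru_names.tolist())   # ['Person A', 'Person B']
```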

Just as with .iloc, it’s possible to pass sequences into .loc:

banknotes.loc[["Peru", "Serbia", "Ukraine"]]

	currency_code	currency_name	name	gender	bill_count	profession	known_for_being_first	current_bill_value	prop_total_bills	first_appearance_year	death_year	comments	hover_text	has_portrait	id	scaled_bill_value
country
Peru	PEN	Sol	Jorge Basadre Grohmann	M	1.0	Politician	No	100	NaN	1991	1980	NaN	NaN	False	PEN_Basadre	0.473684
Peru	PEN	Sol	Raúl Porras Barrenechea	M	1.0	Politician	No	20	NaN	1991	1960	NaN	NaN	False	PEN_Raul	0.052632
Peru	PEN	Sol	María Isabel Granda y Larco	F	1.0	Musician	No	10	NaN	2021	1983	NaN	NaN	False	PEN_Granda	0.000000
Peru	PEN	Sol	José Abelardo Quiñones Gonzales	M	1.0	Military	No	10	NaN	1991	1941	NaN	NaN	False	PEN_Abelardo	0.000000
Peru	PEN	Sol	Abraham Valdelomar Pinto	M	1.0	Writer	No	50	NaN	1991	1919	NaN	NaN	False	PEN_Pinto	0.210526
Peru	PEN	Sol	Pedro Paulet	M	1.0	STEM	Yes	100	NaN	2021	1945	Alleged first person to build a liquid-propell...	Alleged first person to build a liquid-propell...	False	PEN_Paulet	0.473684
Peru	PEN	Sol	Santa Rosa de Lima	F	1.0	Religious figure	Yes	200	NaN	1995	1617	First catholic saint of the Americas	First Catholic saint of the Americas	False	PEN_Santa	1.000000
Serbia	RSD	Serbian dinar	Slobodan Jovanovic	M	1.0	Head of Gov't	No	5000	NaN	2003	1958	writer, politician, diplomat and prime ministe...	NaN	True	RSD_Slobodan	1.000000
Serbia	RSD	Serbian dinar	Milutin Milanković	M	1.0	STEM	No	2000	NaN	2011	1958	NaN	NaN	True	RSD_Milutin	0.398798
Serbia	RSD	Serbian dinar	Nikola Tesla	M	1.0	STEM	No	100	NaN	2003	1943	NaN	NaN	True	RSD_Tesla	0.018036
Serbia	RSD	Serbian dinar	Dorde Vajfert	M	1.0	Other	No	1000	NaN	2003	1937	was governer of the national Bank of Serbia. e...	NaN	True	RSD_Dorde	0.198397
Serbia	RSD	Serbian dinar	Jovan Cvijic	M	1.0	STEM	No	500	NaN	2004	1927	NaN	NaN	True	RSD_Cvijic	0.098196
Serbia	RSD	Serbian dinar	Nadežda Petrović	F	1.0	Visual Artist	No	200	NaN	2005	1915	NaN	NaN	True	RSD_Petrovic	0.038076
Serbia	RSD	Serbian dinar	Stevan Stevanovic Mokranjac	M	1.0	Musician	Yes	50	NaN	2005	1914	Composer. was part of Serbia's first string qu...	Member of Serbia's first string quartet	True	RSD_Stevanovic	0.008016
Serbia	RSD	Serbian dinar	Vuk Stefanovic Karadžic	M	1.0	Writer	Yes	10	NaN	2006	1864	wrote the 1st dictionary in the reformed Serbi...	Wrote the first dictionary in the reformed Ser...	True	RSD_Vuk	0.000000
Serbia	RSD	Serbian dinar	Petar Petrovic Njegoš	M	1.0	Writer	No	20	NaN	2006	1851	wrote some of the most imporant Serbian litera...	NaN	True	RSD_Njegos	0.002004
Ukraine	UAH	hryvna	Mykhailo Hrusheskyi	M	1.0	Politician	No	50	NaN	1992	1934	also an author, president of Central Rada (Sov...	NaN	False	UAH_Mykhailo	0.049049
Ukraine	UAH	hryvna	Volodymyr Vernadskyi	M	1.0	STEM	No	1000	NaN	2019	1945	NaN	NaN	False	UAH_Vernadskyi	1.000000
Ukraine	UAH	hryvna	Ivan Franko	M	1.0	Writer	Yes	20	NaN	1992	1916	1st author of detective novels and modern poet...	First author of detective novels and modern po...	False	UAH_Franko	0.019019
Ukraine	UAH	hryvna	Lesya Ukrainka	F	1.0	Writer	No	200	NaN	2001	1913	NaN	NaN	False	UAH_Ukrainka	0.199199
Ukraine	UAH	hryvna	Taras Shevchenko	M	1.0	Writer	No	100	NaN	1992	1841	NaN	NaN	False	UAH_Taras	0.099099
Ukraine	UAH	hryvna	Hryhoriy Skovoroda	M	1.0	Writer	No	500	NaN	2006	1794	NaN	NaN	False	UAH_Skovoroda	0.499499
Ukraine	UAH	hryvna	Ivan Mazepa	M	1.0	Head of Gov't	No	10	NaN	1992	1709	elected "Hetman of Zaporizhian Host" . also se...	NaN	False	UAH_Mazepa	0.009009
Ukraine	UAH	hryvna	Bogdan Khmelnitsky	M	1.0	Head of Gov't	No	5	NaN	1992	1657	1st "hetman of ukraine", also military leader	NaN	False	UAH_Bogdon	0.004004
Ukraine	UAH	hryvna	Yaroslav the Wise	M	1.0	Monarch	Yes	2	NaN	1992	1054	1st christian prince of Kiev	First Christian Prince of Kiev	False	UAH_Yaroslav	0.001001
Ukraine	UAH	hryvna	Volodymyr the Great	M	1.0	Monarch	No	1	NaN	1992	1015	NaN	NaN	False	UAH_Great	0.000000

This can be a very powerful operation, but it’s easy to get mixed up when labels are integers, as with the bill_value data.

For example, this:

bill_value.loc[0:5]

  100
  100
   50
   20
   10
   50
Name: bill_value, dtype: int64

Is NOT the same as this:

bill_value.iloc[0:5]

  100
  100
   50
   20
   10
Name: bill_value, dtype: int64
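The difference is easier to see on a small Series whose integer labels do not start at zero. The following is a minimal sketch that uses a made-up Series (the labels and values are illustrative and not taken from the banknotes data):

```python
import pandas as pd

# Hypothetical Series with integer labels that differ from positions
s = pd.Series([100, 50, 20, 10], index=[10, 20, 30, 40])

# .loc slices by label and includes the end label
by_label = s.loc[10:30]

# .iloc slices by position and excludes the end position
by_position = s.iloc[0:2]

print(by_label.tolist())     # [100, 50, 20]
print(by_position.tolist())  # [100, 50]
```

A label slice includes both endpoints, while a position slice stops just before the end position. That is why .loc[0:5] returned six elements above and .iloc[0:5] returned five.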

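Recall that bracket notation selects columns in DataFrames. With a Series, bracket notation selects elements instead. Be careful with integer slices: a slice such as 0:5 in brackets is interpreted by position, like .iloc, so the result below has five elements rather than six: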

bill_value[0:5]

  100
  100
   50
   20
   10
Name: bill_value, dtype: int64

Finally, .iloc and .loc can be used in tandem. This is called chaining. Below, we use the country-indexed banknotes DataFrame to select all rows labeled “Peru”. Then, we select the second row from this subset.

banknotes.loc["Peru"].iloc[1]

currency_code                                PEN
currency_name                                Sol
name                     Raúl Porras Barrenechea
gender                                         M
bill_count                                   1.0
profession                            Politician
known_for_being_first                         No
current_bill_value                            20
prop_total_bills                             NaN
first_appearance_year                       1991
death_year                                  1960
comments                                     NaN
hover_text                                   NaN
has_portrait                               False
id                                      PEN_Raul
scaled_bill_value                       0.052632
Name: Peru, dtype: object
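To make the two steps of the chain explicit, here is a small sketch on a made-up DataFrame (the names and values are placeholders, not rows from banknotes). Each step is stored in its own variable so you can inspect the intermediate result:

```python
import pandas as pd

# Made-up data with a repeated label in the index
df = pd.DataFrame(
    {"name": ["A", "B", "C"], "value": [100, 20, 10]},
    index=["Peru", "Peru", "Serbia"],
)

# Step 1: .loc selects every row labeled "Peru" (a DataFrame with two rows)
peru = df.loc["Peru"]

# Step 2: .iloc selects the second of those rows by position (a Series)
second = peru.iloc[1]

print(len(peru))       # 2
print(second["name"])  # B
```

Splitting a chain into named steps like this can help when a chained expression returns something unexpected.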

2.4.4. Indexing by a Condition#

The last way to index in Pandas is by condition. Pandas evaluates the condition for every element, which produces a Boolean Series or array, and then keeps only the elements where the result is True. This is often the most flexible way to index in Pandas.

For example, suppose you want to find bill values that are divisible by 25. You can use the modulo operator % to get the remainder when one positive integer is divided by another. So the condition to test for divisibility by 25 is:

bill_value % 25 == 0

     True
     True
     True
    False
    False
       ...  
  False
  False
   True
   True
   True
Name: bill_value, Length: 279, dtype: bool

The result is a Boolean Series with as many elements as bill_value. You can use this condition in .loc to get only the elements where the result was True:

bill_value.loc[bill_value % 25 == 0]

    100
    100
     50
     50
     50
      ... 
  200
  100
   50
  100
  200
Name: bill_value, Length: 194, dtype: int64
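As a quick check on how a Boolean mask behaves, the sketch below repeats the same divisibility test on a short, made-up Series. Because True counts as 1 in arithmetic, summing the mask gives the number of elements that will be selected:

```python
import pandas as pd

# Made-up bill values for illustration
values = pd.Series([100, 20, 50, 10, 200])

# The condition produces one True/False per element
mask = values % 25 == 0

# True counts as 1, so the sum is the number of matches
n_matches = int(mask.sum())

# Indexing with the mask keeps only the True elements
selected = values.loc[mask]

print(mask.tolist())      # [True, False, True, False, True]
print(n_matches)          # 3
print(selected.tolist())  # [100, 50, 200]
```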

You can also use square brackets [] without .loc to index by condition:

bill_value[bill_value - 100 > 5]

   1000
    500
    200
    200
    200
       ... 
   200
  2000
   500
   200
   200
Name: bill_value, Length: 130, dtype: int64

With a DataFrame, indexing by condition gives you a subset of the rows:

banknotes[banknotes["currency_code"] == "MWK"]

	currency_code	currency_name	name	gender	bill_count	profession	known_for_being_first	current_bill_value	prop_total_bills	first_appearance_year	death_year	comments	hover_text	has_portrait	id	scaled_bill_value
country
Malawi	MWK	Kwacha	Dr. Hastings Kamuzu Banda	M	1.0	Head of Gov't	Yes	1000	NaN	1971	1997	1st President and 1st Prime Minister of Malawi	First President and Prime Minister of Malawi	False	MWK_Hastings	0.494949
Malawi	MWK	Kwacha	Rose Lomathinda Chibambo	F	1.0	Politician	Yes	200	NaN	2012	2016	1st female minister in the independent Malawi ...	First female minister in the independent Malaw...	False	MWK_Rose	0.090909
Malawi	MWK	Kwacha	Inkosi ya Makhosi M'mbelwa II	M	1.0	Monarch	No	20	NaN	2012	1959	cannot find info about this person	NaN	False	MWK_Mmbelwa	0.000000
Malawi	MWK	Kwacha	Inkosi Ya Mokhosi Gomani II	M	1.0	Monarch	No	50	NaN	2012	1954	NaN	NaN	False	MWK_Gomani	0.015152
Malawi	MWK	Kwacha	Reverend John Chilembwe	M	1.0	Revolutionary	No	500	NaN	1997	1915	Known for organizing an uprising against the c...	NaN	False	MWK_Chilembwe	0.242424
Malawi	MWK	Kwacha	Reverend John Chilembwe	M	1.0	Revolutionary	No	2000	NaN	1997	1915	Known for organizing an uprising against the c...	NaN	False	MWK_Chilembwe	1.000000
Malawi	MWK	Kwacha	James Federick Sangala	M	1.0	Politician	No	100	NaN	2012	1974	NaN	NaN	False	MWK_Sangala	0.040404

If you also want to select specific columns, use .loc with the condition first and the column label second:

banknotes.loc[banknotes["current_bill_value"] == 10.0, "currency_name"]

country
Argentina          Argentinian Peso
Australia         Australian Dollar
Australia         Australian Dollar
Bangladesh                     Taka
Bolivia                   Boliviano
Bolivia                   Boliviano
Bolivia                   Boliviano
Canada              Canadian Dollar
Canada              Canadian Dollar
Canada              Canadian Dollar
Canada              Canadian Dollar
Canada              Canadian Dollar
England                       pound
England                       pound
Georgia                        lari
Nigeria                       naira
New Zealand      New Zealand dollar
Peru                            Sol
Peru                            Sol
China                      Renminbi
Serbia                Serbian dinar
Tunisia              Tunisian dinar
Tunisia              Tunisian dinar
Turkey                         Lira
Turkey                         Lira
Ukraine                      hryvna
United States             US dollar
Venezuela        Venezuelan bolivar
South Africa                   rand
Name: currency_name, dtype: object

The .loc approach also accepts a list of columns. Alternatively, you can select the columns first and then filter the rows:

cols = ["currency_code", "currency_name"]
banknotes[cols].loc[banknotes["current_bill_value"] == 10.0]

	currency_code	currency_name
country
Argentina	ARS	Argentinian Peso
Australia	AUD	Australian Dollar
Australia	AUD	Australian Dollar
Bangladesh	BDT	Taka
Bolivia	BOB	Boliviano
Bolivia	BOB	Boliviano
Bolivia	BOB	Boliviano
Canada	CAD	Canadian Dollar
Canada	CAD	Canadian Dollar
Canada	CAD	Canadian Dollar
Canada	CAD	Canadian Dollar
Canada	CAD	Canadian Dollar
England	GBP	pound
England	GBP	pound
Georgia	GEL	lari
Nigeria	NGN	naira
New Zealand	NZD	New Zealand dollar
Peru	PEN	Sol
Peru	PEN	Sol
China	RMB	Renminbi
Serbia	RSD	Serbian dinar
Tunisia	TND	Tunisian dinar
Tunisia	TND	Tunisian dinar
Turkey	TRY	Lira
Turkey	TRY	Lira
Ukraine	UAH	hryvna
United States	USD	US dollar
Venezuela	VES	Venezuelan bolivar
South Africa	ZAR	rand
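The two approaches should give the same result. The sketch below checks this on a small, made-up DataFrame (the bill values are invented for illustration) using the DataFrame .equals method:

```python
import pandas as pd

# Small made-up table; the bill values are illustrative only
df = pd.DataFrame({
    "currency_code": ["ARS", "AUD", "PEN"],
    "currency_name": ["Argentinian Peso", "Australian Dollar", "Sol"],
    "current_bill_value": [10, 50, 10],
})
cols = ["currency_code", "currency_name"]
mask = df["current_bill_value"] == 10

# One step: rows and columns selected together
one_step = df.loc[mask, cols]

# Two steps: columns first, then rows
two_step = df[cols].loc[mask]

print(one_step.equals(two_step))  # True
```

The one-step form is usually the safer habit, since it does the whole selection in a single indexing operation.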

2.5. Special Values#

You may have noticed that some of the data in banknotes is missing. This is common, and it’s important to understand how to handle missing or invalid values.

There are many reasons that could cause these values to be missing or incomplete, and as a result, Pandas provides lots of flexibility for detecting and handling these values.

In Pandas, these special values are generally treated as missing values in the dataset, and are represented by NumPy’s nan value (np.nan), which is a floating-point number. This reduces some of the nuance of data values and types, but was seemingly done for computational performance reasons. For example, the row below has NaN in its comments and hover_text columns:

banknotes.iloc[-25]

currency_code                        USD
currency_name                  US dollar
name                     Abraham Lincoln
gender                                 M
bill_count                           1.0
profession                 Head of Gov't
known_for_being_first                 No
current_bill_value                     5
prop_total_bills                    0.06
first_appearance_year               1914
death_year                          1865
comments                             NaN
hover_text                           NaN
has_portrait                        True
id                           USD_Lincoln
scaled_bill_value               0.040404
Name: United States, dtype: object

2.5.1. Types of Values Considered Missing by Pandas#

In addition to np.nan (which displays as NaN), Pandas interprets several other values as missing. These include Python’s None object, as well as Pandas’ experimental NA value (pd.NA).

Python’s None represents something that has no value. It often appears as the return value of a function that has nothing to return, as a placeholder for something that hasn’t been defined yet, or as the result of a search that found nothing.

When creating a Series, we can pass this value:

pd.Series([None, "one", "two"])

  None
   one
   two
dtype: object

Be aware that None is a Python object, and in the above example, the data type of the Series became ‘object’. If we specify a data type explicitly, then Pandas will convert None to the missing value representation for that type. For a float Series, that is NaN:

pd.Series([1.5, 2.0, 3, None], dtype="float")

  1.5
  2.0
  3.0
  NaN
dtype: float64
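The following sketch checks how Pandas treats each of these values. It uses the top-level function pd.isna, which accepts a single value as well as a Series:

```python
import numpy as np
import pandas as pd

# All three values are treated as missing by pd.isna
print(pd.isna(None))    # True
print(pd.isna(np.nan))  # True
print(pd.isna(pd.NA))   # True

# NaN is a float and never compares equal to itself,
# so use pd.isna rather than == to test for it
print(np.nan == np.nan)  # False

# In a float Series, None is stored as NaN
s = pd.Series([1.5, None], dtype="float")
print(s.isna().tolist())  # [False, True]
```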

2.5.2. Reading in Missing Values from a CSV file#

An obvious source of missing or incomplete values is the data itself. When the data was collected, there may have been reasons to code missing data. For example, in collection of survey responses, there may be times where the answer was not applicable.

Another example would be if a measurement was not taken on some of the samples. There are no universal rules for how missing values are represented in a data set. However, there are several conventions, and Pandas is aware of many of them.

When reading data from a CSV file, Pandas will automatically detect missing values. By default, it converts empty cells and strings such as ‘NA’, ‘NaN’, ‘nan’, ‘null’, ‘N/A’, and other variants to NaN. A full list can be found in the Pandas documentation.
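Here is a minimal sketch of this behavior. It reads made-up CSV text from memory with io.StringIO instead of a file, and the marker “missing” is a hypothetical custom code for missing data. The na_values parameter of pd.read_csv adds extra strings to the default list:

```python
import io
import pandas as pd

# Made-up CSV text; "missing" is a hypothetical custom marker
text = "id,score,note\n1,3.5,ok\n2,,N/A\n3,NA,missing\n"

# Default behavior: empty cells, "NA", and "N/A" become NaN
default = pd.read_csv(io.StringIO(text))
print(int(default["score"].isna().sum()))  # 2
print(int(default["note"].isna().sum()))   # 1

# na_values adds extra strings to treat as missing
custom = pd.read_csv(io.StringIO(text), na_values=["missing"])
print(int(custom["note"].isna().sum()))    # 2
```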

2.5.3. Detecting Missing Values#

To detect missing values, Pandas provides two complementary methods: .isna and .notna.

Before looking at them, we can get an overview with the .count method on DataFrames, which returns the number of non-missing values in each column:

banknotes.count()

currency_code            279
currency_name            279
name                     279
gender                   279
bill_count               279
profession               279
known_for_being_first    279
current_bill_value       279
prop_total_bills          59
first_appearance_year    279
death_year               272
comments                 119
hover_text                89
has_portrait             279
id                       279
scaled_bill_value        278
dtype: int64

Let’s take a closer look at the hover_text column of the DataFrame. Only 89 of its 279 values are present:

ht = banknotes["hover_text"]
ht.count()

np.int64(89)

The .isna method returns a Boolean Series indicating which of the values are considered missing:

ht.isna()

country
Argentina        True
Argentina        True
Argentina        True
Argentina        True
Argentina       False
                ...  
South Africa    False
South Africa    False
South Africa    False
South Africa    False
South Africa    False
Name: hover_text, Length: 279, dtype: bool

The reverse—to see which values are not considered missing—is returned with .notna:

ht.notna()

country
Argentina       False
Argentina       False
Argentina       False
Argentina       False
Argentina        True
                ...  
South Africa     True
South Africa     True
South Africa     True
South Africa     True
South Africa     True
Name: hover_text, Length: 279, dtype: bool
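Since .isna returns Booleans, you can sum the result to count missing values directly. The sketch below does this on a small, made-up DataFrame and shows how the result relates to .count:

```python
import numpy as np
import pandas as pd

# Made-up data with two missing years
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "death_year": [1952.0, np.nan, 1888.0, np.nan],
})

n_missing = df.isna().sum()   # missing values per column
n_present = df.count()        # non-missing values per column

print(n_missing["death_year"])  # 2
print(n_present["death_year"])  # 2

# The two counts always add up to the number of rows
print(bool(((n_missing + n_present) == len(df)).all()))  # True
```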

2.5.4. Replacing Missing Values#

We can use the Boolean Series returned by .notna to subset with .loc. For example, to keep only the values that aren’t missing:

ht.loc[ht.notna()]

country
Argentina                           Designed first Argentine flag
Australia       First Australian Aboriginal writer to be publi...
Australia       First person appointed Dame Commander of the B...
Australia       Founded Royal Flying Doctor Service, the world...
Australia       First Australian woman to serve as a member of...
                                      ...                        
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
Name: hover_text, Length: 89, dtype: object

Pandas also provides a shortcut with the .dropna method:

ht.dropna()

country
Argentina                           Designed first Argentine flag
Australia       First Australian Aboriginal writer to be publi...
Australia       First person appointed Dame Commander of the B...
Australia       Founded Royal Flying Doctor Service, the world...
Australia       First Australian woman to serve as a member of...
                                      ...                        
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
Name: hover_text, Length: 89, dtype: object
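The .dropna method also works on DataFrames, where it drops whole rows. The sketch below uses a made-up DataFrame to compare the default behavior with the subset parameter, which limits the check to the listed columns:

```python
import numpy as np
import pandas as pd

# Made-up data with missing values in two columns
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "comments": [np.nan, "first", np.nan],
    "hover_text": ["x", np.nan, np.nan],
})

# Drop rows that have a missing value in any column
any_missing = df.dropna()

# Drop rows only when hover_text is missing
subset_only = df.dropna(subset=["hover_text"])

print(len(any_missing))              # 0
print(subset_only["name"].tolist())  # ['A']
```

With wide tables like banknotes, the default can remove far more rows than intended, so it may be worth naming the columns you care about.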

Another strategy may be to fill the missing values. We could do so using the .fillna method:

ht = ht.fillna(-1)
ht

country
Argentina                                                      -1
Argentina                                                      -1
Argentina                                                      -1
Argentina                                                      -1
Argentina                           Designed first Argentine flag
                                      ...                        
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
Name: hover_text, Length: 279, dtype: object

Additionally, the data set may have its own indicator for missing values, e.g., an empty string or 0. We can convert such values to missing with the .replace method. Here, the indicator is the -1 filled in above:

ht.replace(-1, np.nan)

country
Argentina                                                     NaN
Argentina                                                     NaN
Argentina                                                     NaN
Argentina                                                     NaN
Argentina                           Designed first Argentine flag
                                      ...                        
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
South Africa    First Black state leader and first President o...
Name: hover_text, Length: 279, dtype: object
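The same method can handle several indicators at once. In the sketch below, the empty string and the word “unknown” are hypothetical placeholders for missing values in a made-up Series:

```python
import numpy as np
import pandas as pd

# Made-up column where "" and "unknown" stand in for missing values
s = pd.Series(["Sol", "", "unknown", "Lira"])

# Passing a list replaces every listed value with NaN
cleaned = s.replace(["", "unknown"], np.nan)

print(cleaned.isna().tolist())    # [False, True, True, False]
print(cleaned.dropna().tolist())  # ['Sol', 'Lira']
```

Once the indicators are converted, methods such as .isna, .dropna, and .fillna treat them like any other missing value.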

2.6. Exercises#

2.6.1. Exercise#

Python’s range function offers another way to create a sequence of numbers. Read the help file for this function.

Create an example range. How does this differ from a list?
Describe the three arguments that you can use in range. Give examples of each.
Convert one of those ranges to a list and print it to screen. What changes in the way Python represents this sequence?

2.6.2. Exercise#

Return to the discussion in Section 2.3.2.

Why does "3" + 4 raise an error?
Why does True - 1 return 0?
Why does int(4.6) < 4.6 return True?

2.6.3. Exercise#

Use a search engine or consult Stack Overflow to figure out how to subset a DataFrame with multiple conditions.

Create a new DataFrame from banknotes with the following conditions: current bill value is less than or equal to 20; gender is female; contains the columns country, name, comments, has_portrait
Use a Pandas function to count the number of entries that have portraits. How many are there?
Return the last available comment. What does it say?
