2. Data Structures#
The previous chapter introduced Python, providing enough background to do simple computations on data sets. This chapter focuses on the foundational knowledge and skills you’ll need to use Python effectively in the long term. Specifically, it begins with a deep dive into data structures and data types in Python and Pandas. Then, it explains how to use this knowledge during data analysis.
Learning Objectives
Create Pandas Series, NumPy arrays, lists, and tuples
Check the type and class of an object
Convert an object into a different type
Describe and differentiate None, NA, and NaN
Index sequences with empty, integer, string, and logical arguments
Negate or combine conditions with logic operations
Subset Series objects and DataFrames
Find and remove missing values in a DataFrame and Series
2.1. Setup#
2.1.1. Packages#
We will be working with two packages in this chapter, NumPy and Pandas. Start by using what you learned in Section 1.4.3 to load these packages with their conventional aliases:
import numpy as np
import pandas as pd
2.1.2. Data#
Section 1 described how to use the Pandas package to load a tabular dataset into a DataFrame. As an example, you saw how to load the banknotes dataset. You’ll need that dataset for the examples in this chapter as well, so load a fresh copy of it:
banknotes = pd.read_csv("data/banknotes.csv")
banknotes.head()
currency_code | country | currency_name | name | gender | bill_count | profession | known_for_being_first | current_bill_value | prop_total_bills | first_appearance_year | death_year | comments | hover_text | has_portrait | id | scaled_bill_value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ARS | Argentina | Argentinian Peso | Eva Perón | F | 1.0 | Activist | No | 100 | NaN | 2012 | 1952 | NaN | NaN | True | ARS_Evita | 1.000000 |
1 | ARS | Argentina | Argentinian Peso | Julio Argentino Roca | M | 1.0 | Head of Gov't | No | 100 | NaN | 1988 | 1914 | NaN | NaN | True | ARS_Argentino | 1.000000 |
2 | ARS | Argentina | Argentinian Peso | Domingo Faustino Sarmiento | M | 1.0 | Head of Gov't | No | 50 | NaN | 1999 | 1888 | NaN | NaN | True | ARS_Domingo | 0.444444 |
3 | ARS | Argentina | Argentinian Peso | Juan Manuel de Rosas | M | 1.0 | Politician | No | 20 | NaN | 1992 | 1877 | NaN | NaN | True | ARS_Rosas | 0.111111 |
4 | ARS | Argentina | Argentinian Peso | Manuel Belgrano | M | 1.0 | Founder | Yes | 10 | NaN | 1970 | 1820 | Came up with the first Argentine flag. | Designed first Argentine flag | True | ARS_Belgrano | 0.000000 |
Now you’re ready for the chapter.
2.2. Containers for Data#
A data structure is a collection of data organized in a particular way. In Python, data structures are also called containers, because they contain data. Containers make working with lots of data manageable and efficient. DataFrames, introduced in the previous chapter, are an example of a two-dimensional data structure. In this section, you’ll learn about several one-dimensional data structures that are fundamental to programming in Python.
2.2.1. Pandas Series#
Recall that you can select a single column from a DataFrame with the square brackets [ ]. Select the current_bill_value column from the banknotes dataset:
banknotes["current_bill_value"]
0 100
1 100
2 50
3 20
4 10
...
274 10
275 20
276 50
277 100
278 200
Name: current_bill_value, Length: 279, dtype: int64
Notice that Python prints the column differently from the banknotes
DataFrame. For instance, the length of the column is displayed rather than the
number of rows. This is a visual clue: the column is not a DataFrame. Instead,
it’s a Pandas Series, a one-dimensional container for values. Every column
in a DataFrame is a Series.
Series and DataFrames are the fundamental data structures for data analysis in Pandas, and they have many common features. As you’ll learn in this chapter, both allow you to:
summarize data
handle missing data
reshape and transform data
subset and filter data
merge and combine data
You’ll be working with the current_bill_value
column for the next few
examples, so go ahead and assign it to a variable:
bill_value = banknotes["current_bill_value"]
The values in a Series (and many other kinds of data structures) are called elements, and the length of a Series is the number of elements it contains. Series are ordered, which means the elements have specific positions. The 1st element in bill_value is 100, the 2nd element is again 100, the 3rd element is 50, the 4th is 20, and so on.
You can get the length of a Series and many other types of objects with
Python’s built-in len
function:
len(bill_value)
279
A Series can also contain metadata, extra information about its elements. Metadata can usually be accessed through attributes (see Section 1.2.5). Here are a few examples of metadata this Series contains (more on this later):
bill_value.name
'current_bill_value'
bill_value.shape
(279,)
Finally, notice that the elements of bill_value
are all integers. For any
given Series, the elements will usually all be the same qualitative type of
data (integers, decimal numbers, strings, and so on). In other words, the
elements are usually homogeneous. There are some exceptions, and you’ll
learn more about element types in Section 2.3.
2.2.2. NumPy Arrays#
Under the hood, Pandas Series are based on another data structure, the NumPy array (or ndarray). You can think of a NumPy array as a stripped-down Series: an ordered, one-dimensional container for values without the extra metadata and functionality.
Tip
Series tend to be a good choice for data analysis, while arrays tend to be a good choice for sophisticated mathematical computations (such as simulations).
Most examples in this reader use Series and DataFrames; the text points out the places where it’s important to use a NumPy array instead.
You can convert a Series to an array with the .to_numpy
method:
bill_value.to_numpy()
array([ 100, 100, 50, 20, 10, 50, 10, 20,
10, 50, 100, 100, 20, 5, 20, 1000,
500, 200, 100, 50, 10, 5, 2, 50,
50, 10, 100, 200, 100, 50, 10, 10,
20, 200, 20, 200, 100, 20, 5, 50,
100, 10, 10, 10, 10, 10, 20, 5000,
1000, 10000, 20000, 2000, 50000, 20000, 2000, 10000,
100000, 1000, 20000, 5000, 50000, 2000, 10000, 10000,
5000, 20000, 50000, 2000, 1000, 2000, 5000, 1000,
200, 500, 5000, 2000, 1000, 500, 200, 100,
200, 200, 200, 500, 2000, 100, 2000, 100,
100, 20, 500, 5, 50, 20, 10, 5,
10, 20, 50, 5, 2, 200, 1, 10,
20, 50, 100, 500, 100000, 5000, 100000, 10000,
50000, 20000, 50000, 20000, 2000, 1000, 2000, 10000,
200, 100, 50, 20, 2000, 500, 10000, 5000,
1000, 500, 1000, 100, 50, 500, 1000, 10000,
5000, 5000, 500, 200, 20, 100, 50, 1000,
5000, 1000, 50000, 10000, 1000, 200, 20, 50,
500, 2000, 100, 500, 500, 1000, 1000, 1000,
500, 20, 200, 200, 100, 100, 20, 500,
1000, 10, 100, 5, 20, 200, 1000, 5,
50, 100, 10, 20, 100, 20, 10, 10,
50, 100, 200, 50, 500, 100, 500, 200,
20, 50, 1000, 1000, 1000, 1, 5, 10,
20, 50, 100, 5000, 2000, 100, 1000, 500,
200, 50, 10, 20, 200, 500, 20, 100,
50, 1000, 100000, 200, 500, 1000, 5000, 10000,
20000, 50000, 10, 20, 10, 5, 5, 10,
20, 50, 100, 200, 10, 5, 50, 20,
100, 200, 50, 1000, 20, 200, 100, 500,
10, 5, 2, 1, 50, 2, 5, 1,
20, 100, 10, 1000, 100, 20, 500, 200,
50, 2000, 500, 50, 2, 200, 20, 100,
10, 5, 10, 20, 50, 100, 200])
Conversely, you can convert an array into a Series with the pd.Series
function. You’ll see some examples of this function later on.
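As a quick illustration of converting in both directions, here is a minimal sketch. The array values and the name example_value are made up for this example and are not part of the banknotes data:

```python
import numpy as np
import pandas as pd

# A small stand-in array; the values are arbitrary examples.
values = np.array([100, 50, 20], dtype="int64")

# Wrap the array in a Series and give it a (hypothetical) name.
s = pd.Series(values, name="example_value")

# Converting back yields a NumPy array with the same elements.
arr = s.to_numpy()

print(type(s))
print(type(arr))
print(len(s), s.name, s.dtype)
```

The Series carries the extra metadata (a name and an index), while the array holds only the elements.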
2.2.3. Lists#
Series and arrays are designed for data analysis and mathematical computations, respectively. In contrast, a list is a general-purpose one-dimensional container. Lists are built into Python, so they’re probably the most common kind of container, and you don’t need to load any modules in order to use them.
You can make a list by enclosing comma-separated values in square brackets [], like this:
x = [1, 2, 3]
x
[1, 2, 3]
Like a Series, a list is ordered, so it has a first element, second element,
and so on up to the length of the list. You can get the length of a list with
the len
function:
len(x)
3
Lists can be empty:
[]
[]
You can convert many types of objects into lists with the list function:
list(bill_value)
[100,
100,
50,
20,
10,
50,
10,
20,
10,
50,
100,
100,
20,
5,
20,
1000,
500,
200,
100,
50,
10,
5,
2,
50,
50,
10,
100,
200,
100,
50,
10,
10,
20,
200,
20,
200,
100,
20,
5,
50,
100,
10,
10,
10,
10,
10,
20,
5000,
1000,
10000,
20000,
2000,
50000,
20000,
2000,
10000,
100000,
1000,
20000,
5000,
50000,
2000,
10000,
10000,
5000,
20000,
50000,
2000,
1000,
2000,
5000,
1000,
200,
500,
5000,
2000,
1000,
500,
200,
100,
200,
200,
200,
500,
2000,
100,
2000,
100,
100,
20,
500,
5,
50,
20,
10,
5,
10,
20,
50,
5,
2,
200,
1,
10,
20,
50,
100,
500,
100000,
5000,
100000,
10000,
50000,
20000,
50000,
20000,
2000,
1000,
2000,
10000,
200,
100,
50,
20,
2000,
500,
10000,
5000,
1000,
500,
1000,
100,
50,
500,
1000,
10000,
5000,
5000,
500,
200,
20,
100,
50,
1000,
5000,
1000,
50000,
10000,
1000,
200,
20,
50,
500,
2000,
100,
500,
500,
1000,
1000,
1000,
500,
20,
200,
200,
100,
100,
20,
500,
1000,
10,
100,
5,
20,
200,
1000,
5,
50,
100,
10,
20,
100,
20,
10,
10,
50,
100,
200,
50,
500,
100,
500,
200,
20,
50,
1000,
1000,
1000,
1,
5,
10,
20,
50,
100,
5000,
2000,
100,
1000,
500,
200,
50,
10,
20,
200,
500,
20,
100,
50,
1000,
100000,
200,
500,
1000,
5000,
10000,
20000,
50000,
10,
20,
10,
5,
5,
10,
20,
50,
100,
200,
10,
5,
50,
20,
100,
200,
50,
1000,
20,
200,
100,
500,
10,
5,
2,
1,
50,
2,
5,
1,
20,
100,
10,
1000,
100,
20,
500,
200,
50,
2000,
500,
50,
2,
200,
20,
100,
10,
5,
10,
20,
50,
100,
200]
Unlike a Series, the elements of a list can be qualitatively different. There is no expectation that they will be homogeneous. For instance, this list contains a number, string, and another list (with one element):
li = [8, "hello", [4.2]]
li
[8, 'hello', [4.2]]
2.2.4. Indexing#
So far you’ve learned two ways to use square brackets []:
To select columns from a DataFrame, as in banknotes["country"]
To create lists, as in ["a", "b", 1]
The first case is an example of indexing, which means getting or setting
elements of a container. The square brackets []
are Python’s indexing
operator.
You can use indexing to get an element of a list based on the element’s position. Python uses zero-based indexing, which means the positions of elements are counted starting from 0 rather than 1. So the first element of a list is at position 0, the second is at position 1, and so on.
Note
Many programming languages use zero-based indexing. It may seem strange at first, but it makes some kinds of computations simpler by eliminating the need to add or subtract 1.
The indexing operator requires at least one argument, called the index, which goes inside of the square brackets []. The index says which elements you want to get. For DataFrames, you used column names as the index. For a list, you can use a position. So the code to get the first element of the list li is:
li[0]
8
Likewise, to get the third element:
li[2]
[4.2]
The same idea extends to containers stored inside of other containers. For example, to get the value stored in the list inside of li:
li[2][0]
4.2
You can set an element of a list by assigning a value at that index. So the code to change the first element of li to the string "hi" is:
li[0] = "hi"
li
['hi', 'hello', [4.2]]
2.2.5. References#
Assigning elements of a container is not without complication. Suppose you assign a list to a variable x and then create a new variable, y, from x. If you change an element of y, it will also change x:
x = [1, 2]
y = x
y[0] = 10
x
[10, 2]
This happens because of how Python handles containers. When you create a container, Python stores it in your computer’s memory. If you then assign the container to a variable, the variable points, or refers, to the location of the container in memory. If you create a second variable from the first, both will refer to the same location. As a result, operations on one variable will affect the value of the other, because there’s really only one container in memory and both variables refer to it.
The example above uses lists, but other containers such as Series and DataFrames behave the same way. The variable bill_value is just a reference to a column in the banknotes DataFrame.
If you want to assign an independent copy of a container to a variable rather
than a reference, you need to use a function or method to explicitly make a
copy. Many containers have a .copy
method that makes a copy:
x = [1, 2]
y = x.copy()
y[0] = 10
x
[1, 2]
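If you’re unsure whether two variables refer to the same container, one way to check is Python’s is operator, which tests whether two names refer to the same object in memory. The sketch below uses an extra variable z, introduced only for this illustration:

```python
x = [1, 2]
y = x          # y refers to the same list as x
z = x.copy()   # z refers to a new, independent list

print(y is x)  # True: one list, two names
print(z is x)  # False: two separate lists
print(z == x)  # True: the two lists have equal elements

z[0] = 10      # changes only the copy
print(x)
```

Note that the .copy method of a list makes a shallow copy: the new list is independent, but any containers nested inside it are still shared with the original.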
2.2.6. Tuples#
References can be confusing, and if you know that the elements of a container shouldn’t change, one way to prevent problems is to use a tuple. Like a list, a tuple is a one-dimensional container. The key difference is that tuples are immutable: once you create a tuple, you cannot add, remove, or replace its elements.
You can make a tuple by enclosing comma-separated values in parentheses (), like this:
(1, 2)
(1, 2)
You can also convert another container into a tuple with the tuple
function:
x = [1, 2]
y = x
x = tuple(x)
y[0] = 10
x
(1, 2)
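In the example above, tuple(x) creates a new object, so the later change made through y does not affect it. To see immutability directly, here is a small sketch that tries to assign to an element of a tuple. The variable names t and message are chosen only for this illustration:

```python
t = (1, 2)

try:
    t[0] = 10  # tuples do not support item assignment
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

print(t)  # the tuple is unchanged
```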
2.3. Data Types#
Data can be categorized into different types based on sets of shared characteristics. For instance, statisticians tend to think about whether data are numeric or categorical:
numeric
continuous (real or complex numbers)
discrete (integers)
categorical
nominal (categories with no ordering)
ordinal (categories with some ordering)
Of course, other types of data, like graphs (networks) and natural language (books, speech, and so on), are also possible. Categorizing data this way is useful for reasoning about which methods to apply to which data.
Python and most other programming languages also categorize data by type. To
check the type of an object in Python, use the built-in type
function. Recall
you used this function to check the type of the banknotes DataFrame in
Section 1.7:
type(banknotes)
pandas.core.frame.DataFrame
Take a look at the types of a few other objects:
type(bill_value)
pandas.core.series.Series
type(bill_value[0])
numpy.int64
type("hi")
str
type(x)
tuple
Note
In Python 3, class is just another word for type. The type
function
returns the class of an object. Python also provides a class
keyword to
create your own classes. Creating classes is beyond the scope of this reader,
but is explained in detail in most Python programming textbooks.
For Pandas Series and DataFrames, the type function returns the type of container, but doesn’t return any information about the types of the elements. The same is true for NumPy arrays.
Section 1.7.1 described one way to print the types of the elements
in a Pandas object: by calling the .info
method. In the printout, the element
types are listed in the Dtype
column:
banknotes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279 entries, 0 to 278
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 currency_code 279 non-null object
1 country 279 non-null object
2 currency_name 279 non-null object
3 name 279 non-null object
4 gender 279 non-null object
5 bill_count 279 non-null float64
6 profession 279 non-null object
7 known_for_being_first 279 non-null object
8 current_bill_value 279 non-null int64
9 prop_total_bills 59 non-null float64
10 first_appearance_year 279 non-null int64
11 death_year 272 non-null object
12 comments 119 non-null object
13 hover_text 89 non-null object
14 has_portrait 279 non-null bool
15 id 279 non-null object
16 scaled_bill_value 278 non-null float64
dtypes: bool(1), float64(3), int64(2), object(11)
memory usage: 35.3+ KB
The column label Dtype is short for “data type”. You can also access the element types for a DataFrame with the .dtypes attribute:
banknotes.dtypes
currency_code object
country object
currency_name object
name object
gender object
bill_count float64
profession object
known_for_being_first object
current_bill_value int64
prop_total_bills float64
first_appearance_year int64
death_year object
comments object
hover_text object
has_portrait bool
id object
scaled_bill_value float64
dtype: object
For a Series or NumPy array, you can instead use .dtype
to get the element
type:
bill_value.dtype
dtype('int64')
Some of the element types listed for the banknotes
DataFrame are built into
Python, while others are provided by Pandas and NumPy. At the expense of being
more complicated, the Pandas/NumPy types tend to be more specific and
consistent. They provide programmers with greater control over how data are
stored in memory, which makes it possible to write more efficient code. For
computations that generate or process a large amount of data, as is often the
case in research computing, efficiency is a major concern.
Here’s a non-exhaustive table of data types that you’ll often encounter in data analysis:
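Built-in | Pandas/NumPy | Example | Description |
---|---|---|---|
bool | bool | True, False | Boolean values |
int | int8, int16, int32, int64 | 4 | Whole numbers |
float | float32, float64 | 4.67 | Decimal numbers |
complex | complex64, complex128 | 1 + 2j | Complex numbers |
str | object, string | "hi" | Text strings |
 | datetime64 | | Dates and times |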
For most of the built-in types, you can explicitly construct an object with that type by calling the function with the same name as the type. For instance, here’s a way to construct an integer (type int):
n = int(4)
type(n)
int
This example is a bit silly, since you could just write 4
instead of int(4)
and you’d still get an integer:
n = 4
type(n)
int
That said, suppose you want to construct an integer from a decimal number:
n = int(4.67)
type(n)
int
n
4
Calling int
forces the value to be an integer, and the numbers after the
decimal point are removed.
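Because int removes the digits after the decimal point, it truncates toward zero rather than rounding. The following short sketch contrasts it with Python’s built-in round function:

```python
print(int(4.67))    # 4
print(int(-4.67))   # -4: truncates toward zero rather than rounding down
print(round(4.67))  # 5: rounds to the nearest integer
```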
Decimal numbers like 4.67 are better represented by a floating point number, or float. Use a float when you need to keep the fractional part of a number:
n = 4.67
type(n)
float
n
4.67
Notice that the Pandas/NumPy types have the same names as the built-in types, but with a number appended to the end. Section 2.3.3 explains what those numbers mean.
2.3.1. Strings & The object Dtype#
Strings (type str), which were introduced in Section 1.3, are a bit more complicated than Boolean values and numbers because they have many attributes and methods associated with them.
Recall that you can use double "
or single '
quotes to construct a string:
"Hello, world!"
'Hello, world!'
In Pandas and NumPy, strings are usually associated with the object data type (printed as object or O). For example, look at the name column in the banknotes data:
banknotes["name"]
0 Eva Perón
1 Julio Argentino Roca
2 Domingo Faustino Sarmiento
3 Juan Manuel de Rosas
4 Manuel Belgrano
...
274 Nelson Mandela
275 Nelson Mandela
276 Nelson Mandela
277 Nelson Mandela
278 Nelson Mandela
Name: name, Length: 279, dtype: object
The object
data type is provided as a catch-all for non-numeric data
types. For example, if you create a Series from several different types of
data, Pandas will choose object
as the element type:
mixed = pd.Series(["hi", 1, True])
mixed
0 hi
1 1
2 True
dtype: object
The individual elements of an object
Series retain their original data types:
type(mixed[0])
str
type(mixed[2])
bool
So one way to think about the object
data type is as an invisible wrapper
around each element’s original type. The Series can claim all of its elements
are generic “objects”, but when you access an element the wrapper is peeled off
and you get the original type.
You’re most likely to encounter the object type when working with Series or arrays of strings. In that case, you can generally assume all of the elements are type str. If you’re ever unsure of the type of an element, you can always use type to check.
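To check every element at once rather than one at a time, one option is to apply type across the whole Series with the .map method. This is a minimal sketch that recreates the mixed Series from above; the names element_types and all_strings are chosen only for this illustration:

```python
import pandas as pd

mixed = pd.Series(["hi", 1, True])

# Apply the built-in type function to every element.
element_types = mixed.map(type)
print(element_types)

# Check whether every element is a string.
all_strings = mixed.map(lambda value: isinstance(value, str)).all()
print(all_strings)
```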
Note
NumPy doesn’t have a dedicated string type because the way strings are stored
in memory is very different from the way numbers are stored. Since Pandas is
based on NumPy, until recently Pandas didn’t have a dedicated string type
either. So both use object
as the element type for Series and arrays of
strings.
As of Pandas 1.0, the developers have added an experimental string
type so that users can distinguish Series of strings from Series of
mixed types. Hopefully in the future the string
type will become the main way
to handle strings rather than an experimental feature.
2.3.2. Coercion & Conversion#
Although bool, int, and float are different types, in most situations Python will automatically convert between them as needed. For example, you can multiply a floating point number by an integer and then add a Boolean value:
n = 3.1 * 2 + True
n
7.2
First, the integer 2 is converted to a floating point number and multiplied by 3.1, yielding 6.2. Then the Boolean True is converted to a floating point number and added to 6.2. In Python and most other programming languages, False corresponds to 0 and True corresponds to 1. Thus the result is 7.2, a floating point number:
type(n)
float
This automatic conversion of types is known as implicit coercion. Conversion always proceeds from less general to more general types, so that no information is lost.
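For Python’s built-in numeric types, the order from less general to more general is bool, int, float, complex. The short sketch below checks the type of a few mixed expressions:

```python
print(type(True + 1))   # int: the Boolean is treated as the integer 1
print(type(1 + 2.5))    # float: the integer is converted to a float
print(type(2.5 + 1j))   # complex: the float is converted to a complex number
```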
Implicit coercion usually only applies to numeric types (including Boolean values). Mixing other types will usually cause an error. For instance, you can’t add a number to a string:
"hi" + 1
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[44], line 1
----> 1 "hi" + 1
TypeError: can only concatenate str (not "int") to str
Implicit coercion also works for numeric Pandas/NumPy types. For example, you can multiply the integers in bill_value by the floating point number 1.5:
bill_value * 1.5
0 150.0
1 150.0
2 75.0
3 30.0
4 15.0
...
274 15.0
275 30.0
276 75.0
277 150.0
278 300.0
Name: current_bill_value, Length: 279, dtype: float64
Notice that the dtype has changed from int64 to float64.
Type conversion is when you explicitly convert an object from one type to another. You already saw an example of this with the int function in Section 2.3. Here are a few more:
bool(0)
False
str(105)
'105'
Python can even convert strings into numbers and Boolean values:
float("7.3")
7.3
bool("True")
True
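Note however that such operations have to be logically sound. For instance, bool converts any non-empty string to True, so bool("False") is also True. And this will not work at all: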
int("Hello world!")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[50], line 1
----> 1 int("Hello world!")
ValueError: invalid literal for int() with base 10: 'Hello world!'
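Two practical points follow from these examples: bool does not parse the text of a string, and a failed conversion raises an error that you can catch. The sketch below shows both. The helper to_int_or_none is a hypothetical function written for this illustration, not part of Python or Pandas:

```python
print(bool("False"))  # True: any non-empty string converts to True
print(bool(""))       # False: only the empty string converts to False


def to_int_or_none(text):
    """Return int(text), or None if text is not a valid integer literal."""
    try:
        return int(text)
    except ValueError:
        return None


print(to_int_or_none("42"))
print(to_int_or_none("Hello world!"))
```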
For a Pandas Series or NumPy array, you can use the .astype method to convert the elements to a specific type. Pass the name of the target type to the method as a string. For example, here’s how to convert the bill_value elements to float64:
bill_value.astype("float64")
0 100.0
1 100.0
2 50.0
3 20.0
4 10.0
...
274 10.0
275 20.0
276 50.0
277 100.0
278 200.0
Name: current_bill_value, Length: 279, dtype: float64
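The .astype method can also convert in the other direction. The sketch below uses two small made-up Series (not taken from the banknotes data) to show that converting decimal numbers to int64 discards the fractional part, and that strings of digits can be converted to integers:

```python
import pandas as pd

decimals = pd.Series([1.9, -1.9, 2.1])
as_int = decimals.astype("int64")   # fractional parts are discarded
print(as_int.tolist())

digits = pd.Series(["1", "20", "300"])
as_num = digits.astype("int64")     # strings of digits become integers
print(as_num.tolist())
```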
2.3.3. Bit Sizes & Memory#
Recall the table of types from the beginning of this section (Section 2.3). The names of the Pandas/NumPy types in the table all end in numbers such as 32 or 64. These numbers indicate the bit size, the number of bits of memory used to store a value of that type.
For example, a single value with type int64
uses 64 bits of memory. So the
int64
series bill_value
uses about 64 bits per element, for a total of:
64 * len(bill_value) # bits
17856
In contrast, Python’s built-in data types don’t specify how much memory they use:
type(3)
int
In fact, the amount of memory can vary depending on your computer’s hardware and operating system!
Why is bit size important? Your computer has a limited amount of memory, so it’s a good habit to only use what you need. The tradeoff is that using more memory allows you to use larger or more precise numbers.
For instance, a 64-bit integer can hold values between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807, whereas a 16-bit integer can only hold values between -32,768 and 32,767. Understanding this matters when you’re working with large numbers. Without that knowledge, you might assign 32,768 to an int16 variable and find that you’ve caused an overflow error.
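You can look up the range of an integer type with NumPy’s np.iinfo function. The sketch below also shows what can happen when arithmetic on an int16 array exceeds that range: in array arithmetic, NumPy typically wraps around to the other end of the range without raising an error, which makes overflow easy to miss. The variable names are chosen only for this illustration:

```python
import numpy as np

info = np.iinfo(np.int16)
print(info.min, info.max)

a = np.array([32767], dtype=np.int16)
b = a + np.int16(1)   # exceeds the int16 maximum
print(b, b.dtype)
```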
The same holds for instances where you need a certain amount of precision in your data. For example, Python and NumPy can approximate irrational numbers, like pi. Ultimately, your computer has to represent such numbers with a finite number of digits, so the bit size of a variable will affect what pi “means” in your code:
np.pi
3.141592653589793
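To see the effect of bit size on precision, here is a small sketch that stores pi as a 64-bit and as a 32-bit floating point number and compares the two. The names pi64 and pi32 are chosen only for this illustration:

```python
import numpy as np

pi64 = np.float64(np.pi)
pi32 = np.float32(np.pi)

print(pi64)
print(pi32)

# The 32-bit value keeps fewer digits, so the two values differ slightly.
print(abs(float(pi32) - float(pi64)))
```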
Tip
You can also use bit sizes to estimate the amount of memory your data will
require, as we did for the bill_value
object. When a computation runs out of
memory, an estimate of how much memory is necessary can help you understand
whether to get better hardware or to change your computing strategy.
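As a rough sketch of such an estimate, the code below builds a stand-in Series with the same length and element type as bill_value (the zeros are placeholder values) and compares the estimate with the size NumPy reports for the underlying array. It also shows how much smaller the data would be with a 16-bit type:

```python
import numpy as np
import pandas as pd

# A stand-in for bill_value: 279 placeholder values of type int64.
s = pd.Series(np.zeros(279, dtype="int64"))

bits = 64 * len(s)
bytes_estimate = bits // 8   # 8 bits per byte

print(bits)
print(bytes_estimate)
print(s.to_numpy().nbytes)   # bytes used by the elements of the array

small = s.astype("int16")
print(small.to_numpy().nbytes)
```

The estimate covers only the elements themselves; the Series index and other metadata take some additional memory.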
2.4. Indexing in Pandas#
If you want to inspect the elements of a Series more closely, you can use indexing. Conceptually, indexing a Series is very similar to indexing a list or tuple, but Pandas offers additional ways to select and subset data via indexes.
2.4.1. What’s an Index?#
A Pandas index is more than just a positional location. Indexes serve three important roles:
As metadata to provide additional context about a data set
As a way to explicitly and automatically align data
As a convenience for getting and setting subsets of data
The index of a series is available via the index
attribute:
bill_value.index
RangeIndex(start=0, stop=279, step=1)
Index labels can be numbers, strings, dates, or other values. Pandas provides subclasses of Index for specific purposes, which you can read more about in the Pandas documentation.
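To illustrate the role of an index in aligning data, here is a minimal sketch with two small made-up Series whose labels are in different orders. When they are added, Pandas matches elements by label rather than by position:

```python
import pandas as pd

a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "x"])

# Elements are paired by label: "x" with "x" and "y" with "y".
total = a + b
print(total)
```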
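Indexes, like tuples, are immutable. The labels in an index cannot be changed. That said, names can be changed. For example, here’s how to change the name of the Series itself (an index has its own name as well, available through .index.name):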
bill_value.name = "bill_value"
bill_value
0 100
1 100
2 50
3 20
4 10
...
274 10
275 20
276 50
277 100
278 200
Name: bill_value, Length: 279, dtype: int64
The above also applies to DataFrames:
banknotes.index
RangeIndex(start=0, stop=279, step=1)
Oftentimes, an index is a range of numbers, but this can be changed. The code below uses the .set_index method to change the index of banknotes to country:
banknotes.set_index("country", inplace = True)
banknotes.index
Index(['Argentina', 'Argentina', 'Argentina', 'Argentina', 'Argentina',
'Australia', 'Australia', 'Australia', 'Australia', 'Australia',
...
'Venezuela', 'Venezuela', 'Venezuela', 'Venezuela', 'Venezuela',
'South Africa', 'South Africa', 'South Africa', 'South Africa',
'South Africa'],
dtype='object', name='country', length=279)
The inplace argument instructs Pandas to modify banknotes directly rather than return a modified copy, so that we don’t have to reassign banknotes explicitly.
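If you’d rather leave the original DataFrame untouched, you can omit inplace and assign the result to a new variable. The sketch below uses a tiny made-up DataFrame (the column names and values are placeholders), and also shows the .reset_index method, which moves the index back into a regular column:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Peru", "Peru", "Serbia"],
    "value": [100, 20, 5000],
})

# Without inplace, set_index returns a new DataFrame and leaves df as it was.
indexed = df.set_index("country")
print(df.index)
print(indexed.index)

# reset_index turns the index back into an ordinary column.
restored = indexed.reset_index()
print(restored.columns.tolist())
```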
2.4.2. Indexing by Position#
Changing the index affects how you select elements. There are three main methods for accessing specific values in Pandas:
By integer position
By label/name
By a condition
To access elements in a Series by integer position, use .iloc:
bill_value.iloc[5]
np.int64(50)
You can also pass a sequence of positions to .iloc:
bill_value.iloc[[5, 15, 25, 35]]
5 50
15 1000
25 10
35 200
Name: bill_value, dtype: int64
Use a slice to select a range of elements. The syntax for a slice is start:stop:step, where the second colon and all three arguments are optional. The element at the stop position is not included. This syntax also applies to lists. For example:
bill_value.iloc[0:5]
0 100
1 100
2 50
3 20
4 10
Name: bill_value, dtype: int64
Below, we use a slice to get every twentieth element in the Series:
bill_value.iloc[::20]
0 100
20 10
40 100
60 50000
80 200
100 2
120 200
140 20
160 500
180 100
200 20
220 500
240 100
260 100
Name: bill_value, dtype: int64
Slices also accept negative values, which count back from the end of a sequence. For instance:
bill_value.iloc[-5:]
274 10
275 20
276 50
277 100
278 200
Name: bill_value, dtype: int64
The result is the same as if you had used the .tail
method:
bill_value.tail()
274 10
275 20
276 50
277 100
278 200
Name: bill_value, dtype: int64
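Since the slice syntax is the same for lists, you can try it out without Pandas. Here is a brief sketch with a made-up list of values:

```python
values = [100, 100, 50, 20, 10, 50]

print(values[0:3])   # positions 0, 1, and 2 (the stop position is excluded)
print(values[::2])   # every second element
print(values[-2:])   # the last two elements
```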
2.4.3. Indexing by Label#
Use .loc
to index a Series or DataFrame by label:
banknotes.loc["Peru"]
currency_code | currency_name | name | gender | bill_count | profession | known_for_being_first | current_bill_value | prop_total_bills | first_appearance_year | death_year | comments | hover_text | has_portrait | id | scaled_bill_value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
country | ||||||||||||||||
Peru | PEN | Sol | Jorge Basadre Grohmann | M | 1.0 | Politician | No | 100 | NaN | 1991 | 1980 | NaN | NaN | False | PEN_Basadre | 0.473684 |
Peru | PEN | Sol | Raúl Porras Barrenechea | M | 1.0 | Politician | No | 20 | NaN | 1991 | 1960 | NaN | NaN | False | PEN_Raul | 0.052632 |
Peru | PEN | Sol | María Isabel Granda y Larco | F | 1.0 | Musician | No | 10 | NaN | 2021 | 1983 | NaN | NaN | False | PEN_Granda | 0.000000 |
Peru | PEN | Sol | José Abelardo Quiñones Gonzales | M | 1.0 | Military | No | 10 | NaN | 1991 | 1941 | NaN | NaN | False | PEN_Abelardo | 0.000000 |
Peru | PEN | Sol | Abraham Valdelomar Pinto | M | 1.0 | Writer | No | 50 | NaN | 1991 | 1919 | NaN | NaN | False | PEN_Pinto | 0.210526 |
Peru | PEN | Sol | Pedro Paulet | M | 1.0 | STEM | Yes | 100 | NaN | 2021 | 1945 | Alleged first person to build a liquid-propell... | Alleged first person to build a liquid-propell... | False | PEN_Paulet | 0.473684 |
Peru | PEN | Sol | Santa Rosa de Lima | F | 1.0 | Religious figure | Yes | 200 | NaN | 1995 | 1617 | First catholic saint of the Americas | First Catholic saint of the Americas | False | PEN_Santa | 1.000000 |
You can select specific columns as well:
banknotes.loc["Peru", "name"]
country
Peru Jorge Basadre Grohmann
Peru Raúl Porras Barrenechea
Peru María Isabel Granda y Larco
Peru José Abelardo Quiñones Gonzales
Peru Abraham Valdelomar Pinto
Peru Pedro Paulet
Peru Santa Rosa de Lima
Name: name, dtype: object
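The shape of the result from .loc depends on how many rows carry the label. The sketch below uses a tiny made-up DataFrame (the names and values are placeholders) to contrast a repeated label, a unique label, and position-based selection with .iloc:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["A", "B", "C"], "value": [100, 20, 5000]},
    index=["Peru", "Peru", "Serbia"],
)

# A label shared by several rows selects all of them, giving a Series.
peru_names = df.loc["Peru", "name"]
print(peru_names)

# A label that appears once, combined with one column, gives a single value.
serbia_value = df.loc["Serbia", "value"]
print(serbia_value)

# .iloc ignores the labels and selects by position.
first_row = df.iloc[0]
print(first_row["name"])
```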
Just as with .iloc, it’s possible to pass sequences into .loc:
banknotes.loc[["Peru", "Serbia", "Ukraine"]]
currency_code | currency_name | name | gender | bill_count | profession | known_for_being_first | current_bill_value | prop_total_bills | first_appearance_year | death_year | comments | hover_text | has_portrait | id | scaled_bill_value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
country | ||||||||||||||||
Peru | PEN | Sol | Jorge Basadre Grohmann | M | 1.0 | Politician | No | 100 | NaN | 1991 | 1980 | NaN | NaN | False | PEN_Basadre | 0.473684 |
Peru | PEN | Sol | Raúl Porras Barrenechea | M | 1.0 | Politician | No | 20 | NaN | 1991 | 1960 | NaN | NaN | False | PEN_Raul | 0.052632 |
Peru | PEN | Sol | María Isabel Granda y Larco | F | 1.0 | Musician | No | 10 | NaN | 2021 | 1983 | NaN | NaN | False | PEN_Granda | 0.000000 |
Peru | PEN | Sol | José Abelardo Quiñones Gonzales | M | 1.0 | Military | No | 10 | NaN | 1991 | 1941 | NaN | NaN | False | PEN_Abelardo | 0.000000 |
Peru | PEN | Sol | Abraham Valdelomar Pinto | M | 1.0 | Writer | No | 50 | NaN | 1991 | 1919 | NaN | NaN | False | PEN_Pinto | 0.210526 |
Peru | PEN | Sol | Pedro Paulet | M | 1.0 | STEM | Yes | 100 | NaN | 2021 | 1945 | Alleged first person to build a liquid-propell... | Alleged first person to build a liquid-propell... | False | PEN_Paulet | 0.473684 |
Peru | PEN | Sol | Santa Rosa de Lima | F | 1.0 | Religious figure | Yes | 200 | NaN | 1995 | 1617 | First catholic saint of the Americas | First Catholic saint of the Americas | False | PEN_Santa | 1.000000 |
Serbia | RSD | Serbian dinar | Slobodan Jovanovic | M | 1.0 | Head of Gov't | No | 5000 | NaN | 2003 | 1958 | writer, politician, diplomat and prime ministe... | NaN | True | RSD_Slobodan | 1.000000 |
Serbia | RSD | Serbian dinar | Milutin Milanković | M | 1.0 | STEM | No | 2000 | NaN | 2011 | 1958 | NaN | NaN | True | RSD_Milutin | 0.398798 |
Serbia | RSD | Serbian dinar | Nikola Tesla | M | 1.0 | STEM | No | 100 | NaN | 2003 | 1943 | NaN | NaN | True | RSD_Tesla | 0.018036 |
Serbia | RSD | Serbian dinar | Dorde Vajfert | M | 1.0 | Other | No | 1000 | NaN | 2003 | 1937 | was governer of the national Bank of Serbia. e... | NaN | True | RSD_Dorde | 0.198397 |
Serbia | RSD | Serbian dinar | Jovan Cvijic | M | 1.0 | STEM | No | 500 | NaN | 2004 | 1927 | NaN | NaN | True | RSD_Cvijic | 0.098196 |
Serbia | RSD | Serbian dinar | Nadežda Petrović | F | 1.0 | Visual Artist | No | 200 | NaN | 2005 | 1915 | NaN | NaN | True | RSD_Petrovic | 0.038076 |
Serbia | RSD | Serbian dinar | Stevan Stevanovic Mokranjac | M | 1.0 | Musician | Yes | 50 | NaN | 2005 | 1914 | Composer. was part of Serbia's first string qu... | Member of Serbia's first string quartet | True | RSD_Stevanovic | 0.008016 |
Serbia | RSD | Serbian dinar | Vuk Stefanovic Karadžic | M | 1.0 | Writer | Yes | 10 | NaN | 2006 | 1864 | wrote the 1st dictionary in the reformed Serbi... | Wrote the first dictionary in the reformed Ser... | True | RSD_Vuk | 0.000000 |
Serbia | RSD | Serbian dinar | Petar Petrovic Njegoš | M | 1.0 | Writer | No | 20 | NaN | 2006 | 1851 | wrote some of the most imporant Serbian litera... | NaN | True | RSD_Njegos | 0.002004 |
Ukraine | UAH | hryvna | Mykhailo Hrusheskyi | M | 1.0 | Politician | No | 50 | NaN | 1992 | 1934 | also an author, president of Central Rada (Sov... | NaN | False | UAH_Mykhailo | 0.049049 |
Ukraine | UAH | hryvna | Volodymyr Vernadskyi | M | 1.0 | STEM | No | 1000 | NaN | 2019 | 1945 | NaN | NaN | False | UAH_Vernadskyi | 1.000000 |
Ukraine | UAH | hryvna | Ivan Franko | M | 1.0 | Writer | Yes | 20 | NaN | 1992 | 1916 | 1st author of detective novels and modern poet... | First author of detective novels and modern po... | False | UAH_Franko | 0.019019 |
Ukraine | UAH | hryvna | Lesya Ukrainka | F | 1.0 | Writer | No | 200 | NaN | 2001 | 1913 | NaN | NaN | False | UAH_Ukrainka | 0.199199 |
Ukraine | UAH | hryvna | Taras Shevchenko | M | 1.0 | Writer | No | 100 | NaN | 1992 | 1841 | NaN | NaN | False | UAH_Taras | 0.099099 |
Ukraine | UAH | hryvna | Hryhoriy Skovoroda | M | 1.0 | Writer | No | 500 | NaN | 2006 | 1794 | NaN | NaN | False | UAH_Skovoroda | 0.499499 |
Ukraine | UAH | hryvna | Ivan Mazepa | M | 1.0 | Head of Gov't | No | 10 | NaN | 1992 | 1709 | elected "Hetman of Zaporizhian Host" . also se... | NaN | False | UAH_Mazepa | 0.009009 |
Ukraine | UAH | hryvna | Bogdan Khmelnitsky | M | 1.0 | Head of Gov't | No | 5 | NaN | 1992 | 1657 | 1st "hetman of ukraine", also military leader | NaN | False | UAH_Bogdon | 0.004004 |
Ukraine | UAH | hryvna | Yaroslav the Wise | M | 1.0 | Monarch | Yes | 2 | NaN | 1992 | 1054 | 1st christian prince of Kiev | First Christian Prince of Kiev | False | UAH_Yaroslav | 0.001001 |
Ukraine | UAH | hryvna | Volodymyr the Great | M | 1.0 | Monarch | No | 1 | NaN | 1992 | 1015 | NaN | NaN | False | UAH_Great | 0.000000 |
This can be a very powerful operation, but it’s easy to get mixed up when labels are integers, as with the bill_value data.
For example, this:
bill_value.loc[0:5]
0 100
1 100
2 50
3 20
4 10
5 50
Name: bill_value, dtype: int64
Is NOT the same as this:
bill_value.iloc[0:5]
0 100
1 100
2 50
3 20
4 10
Name: bill_value, dtype: int64
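The two results differ because .loc slices by label and includes the end label, while .iloc slices by position and stops before the end position. Here is a minimal sketch using a small made-up Series (the name toy and its values are illustrative, not taken from banknotes):

```python
import pandas as pd

# A small illustrative Series with the default integer labels 0..5
toy = pd.Series([100, 100, 50, 20, 10, 50])

by_label = toy.loc[0:3]      # label-based: includes the label 3
by_position = toy.iloc[0:3]  # position-based: stops before position 3

print(len(by_label))     # 4
print(len(by_position))  # 3
```

When the labels are not the default 0, 1, 2, ... sequence, the two slices can select entirely different elements, so it helps to state explicitly which one you mean.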
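Recall that bracket notation selects columns in DataFrames. With a Series, bracket notation looks up a single label the way .loc does, but an integer slice is treated as positional, like .iloc. Notice that the result below stops before position 5: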
bill_value[0:5]
0 100
1 100
2 50
3 20
4 10
Name: bill_value, dtype: int64
Finally, .iloc and .loc can be used in tandem with one another. This is called chaining. Below, we use the country-indexed banknotes DataFrame to select all rows labeled “Peru”. Then, we select the second row from this subset.
banknotes.loc["Peru"].iloc[1]
currency_code PEN
currency_name Sol
name Raúl Porras Barrenechea
gender M
bill_count 1.0
profession Politician
known_for_being_first No
current_bill_value 20
prop_total_bills NaN
first_appearance_year 1991
death_year 1960
comments NaN
hover_text NaN
has_portrait False
id PEN_Raul
scaled_bill_value 0.052632
Name: Peru, dtype: object
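To see each step of the chain separately, here is a sketch on a small made-up DataFrame (the names toy, peru, and second and all values are hypothetical). It assumes the label appears more than once in the index, so that .loc returns a DataFrame rather than a single row:

```python
import pandas as pd

# Illustrative data: "Peru" labels three of the four rows
toy = pd.DataFrame(
    {"name": ["A", "B", "C", "D"], "value": [100, 20, 10, 5]},
    index=["Peru", "Peru", "Serbia", "Peru"],
)
toy.index.name = "country"

peru = toy.loc["Peru"]  # step 1: all rows labeled "Peru" (a 3-row DataFrame)
second = peru.iloc[1]   # step 2: the second of those rows, as a Series

print(len(peru))       # 3
print(second["name"])  # B
```

If a label matched only one row, .loc would return that row as a Series, and a following .iloc[1] would pick the second value in the row instead of a second row.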
2.4.4. Indexing by a Condition#
The last way to index in Pandas is by condition. Pandas evaluates the condition to produce a Boolean Series or array, then keeps only the elements where the result is True. This is by far the most powerful method of indexing in Pandas.
For example, suppose you want to find bill values that are divisible by 25. You can use the modulo operator % to get the remainder when one positive integer is divided by another. So the condition to test for divisibility by 25 is:
bill_value % 25 == 0
0 True
1 True
2 True
3 False
4 False
...
274 False
275 False
276 True
277 True
278 True
Name: bill_value, Length: 279, dtype: bool
The result is a Boolean Series with as many elements as bill_value. You can use this condition in .loc to get only the elements where the result was True:
bill_value.loc[bill_value % 25 == 0]
0 100
1 100
2 50
5 50
9 50
...
269 200
271 100
276 50
277 100
278 200
Name: bill_value, Length: 194, dtype: int64
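Because True counts as 1 and False as 0, summing a Boolean Series tells you how many elements satisfy the condition without building the subset first. A minimal sketch on made-up values (the names values, mask, n_matching, and subset are illustrative):

```python
import pandas as pd

# Illustrative values, not the real bill_value Series
values = pd.Series([100, 100, 50, 20, 10, 50])

mask = values % 25 == 0       # Boolean Series: True where divisible by 25
n_matching = int(mask.sum())  # each True counts as 1
subset = values.loc[mask]     # keep only the elements where mask is True

print(n_matching)          # 4
print(list(subset.index))  # [0, 1, 2, 5]
```

Storing the condition in a variable such as mask also makes it easy to reuse the same condition in several places.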
You can also use square brackets [] without .loc to index by condition:
bill_value[bill_value - 100 > 5]
15 1000
16 500
17 200
27 200
33 200
...
263 200
265 2000
266 500
269 200
278 200
Name: bill_value, Length: 130, dtype: int64
With a DataFrame, indexing by condition gives you a subset of the rows:
banknotes[banknotes["currency_code"] == "MWK"]
currency_code | currency_name | name | gender | bill_count | profession | known_for_being_first | current_bill_value | prop_total_bills | first_appearance_year | death_year | comments | hover_text | has_portrait | id | scaled_bill_value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
country | ||||||||||||||||
Malawi | MWK | Kwacha | Dr. Hastings Kamuzu Banda | M | 1.0 | Head of Gov't | Yes | 1000 | NaN | 1971 | 1997 | 1st President and 1st Prime Minister of Malawi | First President and Prime Minister of Malawi | False | MWK_Hastings | 0.494949 |
Malawi | MWK | Kwacha | Rose Lomathinda Chibambo | F | 1.0 | Politician | Yes | 200 | NaN | 2012 | 2016 | 1st female minister in the independent Malawi ... | First female minister in the independent Malaw... | False | MWK_Rose | 0.090909 |
Malawi | MWK | Kwacha | Inkosi ya Makhosi M'mbelwa II | M | 1.0 | Monarch | No | 20 | NaN | 2012 | 1959 | cannot find info about this person | NaN | False | MWK_Mmbelwa | 0.000000 |
Malawi | MWK | Kwacha | Inkosi Ya Mokhosi Gomani II | M | 1.0 | Monarch | No | 50 | NaN | 2012 | 1954 | NaN | NaN | False | MWK_Gomani | 0.015152 |
Malawi | MWK | Kwacha | Reverend John Chilembwe | M | 1.0 | Revolutionary | No | 500 | NaN | 1997 | 1915 | Known for organizing an uprising against the c... | NaN | False | MWK_Chilembwe | 0.242424 |
Malawi | MWK | Kwacha | Reverend John Chilembwe | M | 1.0 | Revolutionary | No | 2000 | NaN | 1997 | 1915 | Known for organizing an uprising against the c... | NaN | False | MWK_Chilembwe | 1.000000 |
Malawi | MWK | Kwacha | James Federick Sangala | M | 1.0 | Politician | No | 100 | NaN | 2012 | 1974 | NaN | NaN | False | MWK_Sangala | 0.040404 |
To select specific columns as well, use .loc:
banknotes.loc[banknotes["current_bill_value"] == 10.0, "currency_name"]
country
Argentina Argentinian Peso
Australia Australian Dollar
Australia Australian Dollar
Bangladesh Taka
Bolivia Boliviano
Bolivia Boliviano
Bolivia Boliviano
Canada Canadian Dollar
Canada Canadian Dollar
Canada Canadian Dollar
Canada Canadian Dollar
Canada Canadian Dollar
England pound
England pound
Georgia lari
Nigeria naira
New Zealand New Zealand dollar
Peru Sol
Peru Sol
China Renminbi
Serbia Serbian dinar
Tunisia Tunisian dinar
Tunisia Tunisian dinar
Turkey Lira
Turkey Lira
Ukraine hryvna
United States US dollar
Venezuela Venezuelan bolivar
South Africa rand
Name: currency_name, dtype: object
The second argument to .loc can also be a list of column names, which lets you select multiple columns. Alternatively, you can select the columns first and then filter the rows:
cols = ["currency_code", "currency_name"]
banknotes[cols].loc[banknotes["current_bill_value"] == 10.0]
currency_code | currency_name | |
---|---|---|
country | ||
Argentina | ARS | Argentinian Peso |
Australia | AUD | Australian Dollar |
Australia | AUD | Australian Dollar |
Bangladesh | BDT | Taka |
Bolivia | BOB | Boliviano |
Bolivia | BOB | Boliviano |
Bolivia | BOB | Boliviano |
Canada | CAD | Canadian Dollar |
Canada | CAD | Canadian Dollar |
Canada | CAD | Canadian Dollar |
Canada | CAD | Canadian Dollar |
Canada | CAD | Canadian Dollar |
England | GBP | pound |
England | GBP | pound |
Georgia | GEL | lari |
Nigeria | NGN | naira |
New Zealand | NZD | New Zealand dollar |
Peru | PEN | Sol |
Peru | PEN | Sol |
China | RMB | Renminbi |
Serbia | RSD | Serbian dinar |
Tunisia | TND | Tunisian dinar |
Tunisia | TND | Tunisian dinar |
Turkey | TRY | Lira |
Turkey | TRY | Lira |
Ukraine | UAH | hryvna |
United States | USD | US dollar |
Venezuela | VES | Venezuelan bolivar |
South Africa | ZAR | rand |
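Both approaches give the same result. The following sketch compares them on a small made-up DataFrame (toy, is_ten, one_step, and two_step are illustrative names, and the rows are not the real banknotes data):

```python
import pandas as pd

# Illustrative stand-in for a few banknotes columns
toy = pd.DataFrame({
    "currency_code": ["PEN", "PEN", "RSD"],
    "currency_name": ["Sol", "Sol", "Serbian dinar"],
    "current_bill_value": [10, 100, 10],
})
cols = ["currency_code", "currency_name"]
is_ten = toy["current_bill_value"] == 10

one_step = toy.loc[is_ten, cols]  # rows and columns in a single .loc call
two_step = toy[cols].loc[is_ten]  # columns first, then rows

print(one_step.equals(two_step))  # True
```

The single .loc call is often easier to read, since the row condition and the column list sit side by side.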
2.5. Special Values#
You may have noticed that some of the data in banknotes is missing. This is common, and it’s important to understand how to handle missing or invalid values.
Values can be missing or incomplete for many reasons, so Pandas provides a lot of flexibility for detecting and handling them.
In Pandas, these special values are generally treated as missing values in the dataset, and are represented by NumPy’s nan, a special floating-point value. This reduces some of the nuance of data values and types, but was seemingly done for computational performance reasons. For example, the comments and hover_text entries in this row are missing:
banknotes.iloc[-25]
currency_code USD
currency_name US dollar
name Abraham Lincoln
gender M
bill_count 1.0
profession Head of Gov't
known_for_being_first No
current_bill_value 5
prop_total_bills 0.06
first_appearance_year 1914
death_year 1865
comments NaN
hover_text NaN
has_portrait True
id USD_Lincoln
scaled_bill_value 0.040404
Name: United States, dtype: object
2.5.1. Types of Values Considered Missing by Pandas#
In addition to np.nan (which displays as NaN), Pandas interprets several other values as missing. These include Python’s None object, as well as Pandas’ experimental NA value.
Python’s None represents the absence of a value. It often appears as the return value of a function, when something hasn’t been defined yet, or when something wasn’t found.
When creating a Series, we can pass this value:
pd.Series([None, "one", "two"])
0 None
1 one
2 two
dtype: object
Be aware that None is a Python object, and in the above example, the data type of the Series became object. If we specify a data type explicitly, Pandas converts None to that type’s representation of a missing value:
pd.Series([1.5, 2.0, 3, None], dtype="float")
0 1.5
1 2.0
2 3.0
3 NaN
dtype: float64
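The three missing-value markers behave differently in plain Python but are all recognized by Pandas. The sketch below is only an illustration: it uses pd.isna to test each marker, and the nullable "Int64" data type as one place where Pandas’ NA shows up (the name ints is made up):

```python
import numpy as np
import pandas as pd

# NaN is a float that is not equal to anything, including itself
print(np.nan == np.nan)  # False

# pd.isna treats NaN, None, and NA as missing
print(pd.isna(np.nan), pd.isna(None), pd.isna(pd.NA))  # True True True

# With the nullable integer type, None is stored as NA and the
# other values stay integers instead of becoming floats
ints = pd.Series([1, 2, None], dtype="Int64")
print(ints.dtype)            # Int64
print(ints.isna().tolist())  # [False, False, True]
```

Because NaN never compares equal to itself, use .isna or pd.isna to test for missing values rather than ==.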
2.5.2. Reading in Missing Values from a CSV file#
An obvious source of missing or incomplete values is the data itself. When the data was collected, there may have been reasons to code missing data. For example, in a survey, a question may not have applied to every respondent.
Another example is a measurement that was not taken for some of the samples. There are no fixed rules for how missing data is represented in a data set. However, there are several conventions, and Pandas is aware of many of them.
When reading data from a CSV file, Pandas will automatically detect missing values. By default, it converts any empty cell, and strings such as ‘NA’, ‘NaN’, ‘nan’, ‘null’, ‘N/A’, and other variants, to NaN. A full list can be found in the Pandas documentation.
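If a data set uses its own marker for missing data, the na_values parameter of pd.read_csv adds it to the default list. The sketch below reads a small made-up CSV from a string instead of a file; the column names and the marker “missing” are assumptions chosen for illustration:

```python
import io

import pandas as pd

# Made-up CSV text: one empty cell, one "NA", and one custom marker
csv_text = "id,score\n1,4.5\n2,\n3,NA\n4,missing\n"

# Default behavior: the empty cell and "NA" become NaN, "missing" stays text
default = pd.read_csv(io.StringIO(csv_text))

# Treat the custom marker as missing too
custom = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])

print(default["score"].isna().sum())  # 2
print(custom["score"].isna().sum())   # 3
```

In the first read, the leftover string keeps the score column from being numeric. Once the marker is declared as missing, the column is read as floating-point numbers.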
2.5.3. Detecting Missing Values#
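To detect missing values, Pandas provides two complementary methods: .isna and .notna.
For a quick overview, the .count method on DataFrames reports the number of non-missing values in each column. Any column with a count below the number of rows (279 here) contains missing values: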
banknotes.count()
currency_code 279
currency_name 279
name 279
gender 279
bill_count 279
profession 279
known_for_being_first 279
current_bill_value 279
prop_total_bills 59
first_appearance_year 279
death_year 272
comments 119
hover_text 89
has_portrait 279
id 279
scaled_bill_value 278
dtype: int64
The hover_text column of the DataFrame has many missing values, so we can use it to see what they look like.
ht = banknotes["hover_text"]
ht.count()
np.int64(89)
The return value of .isna is a Boolean Series indicating which of the values are considered missing:
ht.isna()
country
Argentina True
Argentina True
Argentina True
Argentina True
Argentina False
...
South Africa False
South Africa False
South Africa False
South Africa False
South Africa False
Name: hover_text, Length: 279, dtype: bool
The reverse, a Series showing which values are not considered missing, is returned with .notna:
ht.notna()
country
Argentina False
Argentina False
Argentina False
Argentina False
Argentina True
...
South Africa True
South Africa True
South Africa True
South Africa True
South Africa True
Name: hover_text, Length: 279, dtype: bool
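Since a Boolean Series can be summed, .isna combined with .sum counts the missing values directly, complementing .count. A minimal sketch on a made-up DataFrame (toy and missing_per_column are illustrative names):

```python
import numpy as np
import pandas as pd

# Illustrative data with two missing comments
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "comments": ["first", np.nan, None],
})

missing_per_column = toy.isna().sum()  # number of missing values per column

print(missing_per_column["name"])      # 0
print(missing_per_column["comments"])  # 2
print(toy["comments"].count())         # 1 non-missing value
```

For any column, the missing and non-missing counts add up to the number of rows.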
2.5.4. Replacing Missing Values#
We can use this Boolean Series to subset with .loc. For example, to keep only the values that aren’t missing:
ht.loc[ht.notna()]
country
Argentina Designed first Argentine flag
Australia First Australian Aboriginal writer to be publi...
Australia First person appointed Dame Commander of the B...
Australia Founded Royal Flying Doctor Service, the world...
Australia First Australian woman to serve as a member of...
...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
Name: hover_text, Length: 89, dtype: object
Pandas also provides a shortcut with the .dropna method:
ht.dropna()
country
Argentina Designed first Argentine flag
Australia First Australian Aboriginal writer to be publi...
Australia First person appointed Dame Commander of the B...
Australia Founded Royal Flying Doctor Service, the world...
Australia First Australian woman to serve as a member of...
...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
Name: hover_text, Length: 89, dtype: object
Another strategy is to fill the missing values with a replacement value. We can do so using the .fillna method, here with -1 as the replacement:
ht.fillna(-1, inplace=True)
ht
country
Argentina -1
Argentina -1
Argentina -1
Argentina -1
Argentina Designed first Argentine flag
...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
Name: hover_text, Length: 279, dtype: object
Additionally, the data set may have its own indicator for missing values, e.g., “” or 0. We can convert those to missing using the .replace method. Here, the indicator is the -1 we filled in above:
ht.replace(-1, np.nan)
country
Argentina NaN
Argentina NaN
Argentina NaN
Argentina NaN
Argentina Designed first Argentine flag
...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
South Africa First Black state leader and first President o...
Name: hover_text, Length: 279, dtype: object
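Converting an indicator to NaN matters most for numeric data, where an unconverted indicator silently distorts calculations. The sketch below assumes a hypothetical sensor column that uses -999 to mean “no reading” (readings, cleaned, and the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical readings where -999 marks a missing measurement
readings = pd.Series([3.0, -999.0, 5.0])

# Convert the indicator to a real missing value
cleaned = readings.replace(-999.0, np.nan)

print(readings.mean())  # pulled far below zero by the indicator
print(cleaned.mean())   # 4.0, because NaN is skipped by default
```

Note that .replace returns a new Series, so the result has to be assigned to a variable to be kept.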
2.6. Exercises#
2.6.1. Exercise#
Python’s range function offers another way to create a sequence of numbers. Read the help file for this function.
Create an example range. How does this differ from a list?
Describe the three arguments that you can use in range. Give examples of each.
Convert one of those ranges to a list and print it to screen. What changes in the way Python represents this sequence?
2.6.2. Exercise#
Return to the discussion in Section 2.3.2.
Why does "3" + 4 raise an error?
Why does True - 1 return 0?
Why does int(4.6) < 4.6 return True?
2.6.3. Exercise#
Use a search engine or consult StackOverflow to figure out how to subset a DataFrame with multiple conditions.
Create a new DataFrame from banknotes with the following conditions: current bill value is less than or equal to 20; gender is female; contains the columns country, name, comments, has_portrait.
Use a Pandas function to count the number of entries that have portraits. How many are there?
Return the last available comment. What does it say?