Slicing & Dicing Data with pandas

1. Slicing & Dicing Data with pandas#

Learning Objectives

Explain what pandas indexes are and how they’re used
Get, modify, and set indexes on series and data frames
Explain and use pandas’ index alignment
Describe and use the 4 major modes of indexing in pandas
Explain the difference between .loc and .at
Identify problems where using .set_index to make a column an index is helpful
Explain and use pandas’ query mini-language

This chapter is a deep dive into how indexing—the process of getting or setting elements of a data structure—works in the pandas package. Indexing is sometimes also called subsetting or (element) extraction, and is a fundamental operation when using a programming language to solve problems in data science.

1.1. Prerequisites#

This chapter assumes you already have basic familiarity with Python and pandas. In particular, you should be comfortable indexing Python lists and dictionaries with the square bracket operator [ ], and should be able to explain what a Series and DataFrame are and how the they differ. DataLab’s Python Basics Reader and its accompanying workshop provide a suitable introduction to these topics.

To follow along, you’ll need the following software versions (or newer) installed on your computer:

Python 3.10
NumPy 1.22
pandas 1.4

One way to install these is to install the Anaconda Python distribution. Chapter 2 provides more details about Anaconda and the conda package manager.

CLICK HERE to get the data set used in the examples for this chapter. After clicking the link, you’ll need to right-click on the page and select “Save Page As…”. Make sure to save the page somewhere you can easily access from your Python session.

1.2. What’s an Index?#

In pandas, an Index is a Python object that stores a sequence of labels or names. The labels usually correspond to the elements of a data structure along a given dimension or axis. Indexes are a key feature that differentiates pandas from other software for programming with tabular data (such as R).

According to the pandas documentation, indexes serve three important roles:

As metadata to provide additional context about a data set
As a way to explicitly and automatically align data, avoiding bugs due to misalignment (see Section 1.2.3)
As a convenience for getting and setting subsets of the data (see Section 1.4)

1.2.1. Indexes on Series#

As an example of where pandas uses indexes, Series are 1-dimensional and always have exactly one index. You can see the index just to the left of the elements when you print out a series:

import pandas as pd

x = pd.Series([4, 5, 3, 1])
x

  4
  5
  3
  1
dtype: int64

You can get or set the index on a series with the .index attribute:

x.index

RangeIndex(start=0, stop=4, step=1)

The labels in an index can be numbers, strings, dates, or other values. Pandas provides subclasses of Index for specific purposes, including:

RangeIndex, where the labels are a monotonic sequence of integers with a fixed beginning, end, and step size
CategoricalIndex, where the labels are categories (and not necessarily unique)
IntervalIndex, where the labels are non-overlapping intervals of the real number line
DateTimeIndex, where the labels are dates and times
PeriodIndex, where each label is a date and time span

You may also encounter Int64Index, UInt64Index, and Float64Index, which represent an arbitrary list of labels of the respective type. For example:

x[[1, 3]].index

Int64Index([1, 3], dtype='int64')

Beginning in pandas 2.0, these three subclasses (but not subclasses in the bulleted list above) will be replaced with a generic Index class. See this warning in the pandas documentation for details.

You can access the labels in an index with standard Python indexing operations:

idx = x.index
# Get the 2nd label
idx[1]

Indexes are immutable, which means the labels in an index can’t be changed. For instance:

idx[0] = 5

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 idx[0] = 5

File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/indexes/base.py:5302, in Index.__setitem__(self, key, value)
   5300 @final
   5301 def __setitem__(self, key, value):
-> 5302     raise TypeError("Index does not support mutable operations")

TypeError: Index does not support mutable operations

Although labels in an index can’t be changed, every index also has a name, which can be changed. Think of the name as a name for the dimension or axis to which the index corresponds. You can get or set the name of an index with the .name attribute (on the index itself):

idx.name

The default value of .name is None, which means the index doesn’t have a name. When pandas prints a series or data frame, it will also print the names of any attached indexes. For example:

x.index.name = "labels"
x

labels
0    4
1    5
2    3
3    1
dtype: int64

Pandas automatically creates indexes whenever you create (or load) a series or data frame. Pandas also provides functions to construct indexes manually. These are usually named after the type of index. For instance, you can create an ordinary Index from a list or NumPy array:

new_idx = pd.Index(["a", "b", "c", "d"])
new_idx

Index(['a', 'b', 'c', 'd'], dtype='object')

You can use assignment to replace the index on a series. For example:

x.index = new_idx
x

a    4
b    5
c    3
d    1
dtype: int64

Note that the new index must have the same length as the index it replaces (and the series itself).

Checkpoint Exercise

Replace the index on x with one like the original (in variable idx), but where the first label is 5 instead of 0. Try to do this without typing out a list of all 4 labels.

Hint: the Python list function can convert many types of objects into lists, including pandas indexes.

1.2.2. Indexes on Data Frames#

Pandas DataFrames are 2-dimensional and always have exactly two indexes. The indexes correspond to the rows and the columns, respectively. To see this, use pandas to load the data from the sports.csv file:

dtype = {"classification_other": str}
sports = pd.read_csv("data/sports.csv", dtype = dtype)
sports.head()

	year	unitid	institution_name	city_txt	state_cd	zip_text	classification_code	classification_name	classification_other	ef_male_count	...	partic_coed_women	sum_partic_men	sum_partic_women	rev_men	rev_women	total_rev_menwomen	exp_men	exp_women	total_exp_menwomen	sports
0	2015	100654	Alabama A & M University	Normal	AL	35762.0	2	NCAA Division I-FCS	NaN	1923	...	NaN	31	0	345592.0	NaN	345592.0	397818.0	NaN	397818.0	Baseball
1	2015	100654	Alabama A & M University	Normal	AL	35762.0	2	NCAA Division I-FCS	NaN	1923	...	NaN	19	16	1211095.0	748833.0	1959928.0	817868.0	742460.0	1560328.0	Basketball
2	2015	100654	Alabama A & M University	Normal	AL	35762.0	2	NCAA Division I-FCS	NaN	1923	...	NaN	61	46	183333.0	315574.0	498907.0	246949.0	251184.0	498133.0	All Track Combined
3	2015	100654	Alabama A & M University	Normal	AL	35762.0	2	NCAA Division I-FCS	NaN	1923	...	NaN	99	0	2808949.0	NaN	2808949.0	3059353.0	NaN	3059353.0	Football
4	2015	100654	Alabama A & M University	Normal	AL	35762.0	2	NCAA Division I-FCS	NaN	1923	...	NaN	9	0	78270.0	NaN	78270.0	83913.0	NaN	83913.0	Golf

5 rows × 28 columns

This data set contains information about funding and staffing of athletics programs at U.S. universities from 2015 to 2019 (inclusive). The source for the data set is the U.S. Department of Education’s Equity in Athletics Data Analysis project, and the data was cleaned by the Tidy Tuesday community. This article describes some of the conclusions that can be drawn from the data set.

On data frames, you can use the .index attribute to get and set the row index. For example:

sports.index

RangeIndex(start=0, stop=132327, step=1)

The default index created by the read_csv function counts from 0 to the number of rows.

The other index on data frames corresponds to the columns. You can use the .columns attribute to get and set the column index. For instance:

sports.columns

Index(['year', 'unitid', 'institution_name', 'city_txt', 'state_cd',
       'zip_text', 'classification_code', 'classification_name',
       'classification_other', 'ef_male_count', 'ef_female_count',
       'ef_total_count', 'sector_cd', 'sector_name', 'sportscode',
       'partic_men', 'partic_women', 'partic_coed_men', 'partic_coed_women',
       'sum_partic_men', 'sum_partic_women', 'rev_men', 'rev_women',
       'total_rev_menwomen', 'exp_men', 'exp_women', 'total_exp_menwomen',
       'sports'],
      dtype='object')

The indexes stored in the .index and .columns attributes of a data frame have the same classes and basic properties as indexes on series.

1.2.3. Alignment#

For arithmetic operations that involve more than one series or data frame, pandas automatically aligns the elements based on the indexes. As an example, consider these two series:

u = pd.Series([1, 2, 3], index = ["a", "b", "c"])
v = pd.Series([1, 2, 3], index = ["c", "b", "a"])

The elements of the series are identical, but the indexes differ. If you’re an experienced NumPy (or R) user, the result of adding these two series together might surprise you:

u + v

a    4
b    4
c    4
dtype: int64

This is the result because pandas aligns the elements, adding element a from u to element a from v, element b to element b, and so on.

Sometimes you might want to do arithmetic on series or data frames without any alignment. The canonical solution in this case is to use the .to_numpy method to convert one or both of the objects into a NumPy array:

u + v.to_numpy()

a    2
b    4
c    6
dtype: int64

Pandas tries to align elements and carry out arithmetic operations even for objects with mismatched shapes and labels. Pandas fills elements for which the result is unknown with missing values. For example:

w = pd.Series([-10, 0], index = ["z", "a"])

u - w

a    1.0
b    NaN
c    NaN
z    NaN
dtype: float64

In this example, only element a has two operands; for each other element, the result is a missing value because only one operand is known.

Checkpoint Exercise

What happens if you use an arithmetic operation on two series or data frames where one has repeated/duplicated labels in its index?

Hint: construct an example and see what happens!

Pandas does not align series and data frames in comparison operations, and raises an error if they’re not already aligned:

u > v

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[18], line 1
----> 1 u > v

File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/ops/common.py:72, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     68             return NotImplemented
     70 other = item_from_zerodim(other)
---> 72 return method(self, other)

File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/arraylike.py:58, in OpsMixin.__gt__(self, other)
     56 @unpack_zerodim_and_defer("__gt__")
     57 def __gt__(self, other):
---> 58     return self._cmp_method(other, operator.gt)

File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/series.py:6237, in Series._cmp_method(self, other, op)
   6234 res_name = ops.get_op_result_name(self, other)
   6236 if isinstance(other, Series) and not self._indexed_same(other):
-> 6237     raise ValueError("Can only compare identically-labeled Series objects")
   6239 lvalues = self._values
   6240 rvalues = extract_array(other, extract_numpy=True, extract_range=True)

ValueError: Can only compare identically-labeled Series objects

See this GitHub issue for a developer discussion of whether this is a helpful feature for preventing bugs or a nuisance.

Finally, the align method manually aligns two series or data frames. The method returns two objects of the same type, but with elements sorted so that the indexes match. For example:

u.align(v)

(a    1
 b    2
 c    3
 dtype: int64,
 a    3
 b    2
 c    1
 dtype: int64)

For objects with mismatched shapes, the align method inserts missing values to make their shapes and indexes match:

u.align(w)

(a    1.0
 b    2.0
 c    3.0
 z    NaN
 dtype: float64,
 a     0.0
 b     NaN
 c     NaN
 z   -10.0
 dtype: float64)

The align method provides a way to align to objects so that they can be compared:

ua, va = u.align(v)
ua > va

a    False
b    False
c     True
dtype: bool

Moreover, concatenating the two results from the align method is the same as performing a join. The align method even provides a parameter join to select the type of join ("outer", "left", "right", or "inner"). For relational series and data frames, you can break a join into smaller steps by using a combination of the align function , column indexing, and the pandas concat function.

1.3. Indexing Operators#

Pandas provides several different operators for indexing series and data frames, as well as several different ways to use each. The primary indexing operator is the square bracket [ ].

For a Series, the square bracket operator usually selects elements by label. For instance, integer arguments are treated as labels, not positions, if the labels in the index are integers:

x = pd.Series([1, 2, 3], index = [1, 0, 3])
x[0]

You can use a list to select multiple elements at once or repeat elements:

x[[0, 3, 0]]

  2
  3
  2
dtype: int64

You can also use the square bracket operator to select elements by label when the labels are strings:

y = pd.Series([10, 20, 30], index = ["a", "b", "c"])
y["a"]

In this case, passing an integer argument selects an element by position:

y[0]

Using the square bracket operator this way in non-interactive code is risky, because it can cause bugs if some of the series or data frames your code operates on unexpectedly have integer labels.

The dot . serves as a secondary indexing operator for series and data frames with string indexes. For instance, to get the element b:

y.b

Since the dot . is also used to access attributes and methods, it cannot be used to access elements with a label that’s the same as the name of an attribute or method. When in doubt, it’s safer to use [ ] to access elements by label.

1.3.1. Slices#

A slice selects a range of elements. The syntax is start:stop:step, with the second colon : and arguments optional, the same as for Python’s built-in slicing. The default start value is the beginning of the series, the default stop value is one past the end, and the default step value is 1.

Slices can be by position or by label. With [ ], when the start or stop value is an integer, pandas assumes you want to slice by position:

x[1:3]

0    2
3    3
dtype: int64

In a slice by position, the element at the stop position is not included.

Slicing with only the step can be useful in a variety of situations, such as getting every second element:

x[::2]

1    1
3    3
dtype: int64

Negative values in a slice are counted backward from the end of the object. For example, this code gets the last two elements of the series:

x[-2:]

0    2
3    3
dtype: int64

Checkpoint Exercise

What happens if you use a negative step value in a slice?

When the start or stop value is a string, pandas slices by label:

y["b":]

b    20
c    30
dtype: int64

Be aware that when you slice by label, the element at the stop position is always included:

y["b":"c"]

b    20
c    30
dtype: int64

This is a case where the pandas developers decided convenience outweighs consistency. See this section of the pandas documentation for more details.

Checkpoint Exercise

What happens if you include an integer step value in a slice by label?

1.3.2. By Position with `.iloc`#

When you want to index a series or data frame by position, use [ ] on the .iloc attribute. Then any integer arguments are assumed to be positions, regardless of the type of index on the object.

For example:

x.iloc[0]

x.iloc[[1, 0, 2]]

  2
  1
  3
dtype: int64

x.iloc[0:]

  1
  2
  3
dtype: int64

You can use this attribute regardless of the index type:

y.iloc[0:]

a    10
b    20
c    30
dtype: int64

As a result, the .iloc attribute is a more reliable way to index by position than using [ ] alone.

Pandas also provides an .iat attribute to get or set scalar values in a series or data frame by position:

x.iat[0]

Compared to .iloc, the .iat attribute is typically much faster (in terms of CPU time), so it’s important to use .iat in performance-critical sections of code. In addition, .iat raises an error if your code attempts to get or set more than one value, which can potentially alert you to bugs that would go undetected with .iloc.

1.3.3. By Label with `.loc`#

When you want to index a series or data frame by label, use [ ] on the .loc attribute. Then any integer arguments are assumed to be labels.

For example:

x.loc[0]

x.loc[0]

x.loc[[0, 1]]

0    2
1    1
dtype: int64

y.loc["b"]

y.loc[["b", "a", "b"]]

b    20
a    10
b    20
dtype: int64

Pandas also provides an .at attribute to get or set scalar values in a series or data frame by label:

x.at[0]

The advantages of .at compared to .loc are the same as .iat compared to .iloc: better run-time speed and stricter requirements on the argument.

When setting elements by label with [ ], .loc, or .at, if you use a label that’s not already in the series index, pandas will append the label and value. For instance:

x.loc[10] = -1
x

   1
   2
   3
 -1
dtype: int64

This doesn’t work when setting elements by position.

1.3.4. By a Condition#

The square bracket operator [ ], the .iloc attribute, and the .loc attribute all support indexing a series by a condition. The condition must be represented by a Boolean array with the same length as the series.

You can use comparison operators to get a suitable Boolean array. For example:

is_positive = x > 0
is_positive

    True
    True
    True
  False
dtype: bool

x[is_positive]

  1
  2
  3
dtype: int64

x.loc[is_positive]

  1
  2
  3
dtype: int64

If the condition has an index, the square bracket operator and .loc operator will try to align the condition with the series being indexed:

z = x.iloc[[1, 0, 3]]
z

   2
   1
 -1
dtype: int64

z[is_positive]

0    2
1    1
dtype: int64

The .iloc operator raises an error if the Boolean array has an index:

x.iloc[is_positive]

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[49], line 1
----> 1 x.iloc[is_positive]

File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/indexing.py:1073, in _LocationIndexer.__getitem__(self, key)
   1070 axis = self.axis or 0
   1072 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1073 return self._getitem_axis(maybe_callable, axis=axis)

File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/indexing.py:1611, in _iLocIndexer._getitem_axis(self, key, axis)
   1608     key = np.asarray(key)
   1610 if com.is_bool_indexer(key):
-> 1611     self._validate_key(key, axis)
   1612     return self._getbool_axis(key, axis=axis)
   1614 # a list of integers

File ~/garden/micromamba/envs/workshop/lib/python3.11/site-packages/pandas/core/indexing.py:1452, in _iLocIndexer._validate_key(self, key, axis)
   1450 if hasattr(key, "index") and isinstance(key.index, Index):
   1451     if key.index.inferred_type == "integer":
-> 1452         raise NotImplementedError(
   1453             "iLocation based boolean "
   1454             "indexing on an integer type "
   1455             "is not available"
   1456         )
   1457     raise ValueError(
   1458         "iLocation based boolean indexing cannot use "
   1459         "an indexable as a mask"
   1460     )
   1461 return

NotImplementedError: iLocation based boolean indexing on an integer type is not available

As described in Section 1.2.3, the canonical way to avoid index alignment is to use the .to_numpy method to convert to a NumPy array:

x.iloc[is_positive.to_numpy()]

  1
  2
  3
dtype: int64

The logic operators for inverting and combining conditions in pandas differ from Python’s usual logic operators:

| is logical or
& is logical and
~ is logical not

For example:

x[is_positive & (x % 2 == 1)]

1    1
3    3
dtype: int64

In compound conditions, each sub-condition must be enclosed in parentheses (), because in Python’s order of operations, the |, &, and ~ operators are evaluated before comparison operators such as ==, >, and <.

1.3.5. By a Callable#

The square bracket operator [ ], the .iloc attribute, and the .loc attribute also all support indexing a series by a function (or callable object defined with def or lambda. The function must accept one argument: the series itself. The function must return a valid argument for the indexing operator (positions, labels, or a Boolean array).

The primary use case for this form of indexing is chaining together multiple operations without defining intermediate variables. For example, this code adds each element at an even-numbered position to the next element, and then gets only the results greater than 2:

(x.iloc[::2] + x.iloc[1::2].to_numpy())[lambda a: a > 2]

1    3
dtype: int64

As you can probably see, chaining operations makes code relatively difficult to understand. Whenever possible, it’s better to use simple indexing strategies rather than indexing by a callable.

Checkpoint Exercise

Use indexing by a callable to set the elements of x with even values (not positions) to twice their current value.

What’s a simpler way to do this using other indexing modes?

1.3.6. Data Frames#

The indexing operators [ ], .iloc, and .loc can all be used with data frames in addition to series. Since data frames are 2-dimensional, you can pass the indexing operators a separate argument for each dimension. The rows come first.

For instance, to get (1st row, 2nd and 3rd columns) of the athletics data frame the code is:

sports.iloc[0, [1, 2]]

unitid                                100654
institution_name    Alabama A & M University
Name: 0, dtype: object

When the result has a lower dimension than the object being indexed, pandas automatically converts to an appropriate data type. In this case, the result is converted to a series with the names of the two columns as its index.

As another example, the code to get (rows labeled 10 and 11, column labeled "year") is:

sports.loc[[10, 11], "year"]

10    2015
11    2015
Name: year, dtype: int64

With .iloc and .loc, you can select all of the elements along a given dimension by passing a colon : as the argument for that dimension. For instance, this code gets all rows in the columns labeled "year" and "institution_name":

sports.loc[:, ["year", "institution_name"]]

	year	institution_name
0	2015	Alabama A & M University
1	2015	Alabama A & M University
2	2015	Alabama A & M University
3	2015	Alabama A & M University
4	2015	Alabama A & M University
...	...	...
132322	2019	Simon Fraser University
132323	2019	Simon Fraser University
132324	2019	Simon Fraser University
132325	2019	Simon Fraser University
132326	2019	Simon Fraser University

132327 rows × 2 columns

You can mix indexing by condition with indexing by position or label. The most common way to do this is to use a condition to select rows and labels to select columns. For example:

sports.loc[sports.year == 2018, ["year", "institution_name", "sports"]]

	year	institution_name	sports
52387	2018	Alabama A & M University	Baseball
52388	2018	Alabama A & M University	Basketball
52389	2018	Alabama A & M University	Football
52390	2018	Alabama A & M University	Golf
52391	2018	Alabama A & M University	Softball
...	...	...	...
70154	2018	Simon Fraser University	Soccer
70155	2018	Simon Fraser University	Softball
70156	2018	Simon Fraser University	Swimming
70157	2018	Simon Fraser University	Volleyball
70158	2018	Simon Fraser University	Wrestling

17772 rows × 3 columns

Checkpoint Exercise

Use indexing to get only the rows where year is 2016 and institution_name is "University of California-Davis". Limit the columns to year, institution_name, sports, and total_rev_menwomen (the total revenue from the program in USD).

Indexing a data frame with only one argument generally accesses rows. For example:

sports.iloc[0]

year                                        2015
unitid                                    100654
institution_name        Alabama A & M University
city_txt                                  Normal
state_cd                                      AL
zip_text                                 35762.0
classification_code                            2
classification_name          NCAA Division I-FCS
classification_other                         NaN
ef_male_count                               1923
ef_female_count                             2300
ef_total_count                              4223
sector_cd                                      1
sector_name              Public, 4-year or above
sportscode                                     1
partic_men                                  31.0
partic_women                                 NaN
partic_coed_men                              NaN
partic_coed_women                            NaN
sum_partic_men                                31
sum_partic_women                               0
rev_men                                 345592.0
rev_women                                    NaN
total_rev_menwomen                      345592.0
exp_men                                 397818.0
exp_women                                    NaN
total_exp_menwomen                      397818.0
sports                                  Baseball
Name: 0, dtype: object

As an exception, using the square bracket operator [ ] to index a data frame with a string argument (or list of strings) accesses columns:

sports[["year", "institution_name"]]

	year	institution_name
0	2015	Alabama A & M University
1	2015	Alabama A & M University
2	2015	Alabama A & M University
3	2015	Alabama A & M University
4	2015	Alabama A & M University
...	...	...
132322	2019	Simon Fraser University
132323	2019	Simon Fraser University
132324	2019	Simon Fraser University
132325	2019	Simon Fraser University
132326	2019	Simon Fraser University

132327 rows × 2 columns

1.4. Columns as Indexes#

An important feature of pandas is that you can use arbitrary columns in a data frame as indexes. Generally the columns should contain identifiers for the rows or other categorical data. For example, in the sports data frame, the column institution_name is an identifier.

You can use the .set_index method to set a column as an index:

sports = sports.set_index("institution_name")
sports.head()

	year	unitid	city_txt	state_cd	zip_text	classification_code	classification_name	classification_other	ef_male_count	ef_female_count	...	partic_coed_women	sum_partic_men	sum_partic_women	rev_men	rev_women	total_rev_menwomen	exp_men	exp_women	total_exp_menwomen	sports
institution_name
Alabama A & M University	2015	100654	Normal	AL	35762.0	2	NCAA Division I-FCS	NaN	1923	2300	...	NaN	31	0	345592.0	NaN	345592.0	397818.0	NaN	397818.0	Baseball
Alabama A & M University	2015	100654	Normal	AL	35762.0	2	NCAA Division I-FCS	NaN	1923	2300	...	NaN	19	16	1211095.0	748833.0	1959928.0	817868.0	742460.0	1560328.0	Basketball
Alabama A & M University	2015	100654	Normal	AL	35762.0	2	NCAA Division I-FCS	NaN	1923	2300	...	NaN	61	46	183333.0	315574.0	498907.0	246949.0	251184.0	498133.0	All Track Combined
Alabama A & M University	2015	100654	Normal	AL	35762.0	2	NCAA Division I-FCS	NaN	1923	2300	...	NaN	99	0	2808949.0	NaN	2808949.0	3059353.0	NaN	3059353.0	Football
Alabama A & M University	2015	100654	Normal	AL	35762.0	2	NCAA Division I-FCS	NaN	1923	2300	...	NaN	9	0	78270.0	NaN	78270.0	83913.0	NaN	83913.0	Golf

5 rows × 27 columns

After setting a categorical column as an index, you can use indexing by label as a convenient way to get rows in specific categories:

sports.loc["University of California-Davis"]

	year	unitid	city_txt	state_cd	zip_text	classification_code	classification_name	classification_other	ef_male_count	ef_female_count	...	partic_coed_women	sum_partic_men	sum_partic_women	rev_men	rev_women	total_rev_menwomen	exp_men	exp_women	total_exp_menwomen	sports
institution_name
University of California-Davis	2015	110644	Davis	CA	956168678.0	2	NCAA Division I-FCS	NaN	11215	16223	...	NaN	36	0	884627.0	NaN	884627.0	884627.0	NaN	884627.0	Baseball
University of California-Davis	2015	110644	Davis	CA	956168678.0	2	NCAA Division I-FCS	NaN	11215	16223	...	NaN	15	19	1862783.0	1340541.0	3203324.0	1862783.0	1340541.0	3203324.0	Basketball
University of California-Davis	2015	110644	Davis	CA	956168678.0	2	NCAA Division I-FCS	NaN	11215	16223	...	NaN	62	94	571582.0	590416.0	1161998.0	571582.0	590416.0	1161998.0	All Track Combined
University of California-Davis	2015	110644	Davis	CA	956168678.0	2	NCAA Division I-FCS	NaN	11215	16223	...	NaN	0	25	NaN	659996.0	659996.0	NaN	659996.0	659996.0	Field Hockey
University of California-Davis	2015	110644	Davis	CA	956168678.0	2	NCAA Division I-FCS	NaN	11215	16223	...	NaN	105	0	4478087.0	NaN	4478087.0	4478087.0	NaN	4478087.0	Football
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
University of California-Davis	2019	110644	Davis	CA	95616.0	2	NCAA Division I-FCS	NaN	11780	18371	...	NaN	0	27	NaN	945919.0	945919.0	NaN	945919.0	945919.0	Equestrian
University of California-Davis	2019	110644	Davis	CA	95616.0	2	NCAA Division I-FCS	NaN	11780	18371	...	NaN	0	0	NaN	NaN	NaN	NaN	NaN	NaN	Rodeo
University of California-Davis	2019	110644	Davis	CA	95616.0	2	NCAA Division I-FCS	NaN	11780	18371	...	NaN	0	0	NaN	NaN	NaN	NaN	NaN	NaN	Sailing
University of California-Davis	2019	110644	Davis	CA	95616.0	2	NCAA Division I-FCS	NaN	11780	18371	...	NaN	0	0	NaN	NaN	NaN	NaN	NaN	NaN	Table Tennis
University of California-Davis	2019	110644	Davis	CA	95616.0	2	NCAA Division I-FCS	NaN	11780	18371	...	NaN	0	0	NaN	NaN	NaN	NaN	NaN	NaN	Weight Lifting

96 rows × 27 columns

You can use the .reset_index method to restore the institution_name column to the data frame and reset the index to a range:

sports = sports.reset_index()
sports.head()

	institution_name	year	unitid	city_txt	state_cd	zip_text	classification_code	classification_name	classification_other	ef_male_count	...	partic_coed_women	sum_partic_men	sum_partic_women	rev_men	rev_women	total_rev_menwomen	exp_men	exp_women	total_exp_menwomen	sports
0	Alabama A & M University	2015	100654	Normal	AL	35762.0	2	NCAA Division I-FCS	NaN	1923	...	NaN	31	0	345592.0	NaN	345592.0	397818.0	NaN	397818.0	Baseball
1	Alabama A & M University	2015	100654	Normal	AL	35762.0	2	NCAA Division I-FCS	NaN	1923	...	NaN	19	16	1211095.0	748833.0	1959928.0	817868.0	742460.0	1560328.0	Basketball
2	Alabama A & M University	2015	100654	Normal	AL	35762.0	2	NCAA Division I-FCS	NaN	1923	...	NaN	61	46	183333.0	315574.0	498907.0	246949.0	251184.0	498133.0	All Track Combined
3	Alabama A & M University	2015	100654	Normal	AL	35762.0	2	NCAA Division I-FCS	NaN	1923	...	NaN	99	0	2808949.0	NaN	2808949.0	3059353.0	NaN	3059353.0	Football
4	Alabama A & M University	2015	100654	Normal	AL	35762.0	2	NCAA Division I-FCS	NaN	1923	...	NaN	9	0	78270.0	NaN	78270.0	83913.0	NaN	83913.0	Golf

5 rows × 28 columns

Setting indexes appropriately makes it much more convenient to filter the rows of a data frame.

1.4.1. MultiIndexes#

One of pandas’ most powerful features is the ability to set multiple columns as the index at the same time. This creates a MultiIndex, a hierarchical index with multiple levels. For example, this code creates a MultiIndex from year, institution_name and sports:

sports = sports.set_index(["year", "institution_name", "sports"])

The MultiIndex makes it easy to get all of the information for a specific year, institution, and sport. When working with a MultiIndex, the corresponding argument to the indexing operator should be a tuple with one element for each level of the MultiIndex:

sports.loc[(2015, "Stanford University", "Football")]

/tmp/ipykernel_12885/1887213640.py:1: PerformanceWarning: indexing past lexsort depth may impact performance.
  sports.loc[(2015, "Stanford University", "Football")]

			unitid	city_txt	state_cd	zip_text	classification_code	classification_name	classification_other	ef_male_count	ef_female_count	ef_total_count	...	partic_coed_men	partic_coed_women	sum_partic_men	sum_partic_women	rev_men	rev_women	total_rev_menwomen	exp_men	exp_women	total_exp_menwomen
year	institution_name	sports
2015	Stanford University	Football	243744	Stanford	CA	94305.0	1	NCAA Division I-FBS	NaN	3663	3331	6994	...	NaN	NaN	100	0	43744639.0	NaN	43744639.0	23724407.0	NaN	23724407.0

1 rows × 25 columns

You can use the .xs (read: “cross section”) method to select rows from only one particular level of a MultiIndex. For instance, to get all rows for Harvard university:

sports.xs("Harvard University", level = 1)

		unitid	city_txt	state_cd	zip_text	classification_code	classification_name	classification_other	ef_male_count	ef_female_count	ef_total_count	...	partic_coed_men	partic_coed_women	sum_partic_men	sum_partic_women	rev_men	rev_women	total_rev_menwomen	exp_men	exp_women	total_exp_menwomen
year	sports
2015	Baseball	166027	Cambridge	MA	2138.0	2	NCAA Division I-FCS	NaN	3637	3256	6893	...	NaN	NaN	35	0	543855.0	NaN	543855.0	543855.0	NaN	543855.0
	Basketball	166027	Cambridge	MA	2138.0	2	NCAA Division I-FCS	NaN	3637	3256	6893	...	NaN	NaN	16	19	1296186.0	743673.0	2039859.0	1296186.0	743673.0	2039859.0
	All Track Combined	166027	Cambridge	MA	2138.0	2	NCAA Division I-FCS	NaN	3637	3256	6893	...	NaN	NaN	138	95	721675.0	721128.0	1442803.0	721675.0	721128.0	1442803.0
	Fencing	166027	Cambridge	MA	2138.0	2	NCAA Division I-FCS	NaN	3637	3256	6893	...	NaN	NaN	13	12	151656.0	164450.0	316106.0	151656.0	164450.0	316106.0
	Field Hockey	166027	Cambridge	MA	2138.0	2	NCAA Division I-FCS	NaN	3637	3256	6893	...	NaN	NaN	0	24	NaN	363711.0	363711.0	NaN	363711.0	363711.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2019	Equestrian	166027	Cambridge	MA	2138.0	2	NCAA Division I-FCS	NaN	3531	3454	6985	...	NaN	NaN	0	0	NaN	NaN	NaN	NaN	NaN	NaN
	Rodeo	166027	Cambridge	MA	2138.0	2	NCAA Division I-FCS	NaN	3531	3454	6985	...	NaN	NaN	0	0	NaN	NaN	NaN	NaN	NaN	NaN
	Sailing	166027	Cambridge	MA	2138.0	2	NCAA Division I-FCS	NaN	3531	3454	6985	...	NaN	NaN	14	23	108409.0	178101.0	286510.0	108409.0	178101.0	286510.0
	Table Tennis	166027	Cambridge	MA	2138.0	2	NCAA Division I-FCS	NaN	3531	3454	6985	...	NaN	NaN	0	0	NaN	NaN	NaN	NaN	NaN	NaN
	Weight Lifting	166027	Cambridge	MA	2138.0	2	NCAA Division I-FCS	NaN	3531	3454	6985	...	NaN	NaN	0	0	NaN	NaN	NaN	NaN	NaN	NaN

122 rows × 25 columns

The level argument is 1 because institution_name is the 2nd level of the MultiIndex, and indexing starts from 0. You can also use the name of the level instead:

sports.xs("Princeton University", level = "institution_name")

		unitid	city_txt	state_cd	zip_text	classification_code	classification_name	classification_other	ef_male_count	ef_female_count	ef_total_count	...	partic_coed_men	partic_coed_women	sum_partic_men	sum_partic_women	rev_men	rev_women	total_rev_menwomen	exp_men	exp_women	total_exp_menwomen
year	sports
2015	Baseball	186131	Princeton	NJ	85440070.0	2	NCAA Division I-FCS	NaN	2717	2543	5260	...	NaN	NaN	31	0	454767.0	NaN	454767.0	454767.0	NaN	454767.0
	Basketball	186131	Princeton	NJ	85440070.0	2	NCAA Division I-FCS	NaN	2717	2543	5260	...	NaN	NaN	16	17	1117914.0	1170976.0	2288890.0	1117914.0	1170976.0	2288890.0
	All Track Combined	186131	Princeton	NJ	85440070.0	2	NCAA Division I-FCS	NaN	2717	2543	5260	...	NaN	NaN	159	175	776687.0	665927.0	1442614.0	776687.0	665927.0	1442614.0
	Fencing	186131	Princeton	NJ	85440070.0	2	NCAA Division I-FCS	NaN	2717	2543	5260	...	NaN	NaN	16	13	238594.0	219334.0	457928.0	238594.0	219334.0	457928.0
	Field Hockey	186131	Princeton	NJ	85440070.0	2	NCAA Division I-FCS	NaN	2717	2543	5260	...	NaN	NaN	0	23	NaN	500755.0	500755.0	NaN	500755.0	500755.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2019	Swimming and Diving	186131	Princeton	NJ	8544.0	2	NCAA Division I-FCS	NaN	2670	2638	5308	...	NaN	NaN	38	37	685836.0	710247.0	1396083.0	685836.0	710247.0	1396083.0
	Tennis	186131	Princeton	NJ	8544.0	2	NCAA Division I-FCS	NaN	2670	2638	5308	...	NaN	NaN	11	11	550389.0	472485.0	1022874.0	550389.0	472485.0	1022874.0
	Volleyball	186131	Princeton	NJ	8544.0	2	NCAA Division I-FCS	NaN	2670	2638	5308	...	NaN	NaN	17	16	324540.0	423386.0	747926.0	324540.0	423386.0	747926.0
	Water Polo	186131	Princeton	NJ	8544.0	2	NCAA Division I-FCS	NaN	2670	2638	5308	...	NaN	NaN	22	17	408972.0	335028.0	744000.0	408972.0	335028.0	744000.0
	Wrestling	186131	Princeton	NJ	8544.0	2	NCAA Division I-FCS	NaN	2670	2638	5308	...	NaN	NaN	31	0	840184.0	NaN	840184.0	840184.0	NaN	840184.0

91 rows × 25 columns

The pandas documentation on MultiIndexes describes a wide variety of ways MultiIndexes can be used to slice and dice data frames.

Checkpoint Exercise

Use a MultiIndex to get a series that lists the sports for which there were university athletics programs in Aberdeen, WA in 2018.

1.5. The `.query` Method#

The .query method is another powerful way to extract information from a data frame. The idea of .query is that you write a query string using a query mini-language to select a subset of the rows. This has several advantages over other ways of indexing:

For large data sets, the .query method is slightly faster than other ways of indexing
You don’t have to repeat the name of the data frame every time you refer to a column in a condition
You can reuse a single query across multiple data frames, provided they all have the queried columns
Conditions are easier to read:
- You can use not, or and and to invert and combine conditions instead of ~, |, and &
- You can use in and not in in conditions
- You can write a < b < c rather than (a < b) & (b < c)
- It’s not necessary to enclose sub-conditions in parentheses, since the query mini-language defines its own order of operations

Here’s an example of a query:

sports = sports.reset_index()

qs = "city_txt == 'Davis' and state_cd == 'CA' and sports == 'Basketball'"
ucd_bball = sports.query(qs)
ucd_bball[["institution_name", "sports", "year", "total_rev_menwomen"]]

	institution_name	sports	year	total_rev_menwomen
913	University of California-Davis	Basketball	2015	3203324.0
18263	University of California-Davis	Basketball	2016	3817436.0
35698	University of California-Davis	Basketball	2017	4021500.0
53319	University of California-Davis	Basketball	2018	4240907.0
73947	University of California-Davis	Basketball	2019	4141401.0

Checkpoint Exercise

Use a query string to get a data frame that lists the sports for which there were university athletics programs in Aberdeen, WA in 2018.

The pandas documentation provides further details about how to write query strings.

Slicing & Dicing Data with pandas

Contents

1. Slicing & Dicing Data with pandas#

1.1. Prerequisites#

1.2. What’s an Index?#

1.2.1. Indexes on Series#

1.2.2. Indexes on Data Frames#

1.2.3. Alignment#

1.3. Indexing Operators#

1.3.1. Slices#

1.3.2. By Position with .iloc#

1.3.3. By Label with .loc#

1.3.4. By a Condition#

1.3.5. By a Callable#

1.3.6. Data Frames#

1.4. Columns as Indexes#

1.4.1. MultiIndexes#

1.5. The .query Method#

1.3.2. By Position with `.iloc`#

1.3.3. By Label with `.loc`#

1.5. The `.query` Method#