2. Data Types & Structures#

Learning Objectives

After this lesson, you should be able to:

  • Check the type of an object

  • Cast an object to a different type

  • Describe and differentiate lists, series, tuples, sets, dicts, and arrays

  • Explain what a comprehension is

  • Identify and cast categorical data

  • Explain what broadcasting is

  • Describe and differentiate None, null, and NaN

  • Locate missing values in a series

  • Select columns of a data frame

  • Filter rows of a data frame on a condition

  • Negate or combine conditions with logic operators

The previous chapter introduced Python, providing enough background to do simple computations on data sets. This chapter focuses on the foundational knowledge and skills you’ll need to use Python effectively in the long term. Specifically, it’s a deep dive into data types and data structures in Python and Polars. Working knowledge of these will make you more effective at analyzing data and solving problems.

2.1. Data Types#

In Summarizing Data, we used the .glimpse method to get a structural summary of the California least tern data set:

terns.glimpse()
Rows: 791
Columns: 43
$ year                <i64> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000
$ site_name           <str> 'PITTSBURG POWER PLANT', 'ALBANY CENTRAL AVE', 'ALAMEDA POINT', 'KETTLEMAN CITY', 'OCEANO DUNES STATE VEHICULAR RECREATION AREA', 'RANCHO GUADALUPE DUNES PRESERVE', 'VANDENBERG SFB', 'SANTA CLARA RIVER MCGRATH STATE BEACH', 'ORMOND BEACH', 'NBVC POINT MUGU'
$ site_name_2013_2018 <str> 'Pittsburg Power Plant', 'NA_NO POLYGON', 'Alameda Point', 'Kettleman', 'Oceano Dunes State Vehicular Recreation Area', 'Rancho Guadalupe Dunes Preserve', 'Vandenberg AFB', 'Santa Clara River', 'Ormond Beach', 'NBVC Point Mugu'
$ site_name_1988_2001 <str> 'NA_2013_2018 POLYGON', 'Albany Central Avenue', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON'
$ site_abbr           <str> 'PITT_POWER', 'AL_CENTAVE', 'ALAM_PT', 'KET_CTY', 'OCEANO_DUNES', 'RGDP', 'VAN_SFB', 'S_CLAR_MCG', 'ORMOND', 'PT_MUGU'
$ region_3            <str> 'S.F._BAY', 'S.F._BAY', 'S.F._BAY', 'KINGS', 'CENTRAL', 'CENTRAL', 'CENTRAL', 'SOUTHERN', 'SOUTHERN', 'SOUTHERN'
$ region_4            <str> 'S.F._BAY', 'S.F._BAY', 'S.F._BAY', 'KINGS', 'CENTRAL', 'CENTRAL', 'CENTRAL', 'VENTURA', 'VENTURA', 'VENTURA'
$ event               <str> 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA'
$ bp_min              <f64> 15.0, 6.0, 282.0, 2.0, 4.0, 9.0, 30.0, 21.0, 73.0, 166.0
$ bp_max              <f64> 15.0, 12.0, 301.0, 3.0, 5.0, 9.0, 32.0, 21.0, 73.0, 167.0
$ fl_min              <i64> 16, 1, 200, 1, 4, 17, 11, 9, 60, 64
$ fl_max              <i64> 18, 1, 230, 2, 4, 17, 11, 9, 65, 64
$ total_nests         <i64> 15, 20, 312, 3, 5, 9, 32, 22, 73, 252
$ nonpred_eggs        <i64> 3, None, 124, None, 2, 0, None, 4, 2, None
$ nonpred_chicks      <i64> 0, None, 81, 3, 0, 1, 27, 3, 0, None
$ nonpred_fl          <i64> 0, None, 2, 1, 0, 0, 0, None, 0, None
$ nonpred_ad          <i64> 0, None, 1, 6, 0, 0, 0, None, 0, None
$ pred_control        <str> None, None, None, None, None, None, None, None, None, None
$ pred_eggs           <i64> 4, None, 17, None, 0, None, 0, None, None, None
$ pred_chicks         <i64> 2, None, 0, None, 4, None, 3, None, None, None
$ pred_fl             <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ pred_ad             <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ pred_pefa           <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_coy_fox        <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_meso           <str> 'N', None, 'N', None, 'N', None, 'N', None, 'Y', None
$ pred_owlspp         <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_corvid         <str> 'Y', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_other_raptor   <str> 'Y', None, 'Y', None, 'N', None, 'Y', None, 'Y', None
$ pred_other_avian    <str> 'N', None, 'Y', None, 'Y', None, 'N', None, 'N', None
$ pred_misc           <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ total_pefa          <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_coy_fox       <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_meso          <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_owlspp        <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_corvid        <i64> 4, None, 0, None, 0, None, 0, None, None, None
$ total_other_raptor  <i64> 2, None, 6, None, 0, None, 3, None, None, None
$ total_other_avian   <i64> 0, None, 11, None, 4, None, 0, None, None, None
$ total_misc          <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ first_observed      <str> '2000-05-11', None, '2000-05-01', '2000-06-10', '2000-05-04', '2000-05-07', '2000-05-07', '2000-06-06', None, '2000-05-21'
$ last_observed       <str> '2000-08-05', None, '2000-08-19', '2000-09-24', '2000-08-30', '2000-08-13', '2000-08-17', '2000-09-05', None, '2000-08-12'
$ first_nest          <str> '2000-05-26', None, '2000-05-16', '2000-06-17', '2000-05-28', '2000-05-31', '2000-05-28', '2000-06-06', '2000-06-08', '2000-06-01'
$ first_chick         <str> '2000-06-18', None, '2000-06-07', '2000-07-22', '2000-06-20', '2000-06-22', '2000-06-20', '2000-06-28', '2000-06-26', '2000-06-24'
$ first_fledge        <str> '2000-07-08', None, '2000-06-30', '2000-08-06', '2000-07-13', '2000-07-20', '2000-07-15', '2000-07-24', '2000-07-17', '2000-07-16'

The first two lines describe the shape of the data set. After that, each line lists a column name, the type of data in that column, and that column’s first few values. For instance, the site_name column contains str, or string, data.

We categorize data into different types based on sets of shared characteristics because types are useful for reasoning about what we can do with the data. For example, statisticians conventionally categorize data as one of four types within two larger categories:

  • numeric

    • continuous (real or complex numbers)

    • discrete (integers)

  • categorical

    • nominal (categories with no ordering)

    • ordinal (categories with some ordering)

Which approaches and statistical techniques are appropriate depends on the type of the data. Of course, other types of data, like graphs (networks) and natural language (books, speech, and so on), are also possible.

Most programming languages, including Python, also categorize data by type. The following table lists some of Python’s built-in types:

Type      Example       Description
bool      True, False   Boolean values
int       -8, 0, 42     Integers
float     -2.1, 0.5     Real numbers
complex   3j, 1-2j      Complex numbers
str       "hi", "2.1"   Strings

You can check the type of an object in Python with the built-in type function. Take a look at the types of a few objects:

type("hi")
str
type(True)
bool
type(-8.3)
float

In Python, class is just another word for type. So we can also say that the type function returns the class of an object.

Note

Python provides a class keyword to create your own classes. Creating classes is beyond the scope of this reader, but is explained in detail in most Python programming textbooks.

Tip

You can use the isinstance function to test whether a value is of a particular class. For example, to test whether 5 is a string:

isinstance(5, str)
False

2.1.1. Coercion & Casting#

Although bool, int, and float are different types, in most situations Python will automatically convert between them as needed. For example, you can multiply a floating point number by an integer and then add a Boolean value:

n = 3.1 * 2 + True
n
7.2

First, the integer 2 is converted to a floating point number and multiplied by 3.1, yielding 6.2. Then the Boolean True is converted to a floating point number and added to 6.2. In Python and most other programming languages, False corresponds to 0 and True corresponds to 1. Thus the result is 7.2, a floating point number:

type(n)
float

This automatic conversion of types is known as implicit coercion. Conversion always proceeds from less general to more general types, so that no information is lost.
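Here’s a minimal sketch of that ladder, with a Boolean coerced first to an int and then to a float:

True + 2     # bool coerced to int: 3
True + 2.0   # bool coerced to float: 3.0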

Implicit coercion only applies in situations where the intent of the code is relatively unambiguous, such as arithmetic between different types of numbers (including Booleans). For example, you can’t add a number to a string, because it’s unclear what the result should be:

"hi" + 1
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 "hi" + 1

TypeError: can only concatenate str (not "int") to str

A cast explicitly converts an object from one type to another, sometimes losing information. You can cast an object to a particular type with the function of the same name. For example, to cast to the bool type:

bool(0)
False

Or to cast to the int type:

int(4.67)
4

Casts are especially useful for converting to and from the str type:

"hi" + str(1)
'hi1'
float("7.3")
7.3

Python will raise an error if a cast is not possible. For example, this will not work:

int("Hello world!")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 1
----> 1 int("Hello world!")

ValueError: invalid literal for int() with base 10: 'Hello world!'

2.2. Lists#

A data structure is a collection of data organized in a particular way. In Python, data structures are also called containers, because they contain data. Containers make working with lots of data manageable and efficient. Data frames, introduced in the previous chapter, are an example of a two-dimensional data structure.

A list is a general-purpose one-dimensional data structure. Lists are built into Python; you don’t even need to import a module in order to use them. You can create a list by enclosing any number of comma-separated values in square brackets [], like this:

x = [10, 20, 30, 40, 50]
x
[10, 20, 30, 40, 50]

Lists are ordered, which means the values, or elements, have specific positions. The first element is 10, the second is 20, the fifth is 50, and so on.

The elements of a list can be of different types, so we say lists are heterogeneous. For instance, this list contains a number, a string, and another list (with one element):

li = [8, "hello", [4.2]]
li
[8, 'hello', [4.2]]

A list can have no elements, in which case we say it’s empty. For example:

empty = []
empty
[]

You can get the length of a list with the len function:

len(empty)
0

The list function converts other containers into lists. Strings are technically containers for individual characters, so:

list("data science")
['d', 'a', 't', 'a', ' ', 's', 'c', 'i', 'e', 'n', 'c', 'e']

2.2.1. Indexing#

So far you’ve learned two ways to use square brackets []:

  1. To select columns from a data frame, as in terns["year"]

  2. To create lists, as in ["a", "b", 1]

The first case is an example of indexing, which means getting or setting elements of a data structure. The square brackets [] are Python’s indexing operator.

You can use indexing to get an element of a list based on the element’s position. Python uses zero-based indexing, which means the positions of elements are counted starting from 0 rather than 1. So the first element of a list is at position 0, the second is at position 1, and so on.

Note

Many programming languages use zero-based indexing. It may seem strange at first, but it makes some kinds of computations simpler by eliminating the need to add or subtract 1.

The indexing operator requires at least one argument, called the index, which goes inside of the square brackets []. The index says which elements you want to get. For a data frame, you can use a position or a column name as the index. For a list, you can only use a position.

As an example, consider the list li we created earlier:

li = [8, "hello", [4.2]]
li
[8, 'hello', [4.2]]

The code to get the first element is:

li[0]
8

Likewise, to get the third element:

li[2]
[4.2]

The third element is a list too. If you want to get its first element, you can chain, or repeat, the indexing operator:

li[2][0]
4.2

Read this code from right to left as “get the first element of the third element of the variable li.”

You can use a slice to select a range of elements. The syntax for a slice is lower:upper:stride, where all of the arguments and the second colon : are optional. The lower bound defaults to 0, the upper bound defaults to the length of the list, and the stride defaults to 1. For example, to get the first two elements:

li[:2]
[8, 'hello']

As another example, you can use a slice to get every other element:

li[::2]
[8, [4.2]]

Negative values in a slice index backwards from the end of the list. For instance, to get the last 2 elements:

li[-2:]
['hello', [4.2]]

You can set an element of a list by assigning a value at a given index. So the code to change the first element of li to the string “hi” is:

li[0] = "hi"
li
['hi', 'hello', [4.2]]

Indexing isn’t just for lists: most of the examples in this section also apply to series, data frames, and other data structures.
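For instance, here’s a small sketch of indexing and slicing a Polars series from the least terns data (assuming terns is loaded as before):

terns["year"][0]     # first element
terns["year"][:3]    # first three elements
terns["year"][-2:]   # last two elements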

2.2.2. References#

Assigning elements of a container is not without complication. Suppose you assign a list to a variable x and then create a new variable y from x. If you change an element of y, it will also change x:

x = [1, 2]
y = x
y[0] = 10
x
[10, 2]

This happens because of how Python handles containers. When you create a container, Python stores it in your computer’s memory. If you then assign the container to a variable, the variable points, or refers, to the location of the container in memory. If you create a second variable from the first, both will refer to the same location. As a result, operations on one variable will affect the value of the other, because there’s really only one container in memory and both variables refer to it.

The example above uses lists, but other containers—such as data frames—behave the same way. If you want to assign an independent copy of a container to a variable rather than a reference, you need to use a function or method to explicitly make a copy. Many containers have a .copy or .clone method that makes a copy:

x = [1, 2]
y = x.copy()
y[0] = 10
x
[1, 2]

2.2.3. Comprehensions#

A list comprehension creates a new list from the elements of an existing list. We’ll use this list to demonstrate comprehensions:

values = [10, 11, 12, -5, 13, 14]

If you want, for example, to add 1 to each element of the list, you can use a comprehension to do it. Here’s how:

[v + 1 for v in values]
[11, 12, 13, -4, 14, 15]

In words, this code tells Python to create a new list where the elements are v + 1 for each element v in values. The enclosing square brackets [] indicate that the result should be a list. More generally, the syntax for a comprehension is:

EXPRESSION for ELEMENT in CONTAINER

Replace CONTAINER with a data structure, ELEMENT with a variable name for the elements, and EXPRESSION with an expression to compute (typically using the elements).

For instance, you can also use a comprehension to compute the type of each element in a container:

[type(v) for v in values]
[int, int, int, int, int, int]

Comprehensions can also filter out some elements based on a condition. Suppose we only want the positive elements of the list:

[v for v in values if v > 0]
[10, 11, 12, 13, 14]

The syntax for a comprehension with a condition is:

EXPRESSION for ELEMENT in CONTAINER if CONDITION

The CONDITION must be an expression that evaluates to True or False (typically using the elements).

Comprehensions are an efficient way to compute (and compute on) lists. You can also use comprehensions with Python’s other built-in data structures, which you’ll learn about in Built-in Data Structures.
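For example, a comprehension can iterate over a tuple, and the result is still a list:

[len(s) for s in ("hi", "hello", "hey")]
[2, 5, 3]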

2.3. Series#

A series is an ordered, one-dimensional data structure. Series are a fundamental data structure in Polars, because each column in a data frame is a series.

For example, in the California least tern data set, the site_name column is a series. Take a look at the first few elements with its .head method:

terns["site_name"].head()
shape: (10,)
site_name
str
"PITTSBURG POWER PLANT"
"ALBANY CENTRAL AVE"
"ALAMEDA POINT"
"KETTLEMAN CITY"
"OCEANO DUNES STATE VEHICULAR R…
"RANCHO GUADALUPE DUNES PRESERV…
"VANDENBERG SFB"
"SANTA CLARA RIVER MCGRATH STAT…
"ORMOND BEACH"
"NBVC POINT MUGU"

Series and data frames have many attributes and methods in common; the .head method is one of these.

Notice that the elements of the site_name series are all strings. Unlike a list, in a series all elements must be of the same type, so we say series are homogeneous. A series can contain strings, integers, decimal numbers, or any of several other types of data, but not a mix of these all at once.

The other columns in the least tern data are also series. For instance, the year column is a series of integers:

terns["year"]
shape: (791,)
year
i64
2000
2000
2000
2000
2000
2023
2023
2023
2023
2023

Series can contain any number of elements, including 0 or 1 element. You can check the number of elements, or length, of a series with Python’s built-in len function:

len(terns["year"])
791

Since this is a column from the terns data frame, its length is the same as the number of rows in terns.

Note

You can also check the length of a series with the .shape attribute:

terns["year"].shape
(791,)

Python prints the value of .shape differently from the result of len because they are different types of data. Most of the time, it’s more convenient to use the len function to check lengths of one-dimensional objects like series, because it returns an integer.

2.3.1. Creating Series#

Sometimes you’ll want to create series by manually inputting data, perhaps because your data set isn’t digitized or because you want a toy data set to test out some code. You can create a series from a list (or other sequence) with the pl.Series function:

pl.Series([1, 2, 19, -3])
shape: (4,)
i64
1
2
19
-3
pl.Series(["hi", "hello"])
shape: (2,)
str
"hi"
"hello"

The Polars documentation recommends setting a name for every series. To do this with pl.Series, pass the name as the first argument and the elements as the second argument:

pl.Series("tens", [10, 20, 30])
shape: (3,)
tens
i64
10
20
30

Polars will print the name when you print the series and will use the name as a column name if you put the series in a data frame. You can get or set the name on a series through the .name attribute:

x = pl.Series("tens", [10, 20, 30])
x.name
'tens'

Tip

If you want to create a series that contains a sequence of numbers, there are several helper functions you can use. Python’s built-in range function creates a sequence of integers. NumPy’s np.arange and np.linspace functions can create sequences of integers or decimal numbers. You can pass the result from any of these functions to pl.Series to create a series.
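Here’s a minimal sketch of each approach, with made-up series names (idx, ticks, grid):

import numpy as np

pl.Series("idx", range(5))                  # 0, 1, 2, 3, 4
pl.Series("ticks", np.arange(0, 1, 0.25))   # 0.0, 0.25, 0.5, 0.75
pl.Series("grid", np.linspace(0, 1, 5))     # 0.0, 0.25, 0.5, 0.75, 1.0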

2.3.2. Data Types in Polars#

Series are homogeneous, so if you try to create a series from elements of different types, the pl.Series function will raise an error:

pl.Series([1, "cool", 2.3])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:316, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
    315 try:
--> 316     return constructor(name, values, strict)
    317 except (TypeError, OverflowError) as e:
    318     # # This retry with i64 is related to https://github.com/pola-rs/polars/issues/17231
    319     # # Essentially, when given a [0, u64::MAX] then it would Overflow.

TypeError: 'str' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[40], line 1
----> 1 pl.Series([1, "cool", 2.3])

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/series/series.py:295, in Series.__init__(self, name, values, dtype, strict, nan_to_null)
    292         raise TypeError(msg)
    294 if isinstance(values, Sequence):
--> 295     self._s = sequence_to_pyseries(
    296         name,
    297         values,
    298         dtype=dtype,
    299         strict=strict,
    300         nan_to_null=nan_to_null,
    301     )
    303 elif values is None:
    304     self._s = sequence_to_pyseries(name, [], dtype=dtype)

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:301, in sequence_to_pyseries(name, values, dtype, strict, nan_to_null)
    298     except RuntimeError:
    299         return PySeries.new_from_any_values(name, values, strict=strict)
--> 301 return _construct_series_with_fallbacks(
    302     constructor, name, values, dtype, strict=strict
    303 )

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:329, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
    325     return _construct_series_with_fallbacks(
    326         PySeries.new_opt_u64, name, values, dtype, strict=strict
    327     )
    328 elif dtype is None:
--> 329     return PySeries.new_from_any_values(name, values, strict=strict)
    330 else:
    331     return PySeries.new_from_any_values_and_dtype(
    332         name, values, dtype, strict=strict
    333     )

TypeError: unexpected value while building Series of type Int64; found value of type String: "cool"

Hint: Try setting `strict=False` to allow passing data with mixed types.

By default, Polars infers the data type of a series’ elements from the first element. This can lead to errors you might not expect:

pl.Series([1, 9.2, 2.3])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:316, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
    315 try:
--> 316     return constructor(name, values, strict)
    317 except (TypeError, OverflowError) as e:
    318     # # This retry with i64 is related to https://github.com/pola-rs/polars/issues/17231
    319     # # Essentially, when given a [0, u64::MAX] then it would Overflow.

TypeError: 'float' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[41], line 1
----> 1 pl.Series([1, 9.2, 2.3])

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/series/series.py:295, in Series.__init__(self, name, values, dtype, strict, nan_to_null)
    292         raise TypeError(msg)
    294 if isinstance(values, Sequence):
--> 295     self._s = sequence_to_pyseries(
    296         name,
    297         values,
    298         dtype=dtype,
    299         strict=strict,
    300         nan_to_null=nan_to_null,
    301     )
    303 elif values is None:
    304     self._s = sequence_to_pyseries(name, [], dtype=dtype)

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:301, in sequence_to_pyseries(name, values, dtype, strict, nan_to_null)
    298     except RuntimeError:
    299         return PySeries.new_from_any_values(name, values, strict=strict)
--> 301 return _construct_series_with_fallbacks(
    302     constructor, name, values, dtype, strict=strict
    303 )

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:329, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
    325     return _construct_series_with_fallbacks(
    326         PySeries.new_opt_u64, name, values, dtype, strict=strict
    327     )
    328 elif dtype is None:
--> 329     return PySeries.new_from_any_values(name, values, strict=strict)
    330 else:
    331     return PySeries.new_from_any_values_and_dtype(
    332         name, values, dtype, strict=strict
    333     )

TypeError: unexpected value while building Series of type Int64; found value of type Float64: 9.2

Hint: Try setting `strict=False` to allow passing data with mixed types.

You can explicitly specify a data type for a series with the pl.Series function’s third parameter, dtype:

pl.Series([1, 9.2, 2.3], dtype = float)
shape: (3,)
f64
1.0
9.2
2.3

Polars uses its own data types for series elements, so that:

  • It can efficiently support types of data that are not built into Python, such as categorical data.

  • Every numeric type has an explicit bit size: the number of bits of memory necessary to store an element. Bit sizes appear as suffixes in the name of the type. For instance, Float32 stores a floating point number in 32 bits.

  • A special value, null, can be present in any series to indicate missing data. We’ll explain null in Special Values.
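For instance, here’s a minimal sketch that requests a specific bit size by passing one of Polars’ own types as the dtype:

pl.Series([1, 9.2, 2.3], dtype = pl.Float32)   # a Float32 series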

When you create or access the elements of a series, Polars silently converts between its types and Python’s built-in types. Some of the Polars types and their Python equivalents are listed in the following table:

Type                        Python Equivalent   Description
Boolean                     bool                Boolean values
Int8, Int16, Int32, Int64   int                 Integers
Float32, Float64            float               Real numbers (base-2 floating point)
Not yet supported           complex             Complex numbers
String                      str                 Strings
Categorical, Enum           No equivalent       Categorical data

The Polars documentation has the complete list.

If you call Python’s type function on a data structure, it returns the type of the data structure:

type(terns["site_name"])
polars.series.series.Series

For a series, you can get the element type with the .dtype attribute:

terns["site_name"].dtype
String

Note

Data frames don’t have a .dtype attribute since they can consist of multiple series. Instead, they have a .dtypes attribute, a list with the element type for each column.

If your goal is to summarize a data frame, the .glimpse method is usually more convenient.
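For instance, here’s a quick sketch of .dtypes on the least terns data, showing only the first few column types:

terns.dtypes[:5]
[Int64, String, String, String, String]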

You can use the .cast method to cast the elements of a series to a specific type. For example, here’s how to cast the total_nests column to a String series:

terns["total_nests"].cast(pl.String)
shape: (791,)
total_nests
str
"15"
"20"
"312"
"3"
"5"
"717"
"44"
"59"
"48"
"171"

2.3.3. Categorical Data#

A feature is categorical if it measures a qualitative category. For example, the genres rock, blues, alternative, folk, and pop are categories.

Polars uses the Categorical and Enum data types to represent categorical data. Visualizations and statistical models sometimes treat categorical data differently than other data types, so it’s important to make sure you have the right data type.

When it reads a data set, Polars usually can’t tell which features are categorical. That means identifying and converting the categorical features is up to you. For beginners, it can be difficult to understand whether a feature is categorical or not. The key is to think about whether you want to use the feature to divide the data into groups.

For example, if you want to know how many songs are in the rock genre, you first need to divide the songs by genre, and then count the number of songs in each group (or at least the rock group).

As a second example, months recorded as numbers can be categorical or not, depending on how you want to use them. You might want to treat them as categorical (for example, to compute max rainfall in each month) or you might want to treat them as numbers (for example, to compute the number of months between two events).

The bottom line is that you have to think about what you’ll be doing in the analysis. In some cases, you might treat a feature as categorical only for part of the analysis.
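Returning to the months example, here’s a minimal sketch of both treatments, using made-up month numbers:

months = pl.Series("month", [1, 1, 3, 7, 7, 12])
months.cast(pl.String).cast(pl.Categorical)   # treat months as categories
months.max() - months.min()                   # treat months as numbers: 11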

Let’s think about which features are categorical in the least terns data set. To refresh your memory of what’s in the data set, take a look at the structural summary:

terns.glimpse()
Rows: 791
Columns: 43
$ year                <i64> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000
$ site_name           <str> 'PITTSBURG POWER PLANT', 'ALBANY CENTRAL AVE', 'ALAMEDA POINT', 'KETTLEMAN CITY', 'OCEANO DUNES STATE VEHICULAR RECREATION AREA', 'RANCHO GUADALUPE DUNES PRESERVE', 'VANDENBERG SFB', 'SANTA CLARA RIVER MCGRATH STATE BEACH', 'ORMOND BEACH', 'NBVC POINT MUGU'
$ site_name_2013_2018 <str> 'Pittsburg Power Plant', 'NA_NO POLYGON', 'Alameda Point', 'Kettleman', 'Oceano Dunes State Vehicular Recreation Area', 'Rancho Guadalupe Dunes Preserve', 'Vandenberg AFB', 'Santa Clara River', 'Ormond Beach', 'NBVC Point Mugu'
$ site_name_1988_2001 <str> 'NA_2013_2018 POLYGON', 'Albany Central Avenue', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON'
$ site_abbr           <str> 'PITT_POWER', 'AL_CENTAVE', 'ALAM_PT', 'KET_CTY', 'OCEANO_DUNES', 'RGDP', 'VAN_SFB', 'S_CLAR_MCG', 'ORMOND', 'PT_MUGU'
$ region_3            <str> 'S.F._BAY', 'S.F._BAY', 'S.F._BAY', 'KINGS', 'CENTRAL', 'CENTRAL', 'CENTRAL', 'SOUTHERN', 'SOUTHERN', 'SOUTHERN'
$ region_4            <str> 'S.F._BAY', 'S.F._BAY', 'S.F._BAY', 'KINGS', 'CENTRAL', 'CENTRAL', 'CENTRAL', 'VENTURA', 'VENTURA', 'VENTURA'
$ event               <str> 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA'
$ bp_min              <f64> 15.0, 6.0, 282.0, 2.0, 4.0, 9.0, 30.0, 21.0, 73.0, 166.0
$ bp_max              <f64> 15.0, 12.0, 301.0, 3.0, 5.0, 9.0, 32.0, 21.0, 73.0, 167.0
$ fl_min              <i64> 16, 1, 200, 1, 4, 17, 11, 9, 60, 64
$ fl_max              <i64> 18, 1, 230, 2, 4, 17, 11, 9, 65, 64
$ total_nests         <i64> 15, 20, 312, 3, 5, 9, 32, 22, 73, 252
$ nonpred_eggs        <i64> 3, None, 124, None, 2, 0, None, 4, 2, None
$ nonpred_chicks      <i64> 0, None, 81, 3, 0, 1, 27, 3, 0, None
$ nonpred_fl          <i64> 0, None, 2, 1, 0, 0, 0, None, 0, None
$ nonpred_ad          <i64> 0, None, 1, 6, 0, 0, 0, None, 0, None
$ pred_control        <str> None, None, None, None, None, None, None, None, None, None
$ pred_eggs           <i64> 4, None, 17, None, 0, None, 0, None, None, None
$ pred_chicks         <i64> 2, None, 0, None, 4, None, 3, None, None, None
$ pred_fl             <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ pred_ad             <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ pred_pefa           <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_coy_fox        <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_meso           <str> 'N', None, 'N', None, 'N', None, 'N', None, 'Y', None
$ pred_owlspp         <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_corvid         <str> 'Y', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_other_raptor   <str> 'Y', None, 'Y', None, 'N', None, 'Y', None, 'Y', None
$ pred_other_avian    <str> 'N', None, 'Y', None, 'Y', None, 'N', None, 'N', None
$ pred_misc           <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ total_pefa          <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_coy_fox       <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_meso          <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_owlspp        <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_corvid        <i64> 4, None, 0, None, 0, None, 0, None, None, None
$ total_other_raptor  <i64> 2, None, 6, None, 0, None, 3, None, None, None
$ total_other_avian   <i64> 0, None, 11, None, 4, None, 0, None, None, None
$ total_misc          <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ first_observed      <str> '2000-05-11', None, '2000-05-01', '2000-06-10', '2000-05-04', '2000-05-07', '2000-05-07', '2000-06-06', None, '2000-05-21'
$ last_observed       <str> '2000-08-05', None, '2000-08-19', '2000-09-24', '2000-08-30', '2000-08-13', '2000-08-17', '2000-09-05', None, '2000-08-12'
$ first_nest          <str> '2000-05-26', None, '2000-05-16', '2000-06-17', '2000-05-28', '2000-05-31', '2000-05-28', '2000-06-06', '2000-06-08', '2000-06-01'
$ first_chick         <str> '2000-06-18', None, '2000-06-07', '2000-07-22', '2000-06-20', '2000-06-22', '2000-06-20', '2000-06-28', '2000-06-26', '2000-06-24'
$ first_fledge        <str> '2000-07-08', None, '2000-06-30', '2000-08-06', '2000-07-13', '2000-07-20', '2000-07-15', '2000-07-24', '2000-07-17', '2000-07-16'

The site_name, site_abbr, and event columns are all examples of categorical data. The region_ columns and some of the pred_ columns also contain categorical data.

One way to check whether a feature is useful for grouping (and thus effectively categorical) is to count the number of times each value appears. For a series, you can do this with the .value_counts method. For instance, to count the number of times each category of event appears:

terns["event"].value_counts()
shape: (3, 2)
event      count
str        u32
"LA_NINA"  258
"NEUTRAL"  413
"EL_NINO"  120

Features with only a few unique values, repeated many times, are ideal for grouping. Numerical features, like total_nests, usually aren’t good for grouping, both because of what they measure and because they tend to have many unique values, which leads to very small groups.
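One quick way to compare, as a sketch, is the .n_unique method:

terns["event"].n_unique()         # 3 unique values, so grouping makes sense
terns["total_nests"].n_unique()   # many unique values, so groups would be tiny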

The year column can be treated as categorical or quantitative data. It’s easy to imagine grouping observations by year, but years are also numerical: they have an order and we might want to do math on them. The most appropriate type for year depends on how we want to use it for analysis.

You can cast a column to the Categorical type with the .cast method. Try this for the event column:

event = terns["event"].cast(pl.Categorical)
event
shape: (791,)
event
cat
"LA_NINA"
"LA_NINA"
"LA_NINA"
"LA_NINA"
"LA_NINA"
"LA_NINA"
"LA_NINA"
"LA_NINA"
"LA_NINA"
"LA_NINA"

Polars organizes attributes and methods for categorical data under the .cat attribute of series. These raise errors if the element type of the series is not Categorical (or Enum). You can get the categories of a categorical series with the .cat.get_categories method:

event.cat.get_categories()
shape: (3,)
event
str
"LA_NINA"
"NEUTRAL"
"EL_NINO"

A categorical series remembers all possible categories even if you take a subset where some of the categories aren’t present:

event[:3]
shape: (3,)
event
cat
"LA_NINA"
"LA_NINA"
"LA_NINA"
event[:3].cat.get_categories()
shape: (3,)
event
str
"LA_NINA"
"NEUTRAL"
"EL_NINO"

This is one way the Categorical type is different from the String type, and ensures that when you, for example, plot a categorical series, missing categories are represented.

Note

The Categorical and Enum types both represent categorical data. The Categorical type is more flexible, allowing you to add categories as needed. The Enum type is more memory-efficient, but requires that you specify all possible categories up front. In practice, the Categorical type is more convenient for interactive work.
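If you do know all of the categories up front, here’s a minimal sketch of the Enum type (assuming a recent Polars release that provides pl.Enum):

event_type = pl.Enum(["LA_NINA", "NEUTRAL", "EL_NINO"])
terns["event"].cast(event_type)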

2.3.4. Broadcasting#

If you use an arithmetic operator on a series, Polars broadcasts the operation to each element:

x = pl.Series([1, 3, 0])
x - 3
shape: (3,)
i64
-2
0
-3

The result is the same as if you had applied the operation element-by-element. That is:

pl.Series([1 - 3, 3 - 3, 0 - 3])
shape: (3,)
i64
-2
0
-3

Most NumPy (and SciPy) functions also broadcast. For instance:

import numpy as np

x = pl.Series([1.0, 3.0, 0.0, np.pi])
np.sin(x)
shape: (4,)
f64
0.841471
0.14112
0.0
1.2246e-16

Some examples of functions that broadcast are np.sin, np.cos, np.tan, np.log, np.exp, and np.sqrt.

NumPy functions that combine or aggregate values usually don’t broadcast. For example, np.sum, np.mean, and np.median don’t broadcast.
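Here’s a small sketch contrasting the two behaviors, using the series’ own .sum method for the aggregation:

x = pl.Series([1.0, 4.0, 9.0])
np.sqrt(x)   # broadcasts element by element: 1.0, 2.0, 3.0
x.sum()      # aggregates the whole series to a single number: 14.0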

Tip

Broadcasting is the counterpart to comprehensions (introduced in Comprehensions). Both are highly efficient. Generally, you should:

  • Use broadcasting with data structures that support it, such as series and NumPy arrays (explained in NumPy Arrays).

  • Use comprehensions with lists and Python’s other built-in data structures (explained in Built-in Data Structures).

A function can broadcast across multiple arguments. To demonstrate this, suppose we want to estimate the number of nests per breeding pair for the least terns data. The total_nests column contains the total number of nests at each site, and the bp_max column contains the maximum reported number of breeding pairs. So to compute nests per breeding pair:

terns["total_nests"] / terns["bp_max"]
shape: (791,)
total_nests
f64
1.0
1.666667
1.036545
1.0
1.0
1.113354
1.157895
1.092593
1.170732
1.036364

The elements are paired up and divided according to their positions. Notice that the result is a Float64 series. The total_nests column is an Int64 series, so besides broadcasting, this example also demonstrates that series are subject to implicit coercion (introduced in Coercion & Casting).

If you try to broadcast a function across two series of different lengths, Polars raises an error:

x = pl.Series([1, 2])
y = pl.Series([9, 8, 7])
x - y
---------------------------------------------------------------------------
InvalidOperationError                     Traceback (most recent call last)
Cell In[56], line 3
      1 x = pl.Series([1, 2])
      2 y = pl.Series([9, 8, 7])
----> 3 x - y

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/series/series.py:1077, in Series.__sub__(self, other)
   1075 if self.dtype.is_decimal() and isinstance(other, (float, int)):
   1076     return self.to_frame().select(F.col(self.name) - other).to_series()
-> 1077 return self._arithmetic(other, "sub", "sub_<>")

File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/series/series.py:1011, in Series._arithmetic(self, other, op_s, op_ffi)
   1008     other = pl.Series("", [None])
   1010 if isinstance(other, Series):
-> 1011     return self._from_pyseries(getattr(self._s, op_s)(other._s))
   1012 elif _check_for_numpy(other) and isinstance(other, np.ndarray):
   1013     return self._from_pyseries(getattr(self._s, op_s)(Series(other)._s))

InvalidOperationError: cannot do arithmetic operation on series of different lengths: got 2 and 3

2.4. Other Data Structures#

In this section, you’ll learn about several one-dimensional data structures that are fundamental to programming in Python.

2.4.1. Built-in Data Structures#

Besides lists, Python provides several other useful data structures:

  • Like lists, tuples are ordered and heterogeneous. The main difference is that tuples are immutable: once you create a tuple, you can’t change it. This makes tuples safer and more efficient than lists.

    You can make a tuple by enclosing comma-separated values in parentheses ():

    (True, 1, "hi")
    

    You can cast other data structures to a tuple with the tuple function. Use a tuple when the number of elements is constant and known in advance.

  • A set is unordered and heterogeneous. As in a mathematical set, the elements in a set must be unique. Python automatically discards any duplicates added to a set. Sets support set theoretic operations such as unions and intersections.

    You can make a set by enclosing comma-separated values in curly braces {}:

    {True, 1, "hi"}
    

    You can convert other data structures to a set with the set function. Use a set when you need a guarantee that the elements are unique.

  • A dict is an ordered, heterogeneous collection of key-value pairs. Keys must be distinct and many different types of keys are valid. The indexing operator [] gets elements by key rather than position.

    You can make a dict by enclosing comma-separated key: value pairs in curly braces {}:

    {"hi": -3.5}
    

    Use a dict when you need to index elements by something other than position or need a mapping from one collection of data to another.
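Here’s a minimal sketch of the three structures just described, with made-up values:

point = (3, 4)                     # tuple: a fixed-size record
genres = {"rock", "pop", "rock"}   # set: the duplicate "rock" is discarded
counts = {"rock": 10, "pop": 4}    # dict: indexed by key rather than position
counts["rock"]
10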

See also

Python’s official documentation provides more details about what you can do with these data structures.

2.4.2. NumPy Arrays#

An array (or ndarray) is an ordered, homogeneous data structure, similar to a series. Arrays are a fundamental data structure in NumPy.

You can create an array with the np.array function and a list of elements:

x = np.array([10, 20, 30])
x
array([10, 20, 30])

You can convert an array into a series with the pl.Series function:

pl.Series(x)
shape: (3,)
i64
10
20
30

Conversely, you can convert a series to an array with the .to_numpy method:

terns["total_nests"][:5].to_numpy()
array([ 15,  20, 312,   3,   5])

Tip

Series tend to be a good choice for data analysis, while arrays tend to be a good choice for sophisticated mathematical computations (such as simulations).

Note

NumPy uses its own data types for array elements, for many of the same reasons Polars does for series elements. The NumPy documentation has more details.

NumPy is primarily designed for numerical computing, so working with strings in NumPy can be tricky. See the documentation for details about its string types. If you need to work with strings, Polars is more convenient than NumPy.

2.5. Special Values#

2.5.1. None#

In Python, None represents an absent or undefined value. It is useful:

  1. As a way to explicitly indicate a value is absent.

  2. As the return value for functions that are useful for their side effects and don’t need to return anything.

  3. As a default argument for optional parameters in functions.
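The third use is easy to demonstrate with a hypothetical function (a minimal sketch):

def greet(name=None):
    if name is None:
        name = "world"
    return "Hello, " + name + "!"

greet()
'Hello, world!'
greet("Ada")
'Hello, Ada!'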

For example, Python’s built-in print function, which prints a string to the console, returns None:

print("Hello!")
Hello!

The Python console doesn’t print anything when an expression produces None:

None

None is the only value of type NoneType:

type(None)
NoneType

You can check if a value is None with Python’s is keyword:

x = None
x is None
True

2.5.2. Missing Values#

In the least terns data set, notice that some of the entries are null. For instance, look at the second element of the nonpred_eggs column:

terns.head()
shape: (5, 43)
(wide 43-column preview omitted; in the second row, nonpred_eggs is null)

Polars uses null, called the missing value, to represent missing entries in a data set. Entries are usually missing because of how the data was collected, although there are exceptions. As an example, imagine the data came from a survey, and respondents chose not to answer some questions. In the data set, their answers for those questions can be recorded as null.

The missing value null is a chameleon: it can appear in a series of any element type. Polars implicitly converts null to and from None when you get or set an element in a series. This means you can use None to create a series with null elements:

x = pl.Series([1, 2, None])
x
shape: (3,)
i64
1
2
null

And you get back None if you access a null element:

terns["nonpred_eggs"][1]

The missing value null is also contagious: it represents an unknown quantity, so computing on it usually produces another missing value. The idea is that if the inputs to a computation are unknown, generally so is the output:

x - 3
shape: (3,)
i64
-2
-1
null

Polars makes an exception for aggregation functions, which automatically filter out missing values:

x.mean()
1.5

You can use the .is_null method to test if elements of a series are null:

x.is_null()
shape: (3,)
bool
false
false
true

Polars also provides an .is_not_null method and a .fill_null method to fill missing values with a different value.
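Here’s a quick sketch of both, continuing with the series x from above:

x.is_not_null()   # true, true, false
x.fill_null(0)    # 1, 2, 0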

2.5.3. Infinity#

NumPy (and Polars) use np.inf to represent infinity. It is of type float. You’re most likely to encounter it as the result of certain computations:

pl.Series([13]) / 0
shape: (1,)
f64
inf

You can use the .is_infinite method to test if elements of a series are infinite:

x = pl.Series([1.0, 2.0, np.inf])
x.is_infinite()
shape: (3,)
bool
false
false
true

2.5.4. Not a Number#

NumPy (and Polars) use np.nan, called not a number and also written NaN, to represent mathematically undefined results. It is of type float. As an example, dividing 0 by 0 is undefined:

pl.Series([0]) / 0
shape: (1,)
f64
NaN

You can use the .is_nan method to test if elements of a series are NaN:

x = pl.Series([0, 1, 2]) / 0
x.is_nan()
shape: (3,)
bool
true
false
false

2.6. Data Frames#

2.6.1. Selecting Columns#

An excellent starting point for selecting and transforming columns in a data frame is the .select method. Summarizing Columns already showed how to select a single column by name with the indexing operator [], but the .select method is much more flexible. You can use it to select multiple columns at once, by name or type, and can transform or rename them.

As with the indexing operator, you can use .select to select a single column by providing the column name as an argument. Here’s an example (with .head to limit the output):

terns.select("year").head()
shape: (5, 1)
year
i64
2000
2000
2000
2000
2000

Unlike the indexing operator, .select returns a data frame rather than a series.

You can also select multiple columns this way:

terns.select("year", "site_name").head()
shape: (5, 2)
year  site_name
i64   str
2000  "PITTSBURG POWER PLANT"
2000  "ALBANY CENTRAL AVE"
2000  "ALAMEDA POINT"
2000  "KETTLEMAN CITY"
2000  "OCEANO DUNES STATE VEHICULAR R…

The .select method is flexible because it can evaluate a Polars expression: instructions for how to select or transform data. One way to create an expression is with the pl.col function, which represents a column or set of columns. So another way to select the year and site_name columns from the least terns data is:

terns.select(
    pl.col("year", "site_name")
).head()
shape: (5, 2)
year  site_name
i64   str
2000  "PITTSBURG POWER PLANT"
2000  "ALBANY CENTRAL AVE"
2000  "ALAMEDA POINT"
2000  "KETTLEMAN CITY"
2000  "OCEANO DUNES STATE VEHICULAR R…

An advantage of using pl.col is that you’re not limited to selecting columns by name: you can also select columns by type. Here’s how to get only the Int64 and Float64 columns in the least terns data:

terns.select(
    pl.col(pl.Int64, pl.Float64)
).head()
shape: (5, 22)
(wide 22-column preview of the Int64 and Float64 columns omitted)

Selecting columns this way is useful for doing things like computing summaries of only numeric columns. In fact, to compute widely-used summaries, like the mean, all you need to do is call the corresponding method on the Polars expression:

terns.select(
    pl.col(pl.Int64, pl.Float64).mean()
).head()
shape: (1, 22)
(single row of means for the 22 numeric columns omitted)

A third way to select columns with pl.col is with a pattern. Patterns are strings, and must begin with a caret ^ and end with a dollar sign $. Within a pattern, you can use .* as a wild card that matches any characters.

Note

Technically, Polars’ patterns are regular expressions, a widely-used language for describing patterns in text. You can learn more about regular expressions in the Date & String Processing chapter of DataLab’s Intermediate Python workshop reader.

As a motivating example for patterns, most of the columns with names that start with pred_ are categorical but currently have string elements. It would be good to cast them to the Categorical type. To begin, select all of the columns with names that start with pred_:

terns.select(
    pl.col("^pred_.*$")
).head()
shape: (5, 13)
(wide 13-column preview of the pred_ columns omitted)

There are a few columns in the result, such as pred_eggs, with Int64 elements. These columns aren’t categorical, so we should exclude them before casting. You can exclude columns from an expression with the .exclude method:

terns.select(
    pl.col("^pred_.*$").exclude(pl.Int64).cast(pl.Categorical)
).head()
shape: (5, 9)
(wide 9-column preview omitted; every remaining pred_ column now has type cat)

To make this change permanent, we need to reassign the terns data frame. The .select method only returns the selected columns, so assigning the result to terns would mean losing all of the other columns.

Instead of using .select, you can use .with_columns to transform some columns but return all of the columns. In all other respects, .with_columns works the same way as .select. So to make the cast permanent:

terns = terns.with_columns(
    pl.col("^pred_.*$").exclude(pl.Int64).cast(pl.Categorical)
)

terns.head()
shape: (5, 43)
(wide 43-column preview omitted; the non-numeric pred_ columns now have type cat)

Tip

In general, choose:

  • .select if you only want to get back the selected columns.

  • .with_columns if you want to get back all of the columns.

The .select method is also useful for testing expressions before switching to the .with_columns method.

You can use columns to transform other columns. As a final example, suppose we want to compute eggs per breeding pair and nests per breeding pair for the least terns data. Non-predated egg counts are in the nonpred_eggs column and nest counts are in the total_nests column. For now, we’ll use the maximum reported breeding pairs, bp_max, as the number of breeding pairs:

terns.select(
    pl.col("nonpred_eggs", "total_nests") / pl.col("bp_max")
).head()
shape: (5, 2)
nonpred_eggs  total_nests
f64           f64
0.2           1.0
null          1.666667
0.41196       1.036545
null          1.0
0.4           1.0

There’s also a bp_min column with the minimum reported breeding pairs. To be thorough, we should compute the rates with both bp_min and bp_max, not just bp_max. We’ll also need to rename the resulting columns, so that each column has a unique name. You can use the .alias method to rename a single column, or the .name.prefix and .name.suffix methods, respectively, to prefix or suffix a column’s name. Let’s add a suffix to the column names to identify which breeding pair column was used:

terns.select(
    (
        pl.col("nonpred_eggs", "total_nests") / pl.col("bp_max")
    ).name.suffix("_per_bp_max"),
    (
        pl.col("nonpred_eggs", "total_nests") / pl.col("bp_min")
    ).name.suffix("_per_bp_min")
).head()
shape: (5, 4)
nonpred_eggs_per_bp_max  total_nests_per_bp_max  nonpred_eggs_per_bp_min  total_nests_per_bp_min
f64                      f64                     f64                      f64
0.2                      1.0                     0.2                      1.0
null                     1.666667                null                    3.333333
0.41196                  1.036545                0.439716                1.106383
null                     1.0                     null                    1.5
0.4                      1.0                     0.5                     1.25

See also

Much more is possible with Polars expressions and the .select and .with_columns methods. See the Polars User Guide for details.

2.6.2. Filtering Rows#

Filtering the rows of a data frame is the counterpart to selecting columns. The .filter method filters rows based on one or more conditions: expressions that evaluate to a series of Boolean values.

As an example, suppose we want to find all sites in the least terns data where the number of nests in the total_nests column is greater than 5. Here’s the code:

terns.filter(pl.col("total_nests") > 5).head()
shape: (5, 43)
(the first 5 matching rows, across all 43 columns, are not reproduced here)

If we only want the site names, we can chain this with a call to .select and use the .unique method on the site_name column:

terns.filter(
    pl.col("total_nests") > 5
).select(
    pl.col("site_name").unique()
)
shape: (40, 1)
┌──────────────────────────────────┐
│ site_name                        │
│ ---                              │
│ str                              │
╞══════════════════════════════════╡
│ "NAVAL AMPHIBIOUS BASE CORONADO" │
│ "LA HARBOR"                      │
│ "SAN PABLO BAY NWR"              │
│ "HOLLYWOOD BEACH"                │
│ "BATIQUITOS LAGOON ECOLOGICAL R… │
│ …                                │
│ "PITTSBURG POWER PLANT"          │
│ "SANTA CLARA RIVER MCGRATH STAT… │
│ "SEAL BEACH NWR ANAHEIM BAY"     │
│ "EDEN LANDING ECOLOGICAL RESERV… │
│ "MALIBU LAGOON"                  │
└──────────────────────────────────┘

We can conclude that 40 sites reported more than 5 nests at some point.
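
Rather than counting the rows by eye, you can also ask Polars for the count directly. Here's a sketch using the .n_unique expression method, which counts distinct values; it should return a single-value data frame containing the count:

terns.filter(
    pl.col("total_nests") > 5
).select(
    # Count the distinct site names among the matching rows
    pl.col("site_name").n_unique()
)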

2.6.2.1. Logic Operators#

Series with Boolean elements, such as conditions, can be inverted or combined with logic operators. All of the logic operators broadcast, operating on series element by element. For demonstration, we'll use the following series:

x1 = pl.Series([True, False, True, False])
x2 = pl.Series([True, True, False, False])

The NOT operator ~ inverts values, so True becomes False and False becomes True:

~x1
shape: (4,)
Series: '' [bool]
[
    false
    true
    false
    true
]

The OR operator | combines two values, returning True unless both values are False:

x1 | x2
shape: (4,)
Series: '' [bool]
[
    true
    true
    true
    false
]

The AND operator & combines two values, returning False unless both values are True:

x1 & x2
shape: (4,)
Series: '' [bool]
[
    true
    false
    false
    false
]

Caution

The logic operators ~, |, and & are meant for Polars series, NumPy arrays, and other homogeneous data structures. If you use them on Python's built-in bool values, they act as bitwise integer operators, so you may get an unexpected result (for example, ~True evaluates to -2).

Python instead uses the keywords not, or, and and as the logic operators on bool values. Polars, NumPy, and other packages can't use these keywords because Python doesn't allow their behavior to be customized for other data structures.
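
For comparison, here's a short sketch of the keyword operators on plain bool values. Trying x1 and x2 with the series defined above would raise an error, because Python first tries to reduce each series to a single True or False:

not True          # False
True or False     # True
True and False    # False

# The next line is commented out because it raises an error:
# Python can't reduce a 4-element series to a single bool.
# x1 and x2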

2.6.2.2. Multiple Conditions#

As a final example, let's filter the least terns data with multiple conditions. We'll get all rows for 2023 where more than 10 fledglings were reported. We'll use the fl_min column for the minimum reported fledgling count. Here's the call to .filter:

terns.filter(
    (pl.col("year") == 2023) &
    (pl.col("fl_min") > 10)
)
shape: (14, 43)
(the 14 matching rows, across all 43 columns, are not reproduced here)

This gives us 14 sites that reported more than 10 fledglings in 2023. Notice that we had to put each condition in parentheses so that Python gets the order of operations right: the & operator has higher precedence than comparisons like == and >.
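
As an aside, recent versions of Polars also accept multiple conditions passed to .filter as separate arguments and combine them with AND, so the same filter could be written without the explicit parentheses and & (a sketch, assuming a reasonably current version of Polars):

terns.filter(
    pl.col("year") == 2023,
    pl.col("fl_min") > 10
)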

2.7. Exercises#

2.7.1. Exercise#

Python’s range function offers another way to create a sequence of numbers. Read the help file for this function.

  1. Create an example range. How does this differ from a list?

  2. Describe the three arguments that you can use in range. Give examples of each.

  3. Convert one of those ranges to a list and print it to screen. What changes in the way Python represents this sequence?

2.7.2. Exercise#

Return to the discussion in Coercion & Casting.

  1. Why does "3" + 4 raise an error?

  2. Why does True - 1 return 0?

  3. Why does int(4.6) < 4.6 return True?

2.7.3. Exercise#

  1. Create a new data frame from the least terns data with the following characteristics:

    • Each entry’s year is between 2010 and 2019 (inclusive).

    • Each entry reports at least 100 breeding pairs.

    • The columns are year, site_name, bp_min, bp_max, and total_nests.

    Use this data frame for the remaining questions.

  2. Count the number of entries for each site. How many sites reported at least 100 breeding pairs in each of the 10 years?

  3. Which site-year combination has the highest number of nests?