2. Data Types & Structures#
Learning Objectives
After this lesson, you should be able to:
Check the type of an object
Cast an object to a different type
Describe and differentiate lists, series, tuples, sets, dicts, and arrays
Explain what a comprehension is
Identify and cast categorical data
Explain what broadcasting is
Describe and differentiate `None`, `null`, and `NaN`
Locate missing values in a series
Select columns of a data frame
Filter rows of a data frame on a condition
Negate or combine conditions with logic operators
The previous chapter introduced Python, providing enough background to do simple computations on data sets. This chapter focuses on the foundational knowledge and skills you’ll need to use Python effectively in the long term. Specifically, it’s a deep dive into data types and data structures in Python and Polars. Working knowledge of these will make you more effective at analyzing data and solving problems.
2.1. Data Types#
In Summarizing Data, we used the `.glimpse` method to get a structural summary of the California least tern data set:
terns.glimpse()
Rows: 791
Columns: 43
$ year <i64> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000
$ site_name <str> 'PITTSBURG POWER PLANT', 'ALBANY CENTRAL AVE', 'ALAMEDA POINT', 'KETTLEMAN CITY', 'OCEANO DUNES STATE VEHICULAR RECREATION AREA', 'RANCHO GUADALUPE DUNES PRESERVE', 'VANDENBERG SFB', 'SANTA CLARA RIVER MCGRATH STATE BEACH', 'ORMOND BEACH', 'NBVC POINT MUGU'
$ site_name_2013_2018 <str> 'Pittsburg Power Plant', 'NA_NO POLYGON', 'Alameda Point', 'Kettleman', 'Oceano Dunes State Vehicular Recreation Area', 'Rancho Guadalupe Dunes Preserve', 'Vandenberg AFB', 'Santa Clara River', 'Ormond Beach', 'NBVC Point Mugu'
$ site_name_1988_2001 <str> 'NA_2013_2018 POLYGON', 'Albany Central Avenue', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON'
$ site_abbr <str> 'PITT_POWER', 'AL_CENTAVE', 'ALAM_PT', 'KET_CTY', 'OCEANO_DUNES', 'RGDP', 'VAN_SFB', 'S_CLAR_MCG', 'ORMOND', 'PT_MUGU'
$ region_3 <str> 'S.F._BAY', 'S.F._BAY', 'S.F._BAY', 'KINGS', 'CENTRAL', 'CENTRAL', 'CENTRAL', 'SOUTHERN', 'SOUTHERN', 'SOUTHERN'
$ region_4 <str> 'S.F._BAY', 'S.F._BAY', 'S.F._BAY', 'KINGS', 'CENTRAL', 'CENTRAL', 'CENTRAL', 'VENTURA', 'VENTURA', 'VENTURA'
$ event <str> 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA'
$ bp_min <f64> 15.0, 6.0, 282.0, 2.0, 4.0, 9.0, 30.0, 21.0, 73.0, 166.0
$ bp_max <f64> 15.0, 12.0, 301.0, 3.0, 5.0, 9.0, 32.0, 21.0, 73.0, 167.0
$ fl_min <i64> 16, 1, 200, 1, 4, 17, 11, 9, 60, 64
$ fl_max <i64> 18, 1, 230, 2, 4, 17, 11, 9, 65, 64
$ total_nests <i64> 15, 20, 312, 3, 5, 9, 32, 22, 73, 252
$ nonpred_eggs <i64> 3, None, 124, None, 2, 0, None, 4, 2, None
$ nonpred_chicks <i64> 0, None, 81, 3, 0, 1, 27, 3, 0, None
$ nonpred_fl <i64> 0, None, 2, 1, 0, 0, 0, None, 0, None
$ nonpred_ad <i64> 0, None, 1, 6, 0, 0, 0, None, 0, None
$ pred_control <str> None, None, None, None, None, None, None, None, None, None
$ pred_eggs <i64> 4, None, 17, None, 0, None, 0, None, None, None
$ pred_chicks <i64> 2, None, 0, None, 4, None, 3, None, None, None
$ pred_fl <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ pred_ad <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ pred_pefa <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_coy_fox <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_meso <str> 'N', None, 'N', None, 'N', None, 'N', None, 'Y', None
$ pred_owlspp <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_corvid <str> 'Y', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_other_raptor <str> 'Y', None, 'Y', None, 'N', None, 'Y', None, 'Y', None
$ pred_other_avian <str> 'N', None, 'Y', None, 'Y', None, 'N', None, 'N', None
$ pred_misc <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ total_pefa <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_coy_fox <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_meso <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_owlspp <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_corvid <i64> 4, None, 0, None, 0, None, 0, None, None, None
$ total_other_raptor <i64> 2, None, 6, None, 0, None, 3, None, None, None
$ total_other_avian <i64> 0, None, 11, None, 4, None, 0, None, None, None
$ total_misc <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ first_observed <str> '2000-05-11', None, '2000-05-01', '2000-06-10', '2000-05-04', '2000-05-07', '2000-05-07', '2000-06-06', None, '2000-05-21'
$ last_observed <str> '2000-08-05', None, '2000-08-19', '2000-09-24', '2000-08-30', '2000-08-13', '2000-08-17', '2000-09-05', None, '2000-08-12'
$ first_nest <str> '2000-05-26', None, '2000-05-16', '2000-06-17', '2000-05-28', '2000-05-31', '2000-05-28', '2000-06-06', '2000-06-08', '2000-06-01'
$ first_chick <str> '2000-06-18', None, '2000-06-07', '2000-07-22', '2000-06-20', '2000-06-22', '2000-06-20', '2000-06-28', '2000-06-26', '2000-06-24'
$ first_fledge <str> '2000-07-08', None, '2000-06-30', '2000-08-06', '2000-07-13', '2000-07-20', '2000-07-15', '2000-07-24', '2000-07-17', '2000-07-16'
The first two rows describe the shape of the data set. After that, each row
lists a column name, the type of data in that column, and that column’s first
few values. For instance, the `site_name` column contains `str`, or string, data.
We categorize data into different types based on sets of shared characteristics because types are useful for reasoning about what we can do with the data. For example, statisticians conventionally categorize data as one of four types within two larger categories:
- numeric
  - continuous (real or complex numbers)
  - discrete (integers)
- categorical
  - nominal (categories with no ordering)
  - ordinal (categories with some ordering)
Which approaches and statistical techniques are appropriate depends on the type of the data. Of course, other types of data, like graphs (networks) and natural language (books, speech, and so on), are also possible.
Most programming languages, including Python, also categorize data by type. The following table lists some of Python’s built-in types:
| Type | Example | Description |
|---|---|---|
| `bool` | `True` | Boolean values |
| `int` | `5` | Integers |
| `float` | `-8.3` | Real numbers |
| `complex` | `2+3j` | Complex numbers |
| `str` | `"hi"` | Strings |
You can check the type of an object in Python with the built-in `type` function. Take a look at the types of a few objects:
type("hi")
str
type(True)
bool
type(-8.3)
float
In Python, class is just another word for type. So we can also say that the `type` function returns the class of an object.
Note
Python provides a `class` keyword to create your own classes. Creating classes is beyond the scope of this reader, but is explained in detail in most Python programming textbooks.
Tip
You can use the `isinstance` function to test whether a value is of a particular class. For example, to test whether `5` is a string:
isinstance(5, str)
False
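You can also test against several classes at once by passing `isinstance` a tuple of types; it returns `True` if the value matches any of them:

```python
# isinstance accepts a tuple of types as its second argument and
# returns True if the value is an instance of any one of them.
print(isinstance(5, (int, float)))    # True: 5 is an int
print(isinstance("5", (int, float)))  # False: "5" is a string
```

This is handy for checks like "is this value any kind of number?"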
2.1.1. Coercion & Casting#
Although `bool`, `int`, and `float` are different types, in most situations Python will automatically convert between them as needed. For example, you can multiply a floating point number by an integer and then add a Boolean value:
n = 3.1 * 2 + True
n
7.2
First, the integer `2` is converted to a floating point number and multiplied by `3.1`, yielding `6.2`. Then the Boolean `True` is converted to a floating point number and added to `6.2`. In Python and most other programming languages, `False` corresponds to `0` and `True` corresponds to `1`. Thus the result is `7.2`, a floating point number:
type(n)
float
This automatic conversion of types is known as implicit coercion. Conversion always proceeds from less general to more general types, so that no information is lost.
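You can see the less-general-to-more-general order directly by checking the type of a few mixed-type computations:

```python
# Coercion proceeds from less general to more general:
# bool -> int -> float. The result takes the more general type.
print(type(True + 1))    # bool + int   -> int
print(type(True + 1.0))  # bool + float -> float
print(type(2 + 3.0))     # int + float  -> float
```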
Implicit coercion only applies in situations where the intent of the code is relatively unambiguous, such as arithmetic between different types of numbers (including Booleans). For example, you can’t add a number to a string, because it’s unclear what the result should be:
"hi" + 1
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[8], line 1
----> 1 "hi" + 1
TypeError: can only concatenate str (not "int") to str
A cast explicitly converts an object from one type to another, sometimes losing information. You can cast an object to a particular type with the function of the same name. For example, to cast to the `bool` type:
bool(0)
False
Or to cast to the `int` type:
int(4.67)
4
Casts are especially useful for converting to and from the `str` type:
"hi" + str(1)
'hi1'
float("7.3")
7.3
Python will raise an error if a cast is not possible. For example, this will not work:
int("Hello world!")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[13], line 1
----> 1 int("Hello world!")
ValueError: invalid literal for int() with base 10: 'Hello world!'
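Because a failed cast raises an error, code that casts untrusted input (for example, text read from a file) often wraps the cast in a `try`/`except` block to supply a fallback value instead of crashing. As a sketch, here's a small hypothetical helper (`to_int` is not part of Python or this reader):

```python
# Hypothetical helper: try to cast text to an int, returning a
# fallback value instead of raising ValueError when the cast fails.
def to_int(text, fallback=None):
    try:
        return int(text)
    except ValueError:
        return fallback

print(to_int("42"))            # 42
print(to_int("Hello world!"))  # None
print(to_int("oops", fallback=0))  # 0
```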
2.2. Lists#
A data structure is a collection of data organized in a particular way. In Python, data structures are also called containers, because they contain data. Containers make working with lots of data manageable and efficient. Data frames, introduced in the previous chapter, are an example of a two-dimensional data structure.
A list is a general-purpose one-dimensional data structure. Lists are built into Python; you don’t even need to import a module in order to use them. You can create a list by enclosing any number of comma-separated values in square brackets `[]`, like this:
x = [10, 20, 30, 40, 50]
x
[10, 20, 30, 40, 50]
Lists are ordered, which means the values, or elements, have specific positions. The first element is `10`, the second is `20`, the fifth is `50`, and so on.
The elements of a list can be of different types, so we say lists are heterogeneous. For instance, this list contains a number, a string, and another list (with one element):
li = [8, "hello", [4.2]]
li
[8, 'hello', [4.2]]
A list can have no elements, in which case we say it’s empty. For example:
empty = []
empty
[]
You can get the length of a list with the `len` function:
len(empty)
0
The `list` function converts other containers into lists. Strings are technically containers for individual characters, so:
list("data science")
['d', 'a', 't', 'a', ' ', 's', 'c', 'i', 'e', 'n', 'c', 'e']
2.2.1. Indexing#
So far you’ve learned two ways to use square brackets `[]`: to select columns from a data frame, as in `terns["year"]`, and to create lists, as in `["a", "b", 1]`.
The first case is an example of indexing, which means getting or setting elements of a data structure. The square brackets `[]` are Python’s indexing operator.
You can use indexing to get an element of a list based on the element’s position. Python uses zero-based indexing, which means the positions of elements are counted starting from 0 rather than 1. So the first element of a series is at position 0, the second is at position 1, and so on.
Note
Many programming languages use zero-based indexing. It may seem strange at first, but it makes some kinds of computations simpler by eliminating the need to add or subtract 1.
The indexing operator requires at least one argument, called the index, which goes inside of the square brackets `[]`. The index says which elements you want to get. For a data frame, you can use a position or a column name as the index. For a list, you can only use a position.
As an example, consider the list `li` we created earlier:
li = [8, "hello", [4.2]]
li
[8, 'hello', [4.2]]
The code to get the first element is:
li[0]
8
Likewise, to get the third element:
li[2]
[4.2]
The third element is a list too. If you want to get its first element, you can chain, or repeat, the indexing operator:
li[2][0]
4.2
Read this code from right to left as “get the first element of the third element of the variable `li`.”
You can use a slice to select a range of elements. The syntax for a slice is `lower:upper:stride`, where all of the arguments and the second colon `:` are optional. The lower bound defaults to 0, the upper bound defaults to the length of the list, and the stride defaults to 1. For example, to get the first two elements:
li[:2]
[8, 'hello']
As another example, you can use a slice to get every other element:
li[::2]
[8, [4.2]]
Negative values in a slice index backwards from the end of the list. For instance, to get the last 2 elements:
li[-2:]
['hello', [4.2]]
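Negative positions also work as plain indexes, not just in slices: `-1` is the last element, `-2` the second-to-last, and so on.

```python
li = [8, "hello", [4.2]]

# Negative indexes count backwards from the end of the list.
print(li[-1])  # [4.2]   (the last element)
print(li[-2])  # 'hello' (the second-to-last element)
```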
You can set an element of a list by assigning a value at a given index. So the code to change the first element of `li` to the string “hi” is:
li[0] = "hi"
li
['hi', 'hello', [4.2]]
Indexing isn’t just for lists: most of the examples in this section also apply to series, data frames, and other data structures.
2.2.2. References#
Assigning elements of a container is not without complication. Suppose you assign a list to a variable `x` and then create a new variable `y` from `x`. If you change an element of `y`, it will also change `x`:
x = [1, 2]
y = x
y[0] = 10
x
[10, 2]
This happens because of how Python handles containers. When you create a container, Python stores it in your computer’s memory. If you then assign the container to a variable, the variable points, or refers, to the location of the container in memory. If you create a second variable from the first, both will refer to the same location. As a result, operations on one variable will affect the value of the other, because there’s really only one container in memory and both variables refer to it.
The example above uses lists, but other containers—such as data frames—behave the same way. If you want to assign an independent copy of a container to a variable rather than a reference, you need to use a function or method to explicitly make a copy. Many containers have a `.copy` or `.clone` method that makes a copy:
x = [1, 2]
y = x.copy()
y[0] = 10
x
[1, 2]
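You can check whether two variables refer to the same container with Python's built-in `id` function, which returns a number identifying an object's location in memory:

```python
x = [1, 2]
y = x         # y refers to the same list as x
z = x.copy()  # z refers to a new, independent list

# Two variables refer to the same container exactly when their
# ids are equal.
print(id(x) == id(y))  # True
print(id(x) == id(z))  # False
```

The `is` operator performs the same comparison: `x is y` is equivalent to `id(x) == id(y)`.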
2.2.3. Comprehensions#
A list comprehension creates a new list from the elements of an existing list. We’ll use this list to demonstrate comprehensions:
values = [10, 11, 12, -5, 13, 14]
If you want, for example, to add 1 to each element of the list, you can use a comprehension to do it. Here’s how:
[v + 1 for v in values]
[11, 12, 13, -4, 14, 15]
In words, this code tells Python to create a new list where the elements are `v + 1` for each element `v` in `values`. The enclosing square brackets `[]` indicate that the result should be a list. More generally, the syntax for a comprehension is:
EXPRESSION for ELEMENT in CONTAINER
Replace `CONTAINER` with a data structure, `ELEMENT` with a variable name for the elements, and `EXPRESSION` with an expression to compute (typically using the elements).
For instance, you can also use a comprehension to compute the type of each element in a container:
[type(v) for v in values]
[int, int, int, int, int, int]
Comprehensions can also filter out some elements based on a condition. Suppose we only want the positive elements of the list:
[v for v in values if v > 0]
[10, 11, 12, 13, 14]
The syntax for a comprehension with a condition is:
EXPRESSION for ELEMENT in CONTAINER if CONDITION
The `CONDITION` must be an expression that evaluates to `True` or `False` (typically using the elements).
Comprehensions are an efficient way to compute (and compute on) lists. You can also use comprehensions with Python’s other built-in data structures, which you’ll learn about in Built-in Data Structures.
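Every comprehension is equivalent to a `for` loop that appends to a new list; seeing the two side by side can make the syntax easier to remember:

```python
values = [10, 11, 12, -5, 13, 14]

# A comprehension with an expression and a condition...
comp = [v + 1 for v in values if v > 0]

# ...is equivalent to this loop:
result = []
for v in values:
    if v > 0:
        result.append(v + 1)

print(comp)    # [11, 12, 13, 14, 15]
print(result)  # [11, 12, 13, 14, 15]
```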
2.3. Series#
A series is an ordered, one-dimensional data structure. Series are a fundamental data structure in Polars, because each column in a data frame is a series.
For example, in the California least tern data set, the `site_name` column is a series. Take a look at the first few elements with its `.head` method:
terns["site_name"].head()
| site_name |
| --- |
| str |
| "PITTSBURG POWER PLANT" |
| "ALBANY CENTRAL AVE" |
| "ALAMEDA POINT" |
| "KETTLEMAN CITY" |
| "OCEANO DUNES STATE VEHICULAR R… |
| "RANCHO GUADALUPE DUNES PRESERV… |
| "VANDENBERG SFB" |
| "SANTA CLARA RIVER MCGRATH STAT… |
| "ORMOND BEACH" |
| "NBVC POINT MUGU" |
Series and data frames have many attributes and methods in common; the `.head` method is one of these.

Notice that the elements of the `site_name` series are all strings. Unlike a list, in a series all elements must be of the same type, so we say series are homogeneous. A series can contain strings, integers, decimal numbers, or any of several other types of data, but not a mix of these all at once.
The other columns in the least tern data are also series. For instance, the `year` column is a series of integers:
terns["year"]
| year |
| --- |
| i64 |
| 2000 |
| 2000 |
| 2000 |
| 2000 |
| 2000 |
| … |
| 2023 |
| 2023 |
| 2023 |
| 2023 |
| 2023 |
Series can contain any number of elements, including 0 or 1 element. You can check the number of elements, or length, of a series with Python’s built-in `len` function:
len(terns["year"])
791
Since this is a column from the `terns` data frame, its length is the same as the number of rows in `terns`.
Note
You can also check the length of a series with the `.shape` attribute:
terns["year"].shape
(791,)
Python prints the value of `.shape` differently from the result of `len` because they are different types of data. Most of the time, it’s more convenient to use the `len` function to check lengths of one-dimensional objects like series, because it returns an integer.
2.3.1. Creating Series#
Sometimes you’ll want to create series by manually inputting data, perhaps because your data set isn’t digitized or because you want a toy data set to test out some code. You can create a series from a list (or other sequence) with the `pl.Series` function:
pl.Series([1, 2, 19, -3])
| i64 |
| --- |
| 1 |
| 2 |
| 19 |
| -3 |
pl.Series(["hi", "hello"])
| str |
| --- |
| "hi" |
| "hello" |
The Polars documentation recommends setting a name for every series. To do this with `pl.Series`, pass the name as the first argument and the elements as the second argument:
pl.Series("tens", [10, 20, 30])
| tens |
| --- |
| i64 |
| 10 |
| 20 |
| 30 |
Polars will print the name when you print the series and will use the name as a column name if you put the series in a data frame. You can get or set the name on a series through the `.name` attribute:
x = pl.Series("tens", [10, 20, 30])
x.name
'tens'
Tip
If you want to create a series that contains a sequence of numbers, there are several helper functions you can use. Python’s built-in `range` function creates a sequence of integers. NumPy’s `np.arange` and `np.linspace` functions can create sequences of integers or decimal numbers. You can pass the result from any of these functions to `pl.Series` to create a series.
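For example, `range` produces integers from a start (inclusive) to a stop (exclusive), with an optional step:

```python
# range(stop), range(start, stop), and range(start, stop, step)
# all produce sequences of integers; the stop value is excluded.
print(list(range(4)))         # [0, 1, 2, 3]
print(list(range(2, 12, 3)))  # [2, 5, 8, 11]
```

Passing one of these to `pl.Series` (for instance, `pl.Series("x", range(4))`) creates an integer series.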
2.3.2. Data Types in Polars#
Series are homogeneous, so if you try to create a series from elements of different types, the `pl.Series` function will raise an error:
pl.Series([1, "cool", 2.3])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:316, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
315 try:
--> 316 return constructor(name, values, strict)
317 except (TypeError, OverflowError) as e:
318 # # This retry with i64 is related to https://github.com/pola-rs/polars/issues/17231
319 # # Essentially, when given a [0, u64::MAX] then it would Overflow.
TypeError: 'str' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
Cell In[40], line 1
----> 1 pl.Series([1, "cool", 2.3])
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/series/series.py:295, in Series.__init__(self, name, values, dtype, strict, nan_to_null)
292 raise TypeError(msg)
294 if isinstance(values, Sequence):
--> 295 self._s = sequence_to_pyseries(
296 name,
297 values,
298 dtype=dtype,
299 strict=strict,
300 nan_to_null=nan_to_null,
301 )
303 elif values is None:
304 self._s = sequence_to_pyseries(name, [], dtype=dtype)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:301, in sequence_to_pyseries(name, values, dtype, strict, nan_to_null)
298 except RuntimeError:
299 return PySeries.new_from_any_values(name, values, strict=strict)
--> 301 return _construct_series_with_fallbacks(
302 constructor, name, values, dtype, strict=strict
303 )
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:329, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
325 return _construct_series_with_fallbacks(
326 PySeries.new_opt_u64, name, values, dtype, strict=strict
327 )
328 elif dtype is None:
--> 329 return PySeries.new_from_any_values(name, values, strict=strict)
330 else:
331 return PySeries.new_from_any_values_and_dtype(
332 name, values, dtype, strict=strict
333 )
TypeError: unexpected value while building Series of type Int64; found value of type String: "cool"
Hint: Try setting `strict=False` to allow passing data with mixed types.
By default, Polars infers the data type of a series’ elements from the first element. This can lead to errors you might not expect:
pl.Series([1, 9.2, 2.3])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:316, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
315 try:
--> 316 return constructor(name, values, strict)
317 except (TypeError, OverflowError) as e:
318 # # This retry with i64 is related to https://github.com/pola-rs/polars/issues/17231
319 # # Essentially, when given a [0, u64::MAX] then it would Overflow.
TypeError: 'float' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
Cell In[41], line 1
----> 1 pl.Series([1, 9.2, 2.3])
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/series/series.py:295, in Series.__init__(self, name, values, dtype, strict, nan_to_null)
292 raise TypeError(msg)
294 if isinstance(values, Sequence):
--> 295 self._s = sequence_to_pyseries(
296 name,
297 values,
298 dtype=dtype,
299 strict=strict,
300 nan_to_null=nan_to_null,
301 )
303 elif values is None:
304 self._s = sequence_to_pyseries(name, [], dtype=dtype)
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:301, in sequence_to_pyseries(name, values, dtype, strict, nan_to_null)
298 except RuntimeError:
299 return PySeries.new_from_any_values(name, values, strict=strict)
--> 301 return _construct_series_with_fallbacks(
302 constructor, name, values, dtype, strict=strict
303 )
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/_utils/construction/series.py:329, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
325 return _construct_series_with_fallbacks(
326 PySeries.new_opt_u64, name, values, dtype, strict=strict
327 )
328 elif dtype is None:
--> 329 return PySeries.new_from_any_values(name, values, strict=strict)
330 else:
331 return PySeries.new_from_any_values_and_dtype(
332 name, values, dtype, strict=strict
333 )
TypeError: unexpected value while building Series of type Int64; found value of type Float64: 9.2
Hint: Try setting `strict=False` to allow passing data with mixed types.
You can explicitly specify a data type for a series with the `pl.Series` function’s third parameter, `dtype`:
pl.Series([1, 9.2, 2.3], dtype = float)
| f64 |
| --- |
| 1.0 |
| 9.2 |
| 2.3 |
Polars uses its own data types for series elements, so that:

- It can efficiently support types of data that are not built into Python, such as categorical data.
- Every numeric type has an explicit bit size: the number of bits of memory necessary to store an element. Bit sizes appear as suffixes in the name of the type. For instance, `Float32` stores a floating point number in 32 bits.
- A special value, `null`, can be present in any series to indicate missing data. We’ll explain `null` in Special Values.
When you create or access the elements of a series, Polars silently converts between its types and Python’s built-in types. Some of the Polars types and their Python equivalents are listed in the following table:
| Type | Python Equivalent | Description |
|---|---|---|
| `Boolean` | `bool` | Boolean values |
| `Int8`, `Int16`, `Int32`, `Int64` | `int` | Integers |
| `Float32`, `Float64` | `float` | Real numbers (base-2 floating point) |
| No equivalent | `complex` | Complex numbers |
| `String` | `str` | Strings |
| `Categorical` | No equivalent | Categorical data |
The Polars documentation has the complete list.
Why does bit size matter?

By using more bits to store values, you can express a wider range of values. This is best illustrated by the integer types: `Int8` can only express values from -128 to 127 (inclusive), whereas `Int64` can express values from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
If a computation produces a value too small or too large to express as its given type, the value overflows, leading to an inaccurate result. The exact effect of overflow depends on the type, software, and hardware, but one common outcome is that the value “wraps around” to the other side of the type’s range:
pl.Series([127], dtype = pl.Int8) + 1
shape: (1,)
Series: '' [i8]
[
-128
]
The tradeoff is that the more bits you use to store values, the more memory you need. For computations that generate or process large quantities of data, as is often the case in research computing, memory efficiency is a major concern—computers have a limited amount of memory.
You can use bit size to estimate the amount of memory a series will require. For example, since a single element of a `Float64` series requires about 64 bits, the `bp_min` column in the least terns data requires roughly this many bytes:
64 * len(terns["bp_min"]) / 8 # 8 bits per byte
6328.0
For series and data frames, you can use the `.estimated_size` method to have Python do this calculation for you. When a computation runs out of memory, an estimate of how much memory is necessary can help you decide whether to change your computing strategy or get more memory.
Most of Python’s built-in data types don’t specify bit sizes, and their sizes can even vary depending on your computer’s hardware and operating system!
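In fact, Python's built-in `int` has no fixed bit size at all: it grows to hold arbitrarily large values rather than overflowing.

```python
# Unlike a fixed-size type such as Int64, Python's int simply
# grows to accommodate large values.
n = 2 ** 64  # too large for a signed 64-bit integer
print(n)                # 18446744073709551616
print(n.bit_length())   # 65
```

This flexibility comes at a cost: each Python `int` uses more memory than a fixed-size integer, which is one reason Polars stores numbers in its own fixed-size types.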
If you call Python’s `type` function on a data structure, it returns the type of the data structure:
type(terns["site_name"])
polars.series.series.Series
For a series, you can get the element type with the `.dtype` attribute:
terns["site_name"].dtype
String
Note
Data frames don’t have a `.dtype` attribute since they can consist of multiple series. Instead, they have a `.dtypes` attribute, a list with the element type for each column.

If your goal is to summarize a data frame, the `.glimpse` method is usually more convenient.
You can use the `.cast` method to cast the elements of a series to a specific type. For example, here’s how to cast the `total_nests` column to a `String` series:
terns["total_nests"].cast(pl.String)
| total_nests |
| --- |
| str |
| "15" |
| "20" |
| "312" |
| "3" |
| "5" |
| … |
| "717" |
| "44" |
| "59" |
| "48" |
| "171" |
2.3.3. Categorical Data#
A feature is categorical if it measures a qualitative category. For example, the genres `rock`, `blues`, `alternative`, `folk`, and `pop` are categories.
Polars uses the `Categorical` and `Enum` data types to represent categorical data. Visualizations and statistical models sometimes treat categorical data differently than other data types, so it’s important to make sure you have the right data type.
When it reads a data set, Polars usually can’t tell which features are categorical. That means identifying and converting the categorical features is up to you. For beginners, it can be difficult to understand whether a feature is categorical or not. The key is to think about whether you want to use the feature to divide the data into groups.
For example, if you want to know how many songs are in the `rock` genre, you first need to divide the songs by genre, and then count the number of songs in each group (or at least the `rock` group).
As a second example, months recorded as numbers can be categorical or not, depending on how you want to use them. You might want to treat them as categorical (for example, to compute max rainfall in each month) or as numbers (for example, to compute the amount of time, in months, between two events).
The bottom line is that you have to think about what you’ll be doing in the analysis. In some cases, you might treat a feature as categorical only for part of the analysis.
Let’s think about which features are categorical in least terns data set. To refresh your memory of what’s in the data set, take a look at the structural summary:
terns.glimpse()
Rows: 791
Columns: 43
$ year <i64> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000
$ site_name <str> 'PITTSBURG POWER PLANT', 'ALBANY CENTRAL AVE', 'ALAMEDA POINT', 'KETTLEMAN CITY', 'OCEANO DUNES STATE VEHICULAR RECREATION AREA', 'RANCHO GUADALUPE DUNES PRESERVE', 'VANDENBERG SFB', 'SANTA CLARA RIVER MCGRATH STATE BEACH', 'ORMOND BEACH', 'NBVC POINT MUGU'
$ site_name_2013_2018 <str> 'Pittsburg Power Plant', 'NA_NO POLYGON', 'Alameda Point', 'Kettleman', 'Oceano Dunes State Vehicular Recreation Area', 'Rancho Guadalupe Dunes Preserve', 'Vandenberg AFB', 'Santa Clara River', 'Ormond Beach', 'NBVC Point Mugu'
$ site_name_1988_2001 <str> 'NA_2013_2018 POLYGON', 'Albany Central Avenue', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON', 'NA_2013_2018 POLYGON'
$ site_abbr <str> 'PITT_POWER', 'AL_CENTAVE', 'ALAM_PT', 'KET_CTY', 'OCEANO_DUNES', 'RGDP', 'VAN_SFB', 'S_CLAR_MCG', 'ORMOND', 'PT_MUGU'
$ region_3 <str> 'S.F._BAY', 'S.F._BAY', 'S.F._BAY', 'KINGS', 'CENTRAL', 'CENTRAL', 'CENTRAL', 'SOUTHERN', 'SOUTHERN', 'SOUTHERN'
$ region_4 <str> 'S.F._BAY', 'S.F._BAY', 'S.F._BAY', 'KINGS', 'CENTRAL', 'CENTRAL', 'CENTRAL', 'VENTURA', 'VENTURA', 'VENTURA'
$ event <str> 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA', 'LA_NINA'
$ bp_min <f64> 15.0, 6.0, 282.0, 2.0, 4.0, 9.0, 30.0, 21.0, 73.0, 166.0
$ bp_max <f64> 15.0, 12.0, 301.0, 3.0, 5.0, 9.0, 32.0, 21.0, 73.0, 167.0
$ fl_min <i64> 16, 1, 200, 1, 4, 17, 11, 9, 60, 64
$ fl_max <i64> 18, 1, 230, 2, 4, 17, 11, 9, 65, 64
$ total_nests <i64> 15, 20, 312, 3, 5, 9, 32, 22, 73, 252
$ nonpred_eggs <i64> 3, None, 124, None, 2, 0, None, 4, 2, None
$ nonpred_chicks <i64> 0, None, 81, 3, 0, 1, 27, 3, 0, None
$ nonpred_fl <i64> 0, None, 2, 1, 0, 0, 0, None, 0, None
$ nonpred_ad <i64> 0, None, 1, 6, 0, 0, 0, None, 0, None
$ pred_control <str> None, None, None, None, None, None, None, None, None, None
$ pred_eggs <i64> 4, None, 17, None, 0, None, 0, None, None, None
$ pred_chicks <i64> 2, None, 0, None, 4, None, 3, None, None, None
$ pred_fl <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ pred_ad <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ pred_pefa <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_coy_fox <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_meso <str> 'N', None, 'N', None, 'N', None, 'N', None, 'Y', None
$ pred_owlspp <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_corvid <str> 'Y', None, 'N', None, 'N', None, 'N', None, 'N', None
$ pred_other_raptor <str> 'Y', None, 'Y', None, 'N', None, 'Y', None, 'Y', None
$ pred_other_avian <str> 'N', None, 'Y', None, 'Y', None, 'N', None, 'N', None
$ pred_misc <str> 'N', None, 'N', None, 'N', None, 'N', None, 'N', None
$ total_pefa <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_coy_fox <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_meso <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_owlspp <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ total_corvid <i64> 4, None, 0, None, 0, None, 0, None, None, None
$ total_other_raptor <i64> 2, None, 6, None, 0, None, 3, None, None, None
$ total_other_avian <i64> 0, None, 11, None, 4, None, 0, None, None, None
$ total_misc <i64> 0, None, 0, None, 0, None, 0, None, None, None
$ first_observed <str> '2000-05-11', None, '2000-05-01', '2000-06-10', '2000-05-04', '2000-05-07', '2000-05-07', '2000-06-06', None, '2000-05-21'
$ last_observed <str> '2000-08-05', None, '2000-08-19', '2000-09-24', '2000-08-30', '2000-08-13', '2000-08-17', '2000-09-05', None, '2000-08-12'
$ first_nest <str> '2000-05-26', None, '2000-05-16', '2000-06-17', '2000-05-28', '2000-05-31', '2000-05-28', '2000-06-06', '2000-06-08', '2000-06-01'
$ first_chick <str> '2000-06-18', None, '2000-06-07', '2000-07-22', '2000-06-20', '2000-06-22', '2000-06-20', '2000-06-28', '2000-06-26', '2000-06-24'
$ first_fledge <str> '2000-07-08', None, '2000-06-30', '2000-08-06', '2000-07-13', '2000-07-20', '2000-07-15', '2000-07-24', '2000-07-17', '2000-07-16'
The site_name, site_abbr, and event columns are all examples of categorical data. The region_ columns and some of the pred_ columns also contain categorical data.
One way to check whether a feature is useful for grouping (and thus effectively
categorical) is to count the number of times each value appears. For a series,
you can do this with the .value_counts
method. For instance, to count the
number of times each category of event
appears:
terns["event"].value_counts()
event | count |
---|---|
str | u32 |
"LA_NINA" | 258 |
"NEUTRAL" | 413 |
"EL_NINO" | 120 |
Features with only a few unique values, repeated many times, are ideal for grouping. Numerical features, like total_nests, usually aren’t good for grouping, both because of what they measure and because they tend to have many unique values, which leads to very small groups.
The year
column can be treated as categorical or quantitative data. It’s easy
to imagine grouping observations by year, but years are also numerical: they
have an order and we might want to do math on them. The most appropriate type
for year
depends on how we want to use it for analysis.
You can cast a column to the Categorical
type with the .cast
method. Try
this for the event
column:
event = terns["event"].cast(pl.Categorical)
event
event |
---|
cat |
"LA_NINA" |
"LA_NINA" |
"LA_NINA" |
"LA_NINA" |
"LA_NINA" |
… |
"LA_NINA" |
"LA_NINA" |
"LA_NINA" |
"LA_NINA" |
"LA_NINA" |
Polars organizes attributes and methods for categorical data under the .cat attribute of series. These raise errors if the element type of the series is not Categorical (or Enum). You can get the categories of a categorical series with the .cat.get_categories method:
event.cat.get_categories()
event |
---|
str |
"LA_NINA" |
"NEUTRAL" |
"EL_NINO" |
A categorical series remembers all possible categories even if you take a subset where some of the categories aren’t present:
event[:3]
event |
---|
cat |
"LA_NINA" |
"LA_NINA" |
"LA_NINA" |
event[:3].cat.get_categories()
event |
---|
str |
"LA_NINA" |
"NEUTRAL" |
"EL_NINO" |
This is one way the Categorical
type is different from the String
type, and
ensures that when you, for example, plot a categorical series, missing
categories are represented.
Note
The Categorical
and Enum
types both represent categorical data. The
Categorical
type is more flexible, allowing you to add categories as needed.
The Enum
type is more memory-efficient, but requires that you specify all
possible categories up front. In practice, the Categorical
type is more
convenient for interactive work.
2.3.4. Broadcasting#
If you use an arithmetic operator on a series, Polars broadcasts the operation to each element:
x = pl.Series([1, 3, 0])
x - 3
i64 |
-2 |
0 |
-3 |
The result is the same as if you had applied the operation element-by-element. That is:
pl.Series([1 - 3, 3 - 3, 0 - 3])
i64 |
-2 |
0 |
-3 |
Most NumPy (and SciPy) functions also broadcast. For instance:
import numpy as np
x = pl.Series([1.0, 3.0, 0.0, np.pi])
np.sin(x)
f64 |
0.841471 |
0.14112 |
0.0 |
1.2246e-16 |
Some examples of functions that broadcast are np.sin, np.cos, np.tan, np.log, np.exp, and np.sqrt.
NumPy functions that combine or aggregate values usually don’t broadcast. For example, np.sum, np.mean, and np.median don’t broadcast.
Tip
Broadcasting is the counterpart to comprehensions (introduced in Comprehensions). Both are highly efficient. Generally, you should:
Use broadcasting with data structures that support it, such as series and NumPy arrays (explained in NumPy Arrays).
Use comprehensions with lists and Python’s other built-in data structures (explained in Built-in Data Structures).
A function can broadcast across multiple arguments. To demonstrate this, suppose we want to estimate the number of nests per breeding pair for the least terns data. The total_nests column contains the total number of nests at each site, and the bp_max column contains the maximum reported number of breeding pairs. So to compute nests per breeding pair:
terns["total_nests"] / terns["bp_max"]
total_nests |
---|
f64 |
1.0 |
1.666667 |
1.036545 |
1.0 |
1.0 |
… |
1.113354 |
1.157895 |
1.092593 |
1.170732 |
1.036364 |
The elements are paired up and divided according to their positions. Notice that the result is a Float64 series. The total_nests column is an Int64 series, so besides broadcasting, the example also demonstrates that series are subject to implicit coercion (introduced in Coercion & Casting).
If you try to broadcast a function across two series of different lengths, Polars raises an error:
x = pl.Series([1, 2])
y = pl.Series([9, 8, 7])
x - y
---------------------------------------------------------------------------
InvalidOperationError Traceback (most recent call last)
Cell In[56], line 3
1 x = pl.Series([1, 2])
2 y = pl.Series([9, 8, 7])
----> 3 x - y
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/series/series.py:1077, in Series.__sub__(self, other)
1075 if self.dtype.is_decimal() and isinstance(other, (float, int)):
1076 return self.to_frame().select(F.col(self.name) - other).to_series()
-> 1077 return self._arithmetic(other, "sub", "sub_<>")
File ~/mill/datalab/teaching/python_basics/.pixi/envs/default/lib/python3.13/site-packages/polars/series/series.py:1011, in Series._arithmetic(self, other, op_s, op_ffi)
1008 other = pl.Series("", [None])
1010 if isinstance(other, Series):
-> 1011 return self._from_pyseries(getattr(self._s, op_s)(other._s))
1012 elif _check_for_numpy(other) and isinstance(other, np.ndarray):
1013 return self._from_pyseries(getattr(self._s, op_s)(Series(other)._s))
InvalidOperationError: cannot do arithmetic operation on series of different lengths: got 2 and 3
2.4. Other Data Structures#
In this section, you’ll learn about several one-dimensional data structures that are fundamental to programming in Python.
2.4.1. Built-in Data Structures#
Besides lists, Python provides several other useful data structures:
Like lists, tuples are ordered and heterogeneous. The main difference is that tuples are immutable: once you create a tuple, you can’t change it. This makes tuples safer and more efficient than lists. You can make a tuple by enclosing comma-separated values in parentheses (): (True, 1, "hi"). You can cast other data structures to a tuple with the tuple function. Use a tuple when the number of elements is constant and known in advance.
A set is unordered and heterogeneous. As in a mathematical set, the elements in a set must be unique. Python automatically discards any duplicates added to a set. Sets support set-theoretic operations such as unions and intersections. You can make a set by enclosing comma-separated values in curly braces {}: {True, 1, "hi"}. You can convert other data structures to a set with the set function. Use a set when you need a guarantee that the elements are unique.
A dict is an ordered, heterogeneous collection of key-value pairs. Keys must be distinct, and many different types of keys are valid. The indexing operator [] gets elements by key rather than position. You can make a dict by enclosing comma-separated key: value pairs in curly braces {}: {"hi": -3.5}. Use a dict when you need to index elements by something other than position or need a mapping from one collection of data to another.
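Here is a quick tour of all three structures; the names (point, colors, ages) and values are made up for illustration:

```python
# Tuple: fixed-length and immutable; index by position.
point = (3, 4)
point[0]                           # 3

# Set: duplicates are discarded automatically.
colors = {"red", "blue", "red"}
len(colors)                        # 2

# Dict: index by key rather than position.
ages = {"rosa": 31, "miguel": 27}
ages["rosa"]                       # 31
```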
See also
Python’s official documentation provides more details about what you can do with these data structures.
2.4.2. NumPy Arrays#
An array (or ndarray) is an ordered, homogeneous data structure, similar to a series. Arrays are a fundamental data structure in NumPy.
You can create an array with the np.array
function and a list of elements:
x = np.array([10, 20, 30])
x
array([10, 20, 30])
You can convert an array into a series with the pl.Series
function:
pl.Series(x)
i64 |
10 |
20 |
30 |
Conversely, you can convert a series to an array with the .to_numpy
method:
terns["total_nests"][:5].to_numpy()
array([ 15, 20, 312, 3, 5])
Tip
Series tend to be a good choice for data analysis, while arrays tend to be a good choice for sophisticated mathematical computations (such as simulations).
Note
NumPy uses its own data types for array elements, for many of the same reasons Polars does for series elements. The NumPy documentation has more details.
NumPy is primarily designed for numerical computing, so working with strings in NumPy can be tricky. See the documentation for details about its string types. If you need to work with strings, Polars is more convenient than NumPy.
2.5. Special Values#
2.5.1. None#
In Python, None
represents an absent or undefined value. It is useful:
As a way to explicitly indicate a value is absent.
As the return value for functions that are useful for their side effects and don’t need to return anything.
As a default argument for optional parameters in functions.
For example, Python’s built-in print function, which prints a string to the console, returns None:
print("Hello!")
Hello!
The Python console doesn’t print anything when an expression produces None:
None
None is the only value of type NoneType:
type(None)
NoneType
You can check if a value is None
with Python’s is
keyword:
x = None
x is None
True
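The default-argument use deserves a small sketch. The function below is hypothetical, but shows the common pattern of accepting None as a default and checking for it with is:

```python
def describe(values, label=None):
    # If the caller didn't provide a label, fall back to a generic one.
    if label is None:
        label = "data"
    return f"{label}: {len(values)} values"

describe([1, 2, 3])               # 'data: 3 values'
describe([1, 2], label="nests")   # 'nests: 2 values'
```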
2.5.2. Missing Values#
In the least terns data set, notice that some of the entries are null. For instance, look at the second element of the nonpred_eggs column:
terns.head()
year | site_name | site_name_2013_2018 | site_name_1988_2001 | site_abbr | region_3 | region_4 | event | bp_min | bp_max | fl_min | fl_max | total_nests | nonpred_eggs | nonpred_chicks | nonpred_fl | nonpred_ad | pred_control | pred_eggs | pred_chicks | pred_fl | pred_ad | pred_pefa | pred_coy_fox | pred_meso | pred_owlspp | pred_corvid | pred_other_raptor | pred_other_avian | pred_misc | total_pefa | total_coy_fox | total_meso | total_owlspp | total_corvid | total_other_raptor | total_other_avian | total_misc | first_observed | last_observed | first_nest | first_chick | first_fledge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | str | str | str | f64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | str | i64 | i64 | i64 | i64 | str | str | str | str | str | str | str | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | str | str | str | str | str |
2000 | "PITTSBURG POWER PLANT" | "Pittsburg Power Plant" | "NA_2013_2018 POLYGON" | "PITT_POWER" | "S.F._BAY" | "S.F._BAY" | "LA_NINA" | 15.0 | 15.0 | 16 | 18 | 15 | 3 | 0 | 0 | 0 | null | 4 | 2 | 0 | 0 | "N" | "N" | "N" | "N" | "Y" | "Y" | "N" | "N" | 0 | 0 | 0 | 0 | 4 | 2 | 0 | 0 | "2000-05-11" | "2000-08-05" | "2000-05-26" | "2000-06-18" | "2000-07-08" |
2000 | "ALBANY CENTRAL AVE" | "NA_NO POLYGON" | "Albany Central Avenue" | "AL_CENTAVE" | "S.F._BAY" | "S.F._BAY" | "LA_NINA" | 6.0 | 12.0 | 1 | 1 | 20 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "ALAMEDA POINT" | "Alameda Point" | "NA_2013_2018 POLYGON" | "ALAM_PT" | "S.F._BAY" | "S.F._BAY" | "LA_NINA" | 282.0 | 301.0 | 200 | 230 | 312 | 124 | 81 | 2 | 1 | null | 17 | 0 | 0 | 0 | "N" | "N" | "N" | "N" | "N" | "Y" | "Y" | "N" | 0 | 0 | 0 | 0 | 0 | 6 | 11 | 0 | "2000-05-01" | "2000-08-19" | "2000-05-16" | "2000-06-07" | "2000-06-30" |
2000 | "KETTLEMAN CITY" | "Kettleman" | "NA_2013_2018 POLYGON" | "KET_CTY" | "KINGS" | "KINGS" | "LA_NINA" | 2.0 | 3.0 | 1 | 2 | 3 | null | 3 | 1 | 6 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | "2000-06-10" | "2000-09-24" | "2000-06-17" | "2000-07-22" | "2000-08-06" |
2000 | "OCEANO DUNES STATE VEHICULAR R… | "Oceano Dunes State Vehicular R… | "NA_2013_2018 POLYGON" | "OCEANO_DUNES" | "CENTRAL" | "CENTRAL" | "LA_NINA" | 4.0 | 5.0 | 4 | 4 | 5 | 2 | 0 | 0 | 0 | null | 0 | 4 | 0 | 0 | "N" | "N" | "N" | "N" | "N" | "N" | "Y" | "N" | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | "2000-05-04" | "2000-08-30" | "2000-05-28" | "2000-06-20" | "2000-07-13" |
Polars uses null, called the missing value, to represent missing entries in a data set. Entries are usually missing because of how the data was collected, although there are exceptions. As an example, imagine the data came from a survey, and respondents chose not to answer some questions. In the data set, their answers for those questions can be recorded as null.
The missing value null is a chameleon: it can be an element of a series of any type. Polars implicitly converts null to and from None when you get or set an element in a series. This means you can use None to create a series with null elements:
x = pl.Series([1, 2, None])
x
i64 |
1 |
2 |
null |
And you get back None
if you access a null
element:
terns["nonpred_eggs"][1]
The missing value null
is also contagious: it represents an unknown quantity,
so computing on it usually produces another missing value. The idea is that if
the inputs to a computation are unknown, generally so is the output:
x - 3
i64 |
-2 |
-1 |
null |
Polars makes an exception for aggregation functions, which automatically filter out missing values:
x.mean()
1.5
You can use the .is_null method to test if elements of a series are null:
x.is_null()
bool |
false |
false |
true |
Polars also provides an .is_not_null
method and a .fill_null
method to fill
missing values with a different value.
2.5.3. Infinity#
NumPy (and Polars) use np.inf to represent infinity. It is of type float. You’re most likely to encounter it as the result of certain computations:
pl.Series([13]) / 0
f64 |
inf |
You can use the .is_infinite method to test if elements of a series are infinite:
x = pl.Series([1.0, 2.0, np.inf])
x.is_infinite()
bool |
false |
false |
true |
2.5.4. Not a Number#
NumPy (and Polars) use np.nan, called not a number and also written NaN, to represent mathematically undefined results. It is of type float. As an example, dividing 0 by 0 is undefined:
pl.Series([0]) / 0
f64 |
NaN |
You can use the .is_nan method to test if elements of a series are NaN:
x = pl.Series([0, 1, 2]) / 0
x.is_nan()
bool |
true |
false |
false |
2.6. Data Frames#
2.6.1. Selecting Columns#
An excellent starting point for selecting and transforming columns in a data frame is the .select method. Summarizing Columns already showed how to select a single column by name with the indexing operator [], but the .select method is much more flexible. You can use it to select multiple columns at once, by name or type, and can transform or rename them.
As with the indexing operator, you can use .select
to select a single column
by providing the column name as an argument. Here’s an example (with .head
to
limit the output):
terns.select("year").head()
year |
---|
i64 |
2000 |
2000 |
2000 |
2000 |
2000 |
Unlike the indexing operator, .select
returns a data frame rather than a
series.
You can also select multiple columns this way:
terns.select("year", "site_name").head()
year | site_name |
---|---|
i64 | str |
2000 | "PITTSBURG POWER PLANT" |
2000 | "ALBANY CENTRAL AVE" |
2000 | "ALAMEDA POINT" |
2000 | "KETTLEMAN CITY" |
2000 | "OCEANO DUNES STATE VEHICULAR R… |
The .select method is flexible because it can evaluate a Polars expression: instructions for how to select or transform data. One way to create an expression is with the pl.col function, which represents a column or set of columns. So another way to select the year and site_name columns from the least terns data is:
terns.select(
pl.col("year", "site_name")
).head()
year | site_name |
---|---|
i64 | str |
2000 | "PITTSBURG POWER PLANT" |
2000 | "ALBANY CENTRAL AVE" |
2000 | "ALAMEDA POINT" |
2000 | "KETTLEMAN CITY" |
2000 | "OCEANO DUNES STATE VEHICULAR R… |
An advantage of using pl.col
is that you’re not limited to selecting columns
by name: you can also select columns by type. Here’s how to get only the
Int64
and Float64
columns in the least terns data:
terns.select(
pl.col(pl.Int64, pl.Float64)
).head()
year | bp_min | bp_max | fl_min | fl_max | total_nests | nonpred_eggs | nonpred_chicks | nonpred_fl | nonpred_ad | pred_eggs | pred_chicks | pred_fl | pred_ad | total_pefa | total_coy_fox | total_meso | total_owlspp | total_corvid | total_other_raptor | total_other_avian | total_misc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | f64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
2000 | 15.0 | 15.0 | 16 | 18 | 15 | 3 | 0 | 0 | 0 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 2 | 0 | 0 |
2000 | 6.0 | 12.0 | 1 | 1 | 20 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | 282.0 | 301.0 | 200 | 230 | 312 | 124 | 81 | 2 | 1 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 11 | 0 |
2000 | 2.0 | 3.0 | 1 | 2 | 3 | null | 3 | 1 | 6 | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | 4.0 | 5.0 | 4 | 4 | 5 | 2 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |
Selecting columns this way is useful for doing things like computing summaries of only numeric columns. In fact, to compute widely-used summaries, like the mean, all you need to do is call the corresponding method on the Polars expression:
terns.select(
pl.col(pl.Int64, pl.Float64).mean()
).head()
year | bp_min | bp_max | fl_min | fl_max | total_nests | nonpred_eggs | nonpred_chicks | nonpred_fl | nonpred_ad | pred_eggs | pred_chicks | pred_fl | pred_ad | total_pefa | total_coy_fox | total_meso | total_owlspp | total_corvid | total_other_raptor | total_other_avian | total_misc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
2013.082174 | 129.319923 | 151.043423 | 40.815148 | 50.349166 | 162.845466 | 60.291866 | 44.372681 | 4.181488 | 0.850987 | 41.574074 | 8.518519 | 2.365385 | 2.689655 | 1.740741 | 9.464286 | 5.555556 | 1.454545 | 7.961538 | 1.711538 | 8.897959 | 6.566038 |
A third way to select columns with pl.col is with a pattern. Patterns are strings, and must begin with a caret ^ and end with a dollar sign $. Within a pattern, you can use .* as a wildcard that matches any characters.
Note
Technically, Polars’ patterns are regular expressions, a widely-used language for describing patterns in text. You can learn more about regular expressions in the Date & String Processing chapter of DataLab’s Intermediate Python workshop reader.
As a motivating example for patterns, most of the columns with names that start with pred_ are categorical but currently have string elements. It would be good to cast them to the Categorical type. To begin, select all of the columns with names that start with pred_:
terns.select(
pl.col("^pred_.*$")
).head()
pred_control | pred_eggs | pred_chicks | pred_fl | pred_ad | pred_pefa | pred_coy_fox | pred_meso | pred_owlspp | pred_corvid | pred_other_raptor | pred_other_avian | pred_misc |
---|---|---|---|---|---|---|---|---|---|---|---|---|
str | i64 | i64 | i64 | i64 | str | str | str | str | str | str | str | str |
null | 4 | 2 | 0 | 0 | "N" | "N" | "N" | "N" | "Y" | "Y" | "N" | "N" |
null | null | null | null | null | null | null | null | null | null | null | null | null |
null | 17 | 0 | 0 | 0 | "N" | "N" | "N" | "N" | "N" | "Y" | "Y" | "N" |
null | null | null | null | null | null | null | null | null | null | null | null | null |
null | 0 | 4 | 0 | 0 | "N" | "N" | "N" | "N" | "N" | "N" | "Y" | "N" |
There are a few columns in the result, such as pred_eggs, with Int64 elements. These columns aren’t categorical, so we should exclude them before casting. You can exclude columns from an expression with the .exclude method:
terns.select(
pl.col("^pred_.*$").exclude(pl.Int64).cast(pl.Categorical)
).head()
pred_control | pred_pefa | pred_coy_fox | pred_meso | pred_owlspp | pred_corvid | pred_other_raptor | pred_other_avian | pred_misc |
---|---|---|---|---|---|---|---|---|
cat | cat | cat | cat | cat | cat | cat | cat | cat |
null | "N" | "N" | "N" | "N" | "Y" | "Y" | "N" | "N" |
null | null | null | null | null | null | null | null | null |
null | "N" | "N" | "N" | "N" | "N" | "Y" | "Y" | "N" |
null | null | null | null | null | null | null | null | null |
null | "N" | "N" | "N" | "N" | "N" | "N" | "Y" | "N" |
To make this change permanent, we need to reassign the terns
data frame. The
.select
method only returns the selected columns, so assigning the result to
terns
would mean losing all of the other columns.
Instead of using .select, you can use .with_columns to transform some columns but return all of the columns. In all other respects, .with_columns works the same way as .select. So to make the cast permanent:
terns = terns.with_columns(
pl.col("^pred_.*$").exclude(pl.Int64).cast(pl.Categorical)
)
terns.head()
year | site_name | site_name_2013_2018 | site_name_1988_2001 | site_abbr | region_3 | region_4 | event | bp_min | bp_max | fl_min | fl_max | total_nests | nonpred_eggs | nonpred_chicks | nonpred_fl | nonpred_ad | pred_control | pred_eggs | pred_chicks | pred_fl | pred_ad | pred_pefa | pred_coy_fox | pred_meso | pred_owlspp | pred_corvid | pred_other_raptor | pred_other_avian | pred_misc | total_pefa | total_coy_fox | total_meso | total_owlspp | total_corvid | total_other_raptor | total_other_avian | total_misc | first_observed | last_observed | first_nest | first_chick | first_fledge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | str | str | str | f64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | cat | i64 | i64 | i64 | i64 | cat | cat | cat | cat | cat | cat | cat | cat | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | str | str | str | str | str |
2000 | "PITTSBURG POWER PLANT" | "Pittsburg Power Plant" | "NA_2013_2018 POLYGON" | "PITT_POWER" | "S.F._BAY" | "S.F._BAY" | "LA_NINA" | 15.0 | 15.0 | 16 | 18 | 15 | 3 | 0 | 0 | 0 | null | 4 | 2 | 0 | 0 | "N" | "N" | "N" | "N" | "Y" | "Y" | "N" | "N" | 0 | 0 | 0 | 0 | 4 | 2 | 0 | 0 | "2000-05-11" | "2000-08-05" | "2000-05-26" | "2000-06-18" | "2000-07-08" |
2000 | "ALBANY CENTRAL AVE" | "NA_NO POLYGON" | "Albany Central Avenue" | "AL_CENTAVE" | "S.F._BAY" | "S.F._BAY" | "LA_NINA" | 6.0 | 12.0 | 1 | 1 | 20 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "ALAMEDA POINT" | "Alameda Point" | "NA_2013_2018 POLYGON" | "ALAM_PT" | "S.F._BAY" | "S.F._BAY" | "LA_NINA" | 282.0 | 301.0 | 200 | 230 | 312 | 124 | 81 | 2 | 1 | null | 17 | 0 | 0 | 0 | "N" | "N" | "N" | "N" | "N" | "Y" | "Y" | "N" | 0 | 0 | 0 | 0 | 0 | 6 | 11 | 0 | "2000-05-01" | "2000-08-19" | "2000-05-16" | "2000-06-07" | "2000-06-30" |
2000 | "KETTLEMAN CITY" | "Kettleman" | "NA_2013_2018 POLYGON" | "KET_CTY" | "KINGS" | "KINGS" | "LA_NINA" | 2.0 | 3.0 | 1 | 2 | 3 | null | 3 | 1 | 6 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | "2000-06-10" | "2000-09-24" | "2000-06-17" | "2000-07-22" | "2000-08-06" |
2000 | "OCEANO DUNES STATE VEHICULAR R… | "Oceano Dunes State Vehicular R… | "NA_2013_2018 POLYGON" | "OCEANO_DUNES" | "CENTRAL" | "CENTRAL" | "LA_NINA" | 4.0 | 5.0 | 4 | 4 | 5 | 2 | 0 | 0 | 0 | null | 0 | 4 | 0 | 0 | "N" | "N" | "N" | "N" | "N" | "N" | "Y" | "N" | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | "2000-05-04" | "2000-08-30" | "2000-05-28" | "2000-06-20" | "2000-07-13" |
Tip
In general, choose:
.select if you only want to get back the selected columns.
.with_columns if you want to get back all of the columns.
The .select method is also useful for testing expressions before switching to the .with_columns method.
You can use columns to transform other columns. As a final example, suppose we
want to compute eggs per breeding pair and nests per breeding pair for the
least terns data. Non-predated egg counts are in the nonpred_eggs
column and
nest counts are in the total_nests
column. For now, we’ll use the maximum
reported breeding pairs, bp_max
, as the number of breeding pairs:
terns.select(
pl.col("nonpred_eggs", "total_nests") / pl.col("bp_max")
).head()
nonpred_eggs | total_nests |
---|---|
f64 | f64 |
0.2 | 1.0 |
null | 1.666667 |
0.41196 | 1.036545 |
null | 1.0 |
0.4 | 1.0 |
There’s also a bp_min column with the minimum reported breeding pairs. To be thorough, we should compute the rates with both bp_min and bp_max, not just bp_max. We’ll also need to rename the resulting columns, so that each column has a unique name. You can use the .alias method to rename a single column, or the .name.prefix and .name.suffix methods, respectively, to prefix or suffix a column’s name. Let’s add a suffix to the column names to identify which breeding pair column was used:
terns.select(
(
pl.col("nonpred_eggs", "total_nests") / pl.col("bp_max")
).name.suffix("_per_bp_max"),
(
pl.col("nonpred_eggs", "total_nests") / pl.col("bp_min")
).name.suffix("_per_bp_min")
).head()
nonpred_eggs_per_bp_max | total_nests_per_bp_max | nonpred_eggs_per_bp_min | total_nests_per_bp_min |
---|---|---|---|
f64 | f64 | f64 | f64 |
0.2 | 1.0 | 0.2 | 1.0 |
null | 1.666667 | null | 3.333333 |
0.41196 | 1.036545 | 0.439716 | 1.106383 |
null | 1.0 | null | 1.5 |
0.4 | 1.0 | 0.5 | 1.25 |
See also
Much more is possible with Polars expressions and the .select
and
.with_columns
methods. See the Polars User Guide for details.
2.6.2. Filtering Rows#
Filtering the rows of a data frame is the counterpart to selecting columns. The .filter method filters rows based on one or more conditions: expressions that evaluate to a series of Boolean values.
As an example, suppose we want to find all sites in the least terns data where
the number of nests in the total_nests
column is greater than 5. Here’s the
code:
terns.filter(pl.col("total_nests") > 5).head()
year | site_name | site_name_2013_2018 | site_name_1988_2001 | site_abbr | region_3 | region_4 | event | bp_min | bp_max | fl_min | fl_max | total_nests | nonpred_eggs | nonpred_chicks | nonpred_fl | nonpred_ad | pred_control | pred_eggs | pred_chicks | pred_fl | pred_ad | pred_pefa | pred_coy_fox | pred_meso | pred_owlspp | pred_corvid | pred_other_raptor | pred_other_avian | pred_misc | total_pefa | total_coy_fox | total_meso | total_owlspp | total_corvid | total_other_raptor | total_other_avian | total_misc | first_observed | last_observed | first_nest | first_chick | first_fledge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | str | str | str | f64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | cat | i64 | i64 | i64 | i64 | cat | cat | cat | cat | cat | cat | cat | cat | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | str | str | str | str | str |
2000 | "PITTSBURG POWER PLANT" | "Pittsburg Power Plant" | "NA_2013_2018 POLYGON" | "PITT_POWER" | "S.F._BAY" | "S.F._BAY" | "LA_NINA" | 15.0 | 15.0 | 16 | 18 | 15 | 3 | 0 | 0 | 0 | null | 4 | 2 | 0 | 0 | "N" | "N" | "N" | "N" | "Y" | "Y" | "N" | "N" | 0 | 0 | 0 | 0 | 4 | 2 | 0 | 0 | "2000-05-11" | "2000-08-05" | "2000-05-26" | "2000-06-18" | "2000-07-08" |
2000 | "ALBANY CENTRAL AVE" | "NA_NO POLYGON" | "Albany Central Avenue" | "AL_CENTAVE" | "S.F._BAY" | "S.F._BAY" | "LA_NINA" | 6.0 | 12.0 | 1 | 1 | 20 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
2000 | "ALAMEDA POINT" | "Alameda Point" | "NA_2013_2018 POLYGON" | "ALAM_PT" | "S.F._BAY" | "S.F._BAY" | "LA_NINA" | 282.0 | 301.0 | 200 | 230 | 312 | 124 | 81 | 2 | 1 | null | 17 | 0 | 0 | 0 | "N" | "N" | "N" | "N" | "N" | "Y" | "Y" | "N" | 0 | 0 | 0 | 0 | 0 | 6 | 11 | 0 | "2000-05-01" | "2000-08-19" | "2000-05-16" | "2000-06-07" | "2000-06-30" |
2000 | "RANCHO GUADALUPE DUNES PRESERV… | "Rancho Guadalupe Dunes Preserv… | "NA_2013_2018 POLYGON" | "RGDP" | "CENTRAL" | "CENTRAL" | "LA_NINA" | 9.0 | 9.0 | 17 | 17 | 9 | 0 | 1 | 0 | 0 | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | "2000-05-07" | "2000-08-13" | "2000-05-31" | "2000-06-22" | "2000-07-20" |
2000 | "VANDENBERG SFB" | "Vandenberg AFB" | "NA_2013_2018 POLYGON" | "VAN_SFB" | "CENTRAL" | "CENTRAL" | "LA_NINA" | 30.0 | 32.0 | 11 | 11 | 32 | null | 27 | 0 | 0 | null | 0 | 3 | 0 | 0 | "N" | "N" | "N" | "N" | "N" | "Y" | "N" | "N" | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | "2000-05-07" | "2000-08-17" | "2000-05-28" | "2000-06-20" | "2000-07-15" |
If we only want the site names, we can chain this with a call to .select
and
use the .unique
method on the site_name
column:
terns.filter(
pl.col("total_nests") > 5
).select(
pl.col("site_name").unique()
)
site_name |
---|
str |
"NAVAL AMPHIBIOUS BASE CORONADO" |
"LA HARBOR" |
"SAN PABLO BAY NWR" |
"HOLLYWOOD BEACH" |
"BATIQUITOS LAGOON ECOLOGICAL R… |
… |
"PITTSBURG POWER PLANT" |
"SANTA CLARA RIVER MCGRATH STAT… |
"SEAL BEACH NWR ANAHEIM BAY" |
"EDEN LANDING ECOLOGICAL RESERV… |
"MALIBU LAGOON" |
We can conclude that there are 40 sites that had more than 5 nests at some point.
2.6.2.1. Logic Operators#
Series with Boolean elements, such as conditions, can be inverted or combined with logic operators. All of the logic operators broadcast to series elements. For demonstration, we’ll use the following series:
x1 = pl.Series([True, False, True, False])
x2 = pl.Series([True, True, False, False])
The NOT operator ~ inverts values, so True becomes False and False becomes True:
~x1
bool |
false |
true |
false |
true |
The OR operator | combines two values, returning True unless both values are False:
x1 | x2
bool |
true |
true |
true |
false |
The AND operator & combines two values, returning False unless both values are True:
x1 & x2
bool |
true |
false |
false |
false |
Caution
The logic operators ~, |, and & only work on Polars series, NumPy arrays, and other homogeneous data structures. If you use them on Python’s built-in bool values, Python may return an unexpected result or produce an error. Python instead uses the keywords not, or, and and as the respective logic operators on bool values. Polars, NumPy, and other packages don’t use these keywords because Python doesn’t allow their behavior to be customized for data structures.
2.6.2.2. Multiple Conditions#
As a final example, let's filter the least terns data with multiple conditions. We'll get all rows for 2023 where there were more than 10 fledglings reported. We'll use the fl_min column for the minimum reported fledgling count. Here's the call to .filter:
terns.filter(
(pl.col("year") == 2023) &
(pl.col("fl_min") > 10)
)
year | site_name | site_name_2013_2018 | site_name_1988_2001 | site_abbr | region_3 | region_4 | event | bp_min | bp_max | fl_min | fl_max | total_nests | nonpred_eggs | nonpred_chicks | nonpred_fl | nonpred_ad | pred_control | pred_eggs | pred_chicks | pred_fl | pred_ad | pred_pefa | pred_coy_fox | pred_meso | pred_owlspp | pred_corvid | pred_other_raptor | pred_other_avian | pred_misc | total_pefa | total_coy_fox | total_meso | total_owlspp | total_corvid | total_other_raptor | total_other_avian | total_misc | first_observed | last_observed | first_nest | first_chick | first_fledge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | str | str | str | f64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | cat | i64 | i64 | i64 | i64 | cat | cat | cat | cat | cat | cat | cat | cat | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | str | str | str | str | str |
2023 | "ALAMEDA POINT" | "Alameda Point" | "NA_2013_2018 POLYGON" | "ALAM_PT" | "S.F._BAY" | "S.F._BAY" | "LA_NINA" | 169.0 | 339.0 | 109 | 131 | 339 | 112 | 56 | 5 | 0 | "Y" | null | null | null | null | "Y" | "N" | "N" | "N" | "Y" | "N" | "Y" | "N" | null | null | null | null | null | null | null | null | "2023-04-28" | "2023-09-09" | "2023-06-17" | "2023-07-09" | null |
2023 | "HAYWARD REGIONAL SHORELINE" | "Hayward Regional Shoreline" | "NA_2013_2018 POLYGON" | "HAY_REG_SHOR" | "S.F._BAY" | "S.F._BAY" | "LA_NINA" | 88.0 | 143.0 | 127 | 130 | 144 | 17 | 1 | 0 | 0 | "Y" | null | null | null | null | "Y" | "N" | "Y" | "N" | "N" | "N" | "Y" | "Y" | null | null | null | null | null | null | null | null | "2023-05-01" | "2023-09-11" | "2023-05-30" | "2023-06-17" | null |
2023 | "OCEANO DUNES STATE VEHICULAR R… | "Oceano Dunes State Vehicular R… | "NA_2013_2018 POLYGON" | "OCEANO_DUNES" | "CENTRAL" | "CENTRAL" | "LA_NINA" | 40.0 | 42.0 | 35 | 35 | 42 | 8 | 2 | 0 | 0 | "Y" | null | null | null | null | "Y" | "N" | "N" | "N" | "N" | "N" | "N" | "N" | null | null | null | null | null | null | null | null | "2023-05-07" | "2023-08-20" | "2023-06-01" | "2023-06-19" | null |
2023 | "VANDENBERG SFB" | "Vandenberg AFB" | "NA_2013_2018 POLYGON" | "VAN_SFB" | "CENTRAL" | "CENTRAL" | "LA_NINA" | 33.0 | 39.0 | 17 | 17 | 42 | 3 | 28 | 0 | 0 | "Y" | null | null | null | null | "N" | "N" | "N" | "N" | "N" | "N" | "Y" | "N" | null | null | null | null | null | null | null | null | "2023-05-11" | "2023-08-26" | "2023-06-01" | "2023-06-25" | null |
2023 | "NBVC POINT MUGU" | "NBVC Point Mugu" | "NA_2013_2018 POLYGON" | "PT_MUGU" | "SOUTHERN" | "VENTURA" | "LA_NINA" | 168.0 | 177.0 | 51 | 100 | 188 | 40 | 3 | 0 | 1 | "Y" | null | null | null | null | "N" | "Y" | "N" | "Y" | "N" | "N" | "N" | "Y" | null | null | null | null | null | null | null | null | "2023-05-03" | "2023-09-02" | "2023-05-21" | "2023-06-14" | null |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2023 | "BATIQUITOS LAGOON ECOLOGICAL R… | "Batiquitos Lagoon" | "NA_2013_2018 POLYGON" | "BLER" | "SOUTHERN" | "SOUTHERN" | "LA_NINA" | 205.0 | 218.0 | 77 | 89 | 228 | 67 | 94 | 7 | 0 | "Y" | null | null | null | null | "Y" | "N" | "N" | "Y" | "N" | "Y" | "N" | "Y" | null | null | null | null | null | null | null | null | "2023-04-27" | "2023-08-03" | "2023-05-11" | "2023-06-05" | null |
2023 | "SAN DIEGUITO LAGOON ECOLOGICAL… | "San Diequito Lagoon" | "NA_2013_2018 POLYGON" | "SANDIEGU_LAG" | "SOUTHERN" | "SOUTHERN" | "LA_NINA" | 34.0 | 47.0 | 16 | 22 | 47 | 0 | 0 | 0 | 0 | "Y" | null | null | null | null | "Y" | "Y" | "N" | "Y" | "N" | "Y" | "N" | "N" | null | null | null | null | null | null | null | null | "2023-05-24" | "2023-09-02" | "2023-05-24" | null | null |
2023 | "MISSION BAY FAA ISLAND" | "FAA Island" | "NA_2013_2018 POLYGON" | "MB_FAA" | "SOUTHERN" | "SOUTHERN" | "LA_NINA" | 141.0 | 144.0 | 44 | 48 | 156 | 50 | 36 | 4 | 0 | "Y" | null | null | null | null | "N" | "N" | "N" | "N" | "N" | "N" | "Y" | "Y" | null | null | null | null | null | null | null | null | "2023-04-23" | "2023-08-17" | "2023-05-10" | "2023-05-28" | null |
2023 | "NAVAL AMPHIBIOUS BASE CORONADO" | "Naval Base Coronado" | "NA_2013_2018 POLYGON" | "NAB" | "SOUTHERN" | "SOUTHERN" | "LA_NINA" | 596.0 | 644.0 | 90 | 128 | 717 | 329 | 185 | 6 | 6 | "Y" | null | null | null | null | "N" | "N" | "N" | "N" | "Y" | "N" | "Y" | "Y" | null | null | null | null | null | null | null | null | "2023-04-22" | "2023-09-09" | "2023-05-07" | "2023-05-31" | null |
2023 | "TIJUANA ESTUARY NERR" | "Tijuana Estuary" | "NA_2013_2018 POLYGON" | "TJ_RIV" | "SOUTHERN" | "SOUTHERN" | "LA_NINA" | 144.0 | 165.0 | 35 | 35 | 171 | 65 | 44 | 1 | 1 | "Y" | null | null | null | null | "N" | "N" | "N" | "N" | "N" | "N" | "Y" | "Y" | null | null | null | null | null | null | null | null | "2023-04-26" | "2023-08-28" | "2023-05-12" | "2023-06-10" | null |
This gives us 14 sites with more than 10 fledglings in 2023. Notice that we had to put the conditions in parentheses () so that Python gets the order of operations right: the & operator binds more tightly than comparisons like == and >, so without parentheses Python would try to evaluate 2023 & pl.col("fl_min") first.
2.7. Exercises#
2.7.1. Exercise#
Python's range function offers another way to create a sequence of numbers. Read the help file for this function.

Create an example range. How does this differ from a list?
Describe the three arguments that you can use in range. Give examples of each.
Convert one of those ranges to a list and print it to screen. What changes in the way Python represents this sequence?
2.7.2. Exercise#
Return to the discussion in Coercion & Casting.

Why does "3" + 4 raise an error?
Why does True - 1 return 0?
Why does int(4.6) < 4.6 return True?
2.7.3. Exercise#
Create a new data frame from the least terns data with the following characteristics:
Each entry’s year is between 2010 and 2019 (inclusive).
Each entry reports at least 100 breeding pairs.
The columns are year, site_name, bp_min, bp_max, and total_nests.
Use this data frame for the remaining questions.
Count the number of entries for each site. How many sites have at least 100 breeding pairs across all 10 years?
Which site-year combination has the highest number of nests?