2. Iterator & Generator Crash Course#
Learning Objectives
- Explain and use for-loops and comprehensions
- Use the sys.getsizeof function to get the size of an object in memory
- Explain what classes are and how they're typically used
- Create new classes
- Explain what methods and special methods are
- Identify some common special methods
- Define special methods for a class to customize behavior
- Explain what an iterator is, what an iterable is, and how they differ
- Use the reversed, sorted, enumerate, and zip functions
- Explain what the itertools module is
- Explain and use generator functions and generator expressions
- Explain the advantages and disadvantages of generators
This chapter will help you understand how Python objects and classes work at a lower level, particularly iterators and generators. Objects are the atoms of Python, and knowing how to iterate over data is a fundamental programming skill. Generators provide an elegant way to build complex data processing pipelines that can efficiently handle big data and streaming data. Along the way, the chapter also introduces a few of Python’s advanced features.
2.1. Prerequisites#
This chapter assumes you already have basic familiarity with Python. DataLab’s Python Basics Reader and its accompanying workshop provide a suitable introduction.
To follow along, you’ll need the following software versions (or newer) installed on your computer:
Python 3.10
One way to install these is with the Anaconda Python distribution. Chapter 2 provides more details about Anaconda and the conda package manager.
CLICK HERE to get the data set used in the final example for
this chapter. The data set is in a .tar.bz2
archive, which you’ll need to
decompress. If your computer doesn’t already have a program that can decompress
.tar.bz2
files, you can use 7-Zip on Windows or Keka on Mac OS X.
2.2. Iteration Review#
In the context of programming, iteration means running a block of code repeatedly. Each repetition is a single iteration. The inputs usually change from one iteration to the next, so that each iteration carries out a different computation. Thus it’s often useful to think about iteration in terms of iterating over a collection of inputs, with one iteration per input value.
The primary way to iterate over a collection of values in Python is with a for-loop. If you haven’t used for-loops before or need more than a quick review, see this introduction to loops in DataLab’s Python Basics Reader.
You can use for-loops with many different types of objects. For example, for-loops iterate over the elements of a list:
for i in [10, 20, 30]:
    print(i)
10
20
30
For dictionaries, for-loops iterate over the keys:
x = {"a": 10, "b": 20, "c": 30}
for k in x:
print(k, ":", x[k])
a : 10
b : 20
c : 30
For-loops iterate over the keys rather than the values in order to be consistent with Python’s syntax for checking whether a key is present in a dictionary:
"a" in x
True
Every dictionary has a .values
method to provide access to just the values,
and a .items
method to provide access to key-value pairs (as tuples).
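For instance, here's a quick sketch of the same loop rewritten with the .items method, so that each key-value pair is unpacked directly:
for k, v in x.items():
    print(k, ":", v)
a : 10
b : 20
c : 30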
For-loops also iterate over the characters in a string:
for i in "python":
print(i)
p
y
t
h
o
n
These examples raise a question: how are for-loops able to iterate over so many different types of objects?
2.2.1. Ranges#
You can also use for-loops to iterate over the numbers in a range:
x = range(5)
for i in x:
    print(i)
0
1
2
3
4
A range
object keeps track of where the range starts and ends, but not the
values in between:
x
range(0, 5)
You can confirm this by checking how much memory Python uses to store a range compared to an equivalent list. The sys.getsizeof function returns the size of an object in bytes:
import sys
sys.getsizeof(3)
28
In Python, an int is a fully-fledged object. It uses at least 24 bytes of bookkeeping overhead (such as a reference count and a pointer to its type), in addition to however many bytes are needed to store its value.
All ranges use approximately the same amount of memory regardless of where they start and end:
sys.getsizeof(x)
48
sys.getsizeof(range(1000))
48
You can use the list
function to convert a range into a list. A list stores
all of its elements in memory. So for most ranges, the equivalent list uses far
more memory:
sys.getsizeof(list(range(1000)))
8056
This raises another set of questions: What exactly is a range? When and how does Python compute the values in a range?
2.2.2. Comprehensions#
A comprehension is an efficient and elegant way to write a loop that produces a value for each item over which it iterates.
For example, suppose you want to get the length of each string in a list. You could use a for-loop:
strings = ["this", "is", "a", "python", "workshop"]
lens = []
for s in strings:
lens.append(len(s))
lens
[4, 2, 1, 6, 8]
Alternatively, you can do the same thing more concisely with a list comprehension:
lens = [len(s) for s in strings]
lens
[4, 2, 1, 6, 8]
The comprehension is also often slightly faster, because Python builds the list with optimized internal code rather than repeated calls to the append method.
Comprehensions can optionally include a condition. For example, suppose you
want to skip words that start with a
when counting lengths:
[len(s) for s in strings if not s.startswith("a")]
[4, 2, 6, 8]
In addition to list comprehensions, Python also supports dictionary comprehensions, which produce a dictionary, and set comprehensions, which produce a set.
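For instance, here's a brief sketch of both, reusing the strings list from above:
# A dictionary comprehension: map each string to its length
{s: len(s) for s in strings}          # {'this': 4, 'is': 2, 'a': 1, 'python': 6, 'workshop': 8}
# A set comprehension: the distinct lengths
{len(s) for s in strings}             # {1, 2, 4, 6, 8}, displayed in arbitrary order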
DataLab’s Python Basics Reader provides another explanation of comprehensions.
2.3. What’s a Class?#
In order to understand how Python’s iteration works at a lower level, first you need to know more about classes and objects.
A class is a description of the data stored and operations supported by objects of a specific type. You can think of classes as templates for different types of objects. Every object is an instance of the class that corresponds to its type.
You can create your own classes (and thus types) with the class
keyword. The
keyword must be followed by a name for the class and a colon, which begins a
new block of code. Any variables or functions defined in the block become
attributes of objects of the class.
Tip
This section explains how to create classes so that you can understand how Python works at a lower level. You generally won’t need to create classes just to analyze data or carry out other common data science tasks.
Like functions, classes provide a way to organize and reuse code. The main
difference is that classes define reusable data structures (or types) rather
than reusable operations. For instance, pandas.Series
and pandas.DataFrame
are both classes.
Classes are fundamental to a programming paradigm called object-oriented programming, which is especially useful for developing large, complex programs. You can learn much more about creating classes and object-oriented programming by reading this chapter of Think Python 2e.
For example, this code creates a new class called Temperature
, which
represents a temperature measurement with some unit. The attributes are value
and unit
:
class Temperature:
    value = 0
    unit = "celsius"
You can construct or instantiate new objects of a specific class by calling the class:
t = Temperature()
You can get or set attributes of an object with the .
operator:
t.value
0
t.value = 20
t.value
20
2.3.1. Methods#
A method is a function attached to an object (as an attribute). When a
method is called, the object itself is implicitly passed as the first argument.
By convention, the parameter name for this argument is self
.
As an example, here’s a new definition of the Temperature
class with a
print
method to neatly print out information about the object:
class Temperature:
    value = 0
    unit = "celsius"

    def print(self):
        msg = f"{self.value} degrees {self.unit}"
        print(msg)
To test out the method, first you need to create a new Temperature
object.
Python will not automatically update Temperature
objects you created before
redefining the class. Then call the method with no arguments:
t = Temperature()
t.print()
0 degrees celsius
Python automatically and implicitly passes t as the first argument to the print method.
Methods can have more than one parameter. Any parameters after the first are
assigned values from the arguments in parentheses ( )
when the method is
called, as in a regular Python function.
Here’s a new version of the Temperature
class with a to_unit
method to
convert the temperature between Celsius and Kelvin:
class Temperature:
    value = 0
    unit = "celsius"

    def print(self):
        msg = f"{self.value} degrees {self.unit}"
        print(msg)

    def to_unit(self, new_unit):
        if new_unit not in ["celsius", "kelvin"]:
            raise ValueError("Invalid unit.")
        if new_unit == self.unit:
            return
        elif new_unit == "kelvin":
            self.value = self.value + 273.15
        else:
            self.value = self.value - 273.15
        self.unit = new_unit
Once again, you must first create a new Temperature
object to test out the
new method:
t = Temperature()
t.value = 10
t.print()
t.to_unit("kelvin")
t.print()
10 degrees celsius
283.15 degrees kelvin
2.3.2. Special Methods#
The Temperature class would be better if Python called the print method any time it needed to print a Temperature object. Instead, Python prints the type and memory address of the object:
t
<__main__.Temperature at 0x7fda920d3290>
Fortunately, a class can customize operations such as printing by defining
special methods. These methods have specific names which usually begin and
end with two underscores __
, so they are sometimes also called double
underscore or dunder methods.
For instance, Python prints an object at the console by calling the __repr__
method and printing the string it returns. So here’s a new version of the
Temperature
class where the print
method is replaced by a __repr__
method:
class Temperature:
    value = 0
    unit = "celsius"

    def __repr__(self):
        msg = f"{self.value} degrees {self.unit}"
        return msg

    def to_unit(self, new_unit):
        if new_unit not in ["celsius", "kelvin"]:
            raise ValueError("Invalid unit.")
        if new_unit == self.unit:
            return
        elif new_unit == "kelvin":
            self.value = self.value + 273.15
        else:
            self.value = self.value - 273.15
        self.unit = new_unit
Now Temperature
objects print neatly:
t = Temperature()
t
0 degrees celsius
Python also provides the built-in function repr
as a standard way to call an
object’s __repr__
method directly:
repr(t)
'0 degrees celsius'
A wide variety of behaviors can be customized by defining special methods. A few examples are:
- __init__ for object construction, as in Temperature()
- __repr__ for display as a string, as in repr(t)
- __str__ for conversion to a string, as in str(t)
- __lt__, __gt__, __eq__, and more for comparisons, as in t < u
- __add__, __mul__, and more for arithmetic, as in t + u
- __getitem__ for indexing, as in t[i]
- __call__ for calls, as in t()
You can find a complete list of special methods in this chapter
of the Python Language Reference. In the next section, you’ll learn about the
__iter__
and __next__
special methods, which are fundamental to how
iteration works in Python.
2.4. What’s an Iterator?#
An iterator is any object which has a __next__ method. For-loops iterate over an iterator by calling this method at the beginning of each iteration to get the next input. Iteration stops when the method raises a StopIteration exception rather than returning a value.
None of the objects from the examples in Section 2.2 are
iterators. To check, note that you can use Python’s built-in dir
function to
get a list of an object’s attributes. For example, here’s code to check for a
__next__
method on a string:
"__next__" in dir("python")
False
Since __next__
is not present, strings are not iterators. The same is true
for lists, dictionaries, and ranges.
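You can run the same check on the other types from Section 2.2:
["__next__" in dir(obj) for obj in ([10, 20], {"a": 1}, range(5))]
[False, False, False]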
Instead, all of these objects are iterable, which means they have an
__iter__
method. When called, the __iter__
method returns an iterator over
the object to which it’s attached. Thus for-loops call __iter__
before the
first iteration to get an iterator, and then call __next__
at the beginning
of each iteration as described above.
Python provides the built-in functions next
and iter
to call the __next__
and __iter__
methods manually. For instance, here’s the code to manually
iterate over and print the first three items in a list:
x = [10, 20, 30]
xiter = iter(x)
i = next(xiter)
print(i)
i = next(xiter)
print(i)
i = next(xiter)
print(i)
10
20
30
Calling next
one more time raises a StopIteration
exception, which
indicates the iterator has no more values to produce:
next(xiter)
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
Cell In[29], line 1
----> 1 next(xiter)
StopIteration:
The code above is effectively the same as this code:
for i in x:
    print(i)
10
20
30
The next and iter functions are often useful when writing, testing, or debugging code. For example, they provide an easy way to peek at the first few elements of objects that don't support indexing (such as TensorFlow data sets).
One way to create a custom iterator is to create a class. For example, here’s an iterator (which is also iterable) that produces the specified number of elements from an object, cycling back to the beginning if the object is not long enough:
class Cycle:
    def __init__(self, word, n):
        self.word = word
        self.n = n
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.n == 0:
            raise StopIteration()
        char = self.word[self.i]
        self.i = (self.i + 1) % len(self.word)
        self.n = self.n - 1
        return char
And here’s an example of how it can be used:
"".join(Cycle("cat", 7))
'catcatc'
This example also demonstrates that for-loops are not the only way to use
iterable objects. Many functions for aggregation (such as sum
and max
) or
instantiating data structures (such as list
and dict
) also accept iterable
objects as input.
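For instance, the Cycle iterator defined above can be passed directly to such functions:
list(Cycle("cat", 5))
['c', 'a', 't', 'c', 'a']
max(Cycle("cat", 5))
't'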
2.5. Working with Iterables#
Python provides several built-in functions to make it easier to work with iterable objects.
The reversed
function reverses an iterable object. For instance:
for i in reversed("python"):
print(i)
n
o
h
t
y
p
The sorted
function sorts the elements of an iterable object. Here’s an
example:
for i in sorted("python"):
print(i)
h
n
o
p
t
y
The enumerate
function produces tuples by combining each element of an
iterable with a 0-based index indicating its position. The index is the first
element of each tuple:
for index, i in enumerate([10, 20, 30]):
    print(i, "at position", index)
10 at position 0
20 at position 1
30 at position 2
The enumerate
function is especially useful when you need to iterate over the
elements of an object but also need the positions of the elements. An example
application is iterating over a list of data sets and saving them in
sequentially numbered files.
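Here's a minimal, runnable sketch of that pattern, with small lists standing in for real data sets and a print call standing in for actually saving files:
# "datasets" is a placeholder for a list of real data sets.
datasets = [[1, 2], [3, 4], [5, 6]]
for i, dataset in enumerate(datasets):
    # In real code you might write each data set to a numbered file;
    # here we just print the file name that would be used.
    print(f"dataset_{i}.csv", dataset)
dataset_0.csv [1, 2]
dataset_1.csv [3, 4]
dataset_2.csv [5, 6]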
The zip
function produces tuples by combining two or more iterables. For
instance, suppose you have a list of names and a list of ages (or two data
frame columns):
for name, age in zip(["kim", "taylor"], [33, 45]):
    print(name, "is", age)
kim is 33
taylor is 45
The zip
function is useful when you need to iterate over two or more
iterables at once. If the iterables have different lengths, the zip
function
discards all elements beyond the end of the shortest iterable.
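For example, zipping a three-element list with a two-element list produces only two tuples:
list(zip([1, 2, 3], ["a", "b"]))
[(1, 'a'), (2, 'b')]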
The splat operator *
, which converts the elements of a container into
separate arguments, provides a way to invert the zip
function:
zipped = zip([1, 2, 3], [10, 20, 30])
zipped = list(zipped)
zipped
[(1, 10), (2, 20), (3, 30)]
unzipped = zip(*zipped)
list(unzipped)
[(1, 2, 3), (10, 20, 30)]
That is, use zip(*zipped) to invert a zip. One application of this pattern is separating the elements of several dictionaries with identical keys (or records) into tuples (or lists or columns):
records = [{"name": "doris", "age": 13}, {"name": "steve", "age": 81}]
pairs = [r.values() for r in records]
list(zip(*pairs))
[('doris', 'steve'), (13, 81)]
2.5.1. The itertools Module#
Python’s built-in itertools
module provides even more functions for working
with iterables. There are functions to combine, repeat, filter, group, and
more.
As an example, recall the Cycle class defined in Section 2.4, which produces a given number of elements from an object, cycling back to the beginning as necessary. You can get the same effect by combining the itertools functions cycle, which cycles through the elements of an iterable endlessly, and islice, which slices an iterable. Here's the code:
# With Cycle class
x = Cycle("cat", 7)
"".join(x)
'catcatc'
# With itertools
import itertools as it
x = it.islice(it.cycle("cat"), 7)
"".join(x)
'catcatc'
You can find a complete list of itertools
functions in the module’s
documentation.
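As a small taste of what else is available, the chain function concatenates iterables, and the count function counts upward endlessly from a starting value (sliced here so the result is finite):
list(it.chain("ab", "cd"))
['a', 'b', 'c', 'd']
list(it.islice(it.count(10), 3))
[10, 11, 12]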
2.6. What’s a Generator?#
Note
This section and the next section were informed by David Beazley’s fantastic presentation Generator Tricks for Systems Programmers, v3.0. If you want to learn even more about generators, check out his slides and videos.
A generator function (or generator) is a function which contains the
yield
keyword and therefore produces or generates a sequence of values. You
can think of generators as a concise way to create custom iterators. Moreover,
generators are lazy, which means values in the sequence are only computed
as needed. These properties make generators ideal for creating pipelines that
process big data.
Python evaluates generator functions differently from regular functions. When
you call a generator function, Python returns a generator iterator (or
generator), but doesn’t evaluate the code in the function. When you call
the iterator’s __next__
method, Python evaluates the function up to the first
yield
and then pauses. The next time you call the __next__
method,
evaluation resumes and then pauses at the next yield.
As an example, here’s a generator which counts down from n - 1
to 0:
def countdown(n):
    while n > 0:
        n = n - 1
        yield n
Calling the function returns a generator iterator:
ct = countdown(3)
ct
<generator object countdown at 0x7fda9071f040>
next(ct)
2
next(ct)
1
next(ct)
0
Just like an iterator, when the generator runs out of values, it raises a
StopIteration
exception:
next(ct)
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
Cell In[47], line 1
----> 1 next(ct)
StopIteration:
As another example, generators provide a third way to create an iterator which produces a given number of elements from an object, cycling back to the beginning as necessary. Section 2.4 showed how to create this iterator as a class, and Section 2.5.1 showed how to build it from functions in the itertools module. Here's the code to create it as a generator:
def cycle(word, n):
    i = 0
    while n > 0:
        yield word[i]
        i = (i + 1) % len(word)
        n = n - 1

x = cycle("cat", 7)
"".join(x)
'catcatc'
Generators generate values one at a time and only on demand. As a result, generators typically use less CPU time and far less memory than doing equivalent computations with a list. The drawback is that a generator is a one-time operation. If you want to iterate over the generated values more than once, you must either store them as they’re produced the first time, or recompute them by calling the generator function (and iterating) again.
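For example, here's a quick demonstration using the cycle generator defined above. Once the values have been consumed, iterating again produces nothing:
x = cycle("cat", 3)
list(x)
['c', 'a', 't']
list(x)
[]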
You can use the yield from
keyword to create a generator from another
generator (or iterator). For instance, here’s a generator that counts down from
n - 1
to 0 twice:
def double_countdown(n):
    yield from countdown(n)
    yield from countdown(n)
And here’s what the output from this generator looks like:
for i in double_countdown(2):
    print(i)
1
0
1
0
2.6.1. Generator Expressions#
Another way to create generators is by writing generator expressions. The
syntax for a generator expression is almost the same as for a list
comprehension, but enclosed in parentheses ( )
instead of square brackets [ ]
.
For example, here’s a generator expression version of the list comprehension from Section 2.2.2:
strings = ["this", "is", "a", "python", "workshop"]
lens = (len(s) for s in strings if not s.startswith("a"))
lens
<generator object <genexpr> at 0x7fda906ea340>
Generator expressions have all of the same properties as generators, but are more concise. So in the example, the lengths are not computed until they are explicitly requested:
list(lens)
[4, 2, 6, 8]
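A common idiom is to pass a generator expression directly to an aggregation function. In that case, the surrounding parentheses can be dropped:
sum(len(s) for s in strings)
21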
2.7. Data Processing Pipelines#
This section provides a concrete example of how you can use generators to solve a data processing problem. You’ll need the Easy Ham Emails data set linked in Section 2.1. The data set contains a collection of email messages, each stored in a separate file. The data processing goal is to create a pipeline to extract all email addresses from the emails.
The first step is to define a generator to open and yield the files:
def gen_open(paths):
    for path in paths:
        yield open(path, "rt", encoding = "latin-1")
The open function returns a file object, which is an iterator over the lines in the opened file. All of the lines in all of the files need to be processed the same way, so the next step is to define a generator that combines the lines from all of the files into a single sequence:
def gen_cat(files):
    for f in files:
        yield from f
The last step is to actually extract the emails from each line. One way to do
this is to split each line into terms, and then filter out any terms that don’t
contain @
somewhere in the middle (since these can’t be email addresses):
import re

def gen_emails(lines):
    for line in lines:
        for term in re.split("[^a-zA-Z@.]", line):
            if "@" in term.strip("@"):
                yield term
Now that the generators are defined, putting together the pipeline is just a sequence of calls:
from pathlib import Path
paths = Path("data/easy_ham").glob("*")
paths = (p for p in paths if not p.is_dir())
files = gen_open(paths)
lines = gen_cat(files)
at_lines = (l for l in lines if "@" in l)
emails = gen_emails(at_lines)
The email addresses are only computed on request. One way to request all of them is to generate a list:
emails = list(emails)
emails[:20]
['pudge@perl.org',
'pudge@perl.org',
'yyyy@localhost.example.com',
'jm@localhost',
'jm@localhost',
'perl@jmason.org',
'perl@jmason.org',
'pudge@perl.org',
'perl@example.com',
'admin@freshrpms.net',
'admin@freshrpms.net',
'yyyy@localhost.example.com',
'jm@localhost',
'jm@localhost',
'rpm@jmason.org',
'list@freshrpms.net',
'matthias@egwn.net',
'zzzlist@freshrpms.net',
'.matthias@egwn.net',
'b.matthias@egwn.net']
You can convert the list to a set to get the total number of unique email addresses:
len(set(emails))
1470
The advantage of this approach is that it scales to any number of email messages. Moreover, it's easy to swap out or reuse the components of the pipeline. For instance, the gen_open and gen_cat generator functions are essentially the same ones David Beazley uses for different tasks in his presentation Generator Tricks for Systems Programmers, v3.0. The gen_emails generator function could easily be replaced with a different function to extract other information, such as phone numbers, and additional generator functions could be added to the pipeline to further process the results, such as extracting domain names.
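For instance, here's a minimal sketch (not part of the original pipeline) of one more stage that could be appended to pull the domain name out of each address:
def gen_domains(emails):
    for email in emails:
        # Everything after the last "@" is the domain.
        yield email.split("@")[-1]
It could be chained onto the pipeline like any other stage; for example, set(gen_domains(emails)) would give the unique domains from the emails list built above.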
The disadvantages of this approach are that it can be difficult to understand if you’re not used to programming with generators, and it can be slightly harder to debug problems in the pipeline than it would be to debug a loop.