25  Web Scraping

Note: Learning Goals

After this lesson, you should be able to:

  • Explain and read hypertext markup language (HTML)
  • View the HTML source of a web page
  • Use Firefox or Chrome’s web developer tools to locate tags within a web page
  • With the rvest package:
    • Read HTML into R
    • Extract HTML tables as data frames
  • With the xml2 package:
    • Use XPath to extract specific elements of a page

25.1 What’s in a Web Page?

Modern web pages usually consist of many files:

  • Hypertext markup language (HTML) for structure and formatting
  • Cascading style sheets (CSS) for more formatting
  • JavaScript (JS) for interactivity
  • Images

HTML is the only component that always has to be there. Since HTML is what gives a web page structure, it’s what we’ll focus on when scraping.

HTML is closely related to eXtensible markup language (XML). Both languages use tags to mark structural elements of data. In HTML, the elements literally correspond to the elements of a web page: paragraphs, links, tables, and so on.

Most tags come in pairs. The opening tag marks the beginning of an element and the closing tag marks the end. Opening tags are written <NAME>, where NAME is the name of the tag. Closing tags are written </NAME>.

A singleton tag is a tag that stands alone, rather than being part of a pair. Singleton tags are written <NAME />. In HTML (but not XML) they can also be written <NAME>. Fortunately, HTML only has a few singleton tags, so they can be distinguished by name regardless of which way they’re written.

For example, here’s some HTML that uses the em (emphasis, usually italic) and strong (usually bold) tags, as well as the singleton br (line break) tag:

<em><strong>This text</strong> is emphasized.<br /></em> Not emphasized

A pair of tags can contain other elements (paired or singleton tags), but not a lone opening or closing tag. This creates a strict, treelike hierarchy.
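For example, this snippet is invalid, because the em and strong pairs overlap rather than nest:

<em><strong>This text is wrong.</em></strong>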

Opening and singleton tags can have attributes that contain additional information. Attributes are name-value pairs written NAME="VALUE" after the tag name.

For instance, the HTML a (anchor) tag creates a link to the URL provided for the href attribute:

<a href="http://www.google.com/" id="mytag">My Search Engine</a>

In this case the tag also has a value set for the id attribute.

Now let’s look at an example of HTML for a complete, albeit simple, web page:

<html>
  <head>
    <title>This is the page title!</title>
  </head>
  <body>
    <h1>This is a header!</h1>
    <p>This is a paragraph.
      <a href="http://www.r-project.org/">Here's a website!</a>
    </p>
    <p id="hello">This is another paragraph.</p>
  </body>
</html>

In most web browsers, you can examine the HTML for a web page by right-clicking and choosing “View Page Source”.

For a more detailed explanation of HTML and a list of valid HTML elements, see an online reference such as the MDN Web Docs.

25.2 R’s XML Parsers

A parser converts structured data into familiar data structures. R has two popular packages for parsing XML (and HTML): XML and xml2.

The XML package has more features. The xml2 package is more user-friendly, and as part of the Tidyverse, it’s relatively well-documented. This lesson focuses on xml2, since most of the additional features in the XML package are related to writing (rather than parsing) XML documents.

The xml2 package is often used in conjunction with the rvest package, which provides support for CSS selectors (described later in this lesson) and automates scraping HTML tables.

The first time you use these packages, you’ll have to install them:

install.packages("xml2")
install.packages("rvest")

Let’s start by parsing the example of a complete web page from earlier. The xml2 function read_xml reads an XML document, and the rvest function read_html reads an HTML document. Both accept an XML/HTML string or a file path (including URLs):

html = r"(
<html>
  <head>
    <title>This is the page title!</title>
  </head>
  <body>
    <h1>This is a header!</h1>
    <p>This is a paragraph.
      <a href="http://www.r-project.org/">Here's a website!</a>
    </p>
    <p id="hello">This is another paragraph.</p>
  </body>
</html> )"

library("xml2")
library("rvest")

doc = read_html(html)
doc
{html_document}
<html>
[1] <head>\n<meta charset="UTF-8">\n<title>This is the page title!</title>\n< ...
[2] <body>\n    <h1>This is a header!</h1>\n    <p>This is a paragraph.\n     ...

The xml_children function returns all of the immediate children of a given element.

The top element of our document is the html tag, and its immediate children are the head and body tags:

tags = xml_children(doc)

The result from xml_children is a node set (xml_nodeset object). Think of a node set as a vector where the elements are tags rather than numbers or strings. Just like a vector, you can access individual elements with the indexing (square bracket [) operator:

length(tags)
[1] 2
head = tags[1]
head
{xml_nodeset (1)}
[1] <head>\n<meta charset="UTF-8">\n<title>This is the page title!</title>\n< ...

The xml_text function returns the text contained in a tag. Let’s get the text in the title tag, which is beneath the head tag. First we isolate the tag, then use xml_text. Note that read_html added a meta tag as the first child of head (you can see it in the output above), so the title tag is the second child:

title = xml_children(head)[2]
xml_text(title)
[1] "This is the page title!"

Navigating through the tags by hand is tedious and easy to get wrong, but fortunately there’s a better way to find the tags we want.

25.3 XPath

An XML document is a tree, similar to the file system on your computer:

html
├── head
│   └── title
└── body
    ├── h1
    ├── p
    └── p
        └── a

When we wanted to find files, we wrote file paths. We can do something similar to find XML elements.

XPath is a language for writing paths to elements in an XML document. XPath is not R-specific. At a glance, an XPath looks similar to a file path:

XPath   Description
/       root, or element separator
.       current tag
..      parent tag
*       any tag (wildcard)

The xml2 function xml_find_all finds all elements at a given XPath:

xml_find_all(doc, "/html/body/p")
{xml_nodeset (2)}
[1] <p>This is a paragraph.\n      <a href="http://www.r-project.org/">Here's ...
[2] <p id="hello">This is another paragraph.</p>

Unlike a file path, an XPath can identify multiple elements. If you only want a specific element, use indexing to get it from the result.
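For instance, to keep just the first paragraph from the result above:

first_p = xml_find_all(doc, "/html/body/p")[1]
first_p
{xml_nodeset (1)}
[1] <p>This is a paragraph.\n      <a href="http://www.r-project.org/">Here's ...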

XPath also has some features that are different from file paths. The // separator means “at any level beneath.” It’s a useful shortcut when you want to find a specific element but don’t care where it is.

Let’s get all of the p elements at any level of the document:

xml_find_all(doc, "//p")
{xml_nodeset (2)}
[1] <p>This is a paragraph.\n      <a href="http://www.r-project.org/">Here's ...
[2] <p id="hello">This is another paragraph.</p>

Let’s also get all a elements at any level beneath a p element:

xml_find_all(doc, "//p/a")
{xml_nodeset (1)}
[1] <a href="http://www.r-project.org/">Here's a website!</a>

The vertical bar | means “or.” You can use it to get two different sets of elements in one query.

Let’s get all h1 or p tags:

xml_find_all(doc, "//h1|//p")
{xml_nodeset (3)}
[1] <h1>This is a header!</h1>
[2] <p>This is a paragraph.\n      <a href="http://www.r-project.org/">Here's ...
[3] <p id="hello">This is another paragraph.</p>

25.3.1 Predicates

In XPath, the predicate operator [] gets elements at a position or matching a condition. Most conditions are about the attributes of the element. In the predicate operator, attributes are always prefixed with @.

For example, suppose we want to find all tags where the id attribute is equal to "hello":

xml_find_all(doc, "//*[@id = 'hello']")
{xml_nodeset (1)}
[1] <p id="hello">This is another paragraph.</p>

Notice that the equality operator in XPath is =, not ==. Strings in XPath can be quoted with single or double quotes.

You can combine multiple conditions in the predicate operator with and and or. There are also several XPath functions you can use in the predicate operator. These functions are not R functions, but rather built into XPath. Here are a few:

XPath         Description
not()         negation
contains()    check whether string x contains string y
text()        get tag text
substring()   get a substring

For instance, suppose we want to get elements that contain the word “paragraph”:

xml_find_all(doc, "//*[contains(text(), 'paragraph')]")
{xml_nodeset (2)}
[1] <p>This is a paragraph.\n      <a href="http://www.r-project.org/">Here's ...
[2] <p id="hello">This is another paragraph.</p>
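We can also negate a condition with not(). For instance, here’s a query for every p element that does not have an id attribute, which matches only the first paragraph:

xml_find_all(doc, "//p[not(@id)]")
{xml_nodeset (1)}
[1] <p>This is a paragraph.\n      <a href="http://www.r-project.org/">Here's ...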

Finally, note that you can also use the predicate operator to get elements at a specific position. For example, //p[2] gets each p element that is the second p among its siblings. In our document both p tags have the same parent (the body tag), so this is simply the second p in the document:

xml_find_all(doc, "//p[2]")
{xml_nodeset (1)}
[1] <p id="hello">This is another paragraph.</p>

Notice that this is the same as if we had used R to get the second element:

xml_find_all(doc, "//p")[2]
{xml_nodeset (1)}
[1] <p id="hello">This is another paragraph.</p>
Warning

Beware that although the XPath predicate operator resembles R’s indexing operator, the two don’t always mean the same thing. For example, //p[2] selects the second p within each parent, while indexing the result of //p in R selects the second p overall.

We’ll learn more XPath in the examples. There’s a complete list of XPath functions on Wikipedia.

25.4 The Web Scraping Workflow

Scraping a web page is part technology, part art. The goal is to find an XPath that’s concise but specific enough to identify only the elements you want. If you plan to scrape the web page again later or want to scrape a lot of similar web pages, the XPath also needs to be general enough that it still works even if there are small variations.

Firefox and Chrome include “web developer tools” that are invaluable for planning a web scraping strategy. Press Ctrl + Shift + i (Cmd + Shift + i on OS X) in Firefox or Chrome to open the web developer tools.

We can also use the web developer tools to interactively identify the element that corresponds to a specific part of a web page. Press Ctrl + Shift + c and then click on the part of the web page you want to identify.

The best way to approach web scraping (and programming in general) is as an incremental, iterative process. Use the web developer tools to come up with a basic strategy, try it out in R, check which parts don’t work, and then repeat to adjust the strategy. Expect to go back and forth between your web browser and R several times when you’re scraping.

Most scrapers follow the same four steps, regardless of the web page and the language of the scraper:

  1. Download pages with an HTTP request (usually GET)
  2. Parse pages to extract text
  3. Clean up extracted text with string methods or regex
  4. Save cleaned results

In R, xml2’s read_xml function takes care of step 1 for you, although you can also use httr functions to make the request yourself.
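For example, here’s a minimal sketch of making the request yourself with httr (assuming the package is installed) and parsing the response body; the URL is just a placeholder:

library("httr")

# Download the page with a GET request.
response = GET("https://www.r-project.org/")
# Raise an error if the request failed (a 4xx or 5xx status code).
stop_for_status(response)
# Extract the body of the response as text, then parse the HTML.
doc = read_html(content(response, as = "text", encoding = "UTF-8"))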

25.4.1 Being Polite

Making an HTTP request is not free! Each request costs the server CPU time, bandwidth, and ultimately money. Server administrators will not appreciate it if you make too many requests or make requests too quickly. So:

  • If you’re making multiple requests, slow them down by using R’s Sys.sleep function to make R do nothing for a moment. Aim for no more than 20-30 requests per second, unless you’re using an API that says more are okay.
  • Avoid requesting the same page twice. One way to do this is by caching (saving) the results of the requests you make. You can do this manually (a sketch follows below), or use a package that does it automatically, like the httpcache package.
Important

Failing to be polite can get you banned from websites! Also check the website’s terms of service to make sure scraping is not explicitly forbidden.
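Here’s a minimal sketch of the manual caching approach. The function name and cache directory are just for illustration, and it assumes xml2 or rvest is loaded for read_html:

read_html_cached = function(url, cache_dir = "cache") {
  # Create the cache directory if it doesn't already exist.
  dir.create(cache_dir, showWarnings = FALSE)
  # Turn the URL into a file name for the cached copy.
  path = file.path(cache_dir, paste0(make.names(url), ".html"))
  # Only make an HTTP request if the page isn't already cached.
  if (!file.exists(path)) {
    download.file(url, path, quiet = TRUE)
    # Sleep so we never make more than 30 requests per second.
    Sys.sleep(1 / 30)
  }
  read_html(path)
}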

25.4.2 Case Study: CA Cities

Wikipedia has many pages that are just tables of data. For example, there’s this list of cities and towns in California. Let’s scrape the table to get a data frame.

Step 1 is to download the page:

wiki_url =
  "https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_California"

wiki_doc = read_html(wiki_url)

Step 2 is to extract the table element from the page. We can use Firefox or Chrome’s web developer tools to identify the table. HTML tables usually use the table tag. Let’s see if it’s the only table in the page:

tables = xml_find_all(wiki_doc, "//table")
tables
{xml_nodeset (4)}
[1] <table class="wikitable sortable"><tbody>\n<tr>\n<th scope="row" style="b ...
[2] <table class="wikitable plainrowheaders sortable"><tbody>\n<tr>\n<th scop ...
[3] <table class="nowraplinks hlist mw-collapsible autocollapse navbox-inner" ...
[4] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" style ...

The page has 4 tables. We can either make our XPath more specific, or use indexing to get the table we want. Refining the XPath makes our scraper more robust, but indexing is easier.

For the sake of learning, let’s refine the XPath. Going back to the browser, we can see that the table includes "wikitable" and "sortable" in its class attribute. So let’s search for these among the table elements:

tab = xml_find_all(tables, "//*[contains(@class, 'sortable')]")
tab
{xml_nodeset (2)}
[1] <table class="wikitable sortable"><tbody>\n<tr>\n<th scope="row" style="b ...
[2] <table class="wikitable plainrowheaders sortable"><tbody>\n<tr>\n<th scop ...

Now we get just two tables, and the second one is the one we want! Note that because this XPath begins with //, it is an absolute path: it searches the whole document again, not just the nodes in tables (to search within a node set, start the XPath with .//). Only the table elements here have "sortable" in their class, so we get the same result as the single XPath //table[contains(@class, 'sortable')].
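In code, the one-XPath version is:

tab = xml_find_all(wiki_doc, "//table[contains(@class, 'sortable')]")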

The next part of extracting the data is to extract the value from each individual cell in the table. HTML tables have a strict layout order, with tags to indicate rows and cells. We could extract each cell by hand and then reassemble them into a data frame, but the rvest function html_table can do it for us automatically:

cities = html_table(tab, fill = TRUE)
cities = cities[[2]]
head(cities)
Name Type County Population (2020)[1] Population (2010)[9] Change Land area[10] Land area[10] Population density[10] Incorporated[8]
Name Type County Population (2020)[1] Population (2010)[9] Change sq mi km2 Population density[10] Incorporated[8]
Adelanto City San Bernardino 38,046 31,765 +19.8% 52.87 136.9 719.6/sq mi (277.8/km2) December 22, 1970
Agoura Hills City Los Angeles 20,299 20,330 −0.2% 7.80 20.2 2,602.4/sq mi (1,004.8/km2) December 8, 1982
Alameda City Alameda 78,280 73,812 +6.1% 10.45 27.1 7,490.9/sq mi (2,892.3/km2) April 19, 1854
Albany City Alameda 20,271 18,539 +9.3% 1.79 4.6 11,324.6/sq mi (4,372.4/km2) September 22, 1908
Alhambra City Los Angeles 82,868 83,089 −0.3% 7.63 19.8 10,860.8/sq mi (4,193.4/km2) July 11, 1903

The fill = TRUE argument ensures that empty cells are filled with NA. We’ve successfully imported the data from the web page into R, so we’re done with step 2.

Step 3 is to clean up the data frame. The column names contain symbols, the first row is part of the header, and the column types are not correct.

# Fix column names.
names(cities) = c(
  "city", "type", "county", "population2020", "population2010",
  "population_change", "mi2", "km2", "density", "date"
)

# Remove fake first row.
cities = cities[-1, ]
# Reset row names.
rownames(cities) = NULL

How can we clean up the date column? The as.Date function converts a string into a date R understands. The idea is to match the date string to a format string where the components of the date are indicated by codes that start with %. For example, %B stands for the full month name, %d for the day as a two-digit number, and %Y for the four-digit year. You can read about the different date format codes in ?strptime.

Here’s the code to convert the dates in the data frame:

dates = as.Date(cities$date, "%B %d, %Y")
cities$date = dates
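As with any conversion, it’s a good idea to check for missing values, which can mean some of the dates didn’t match the format string:

any(is.na(cities$date))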

We can also convert the population to a number:

class(cities$population2020)
[1] "character"
# Remove commas and footnotes (e.g., [1]) before conversion
library("stringr")

pop = str_replace_all(cities$population2020, ",", "")
pop = str_replace_all(pop, "\\[[0-9]+\\]", "")
pop = as.numeric(pop)

# Check for missing values, which can mean conversion failed
any(is.na(pop))
[1] FALSE
cities$population2020 = pop
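Step 4 is to save the cleaned results so that we don’t have to scrape the page again. For example, we can write the data frame to a CSV file (the file name is arbitrary):

write.csv(cities, "ca_cities.csv", row.names = FALSE)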

25.4.3 Case Study: The CA Aggie

Suppose we want to scrape The California Aggie.

In particular, we want to get all the links to news articles on the features page https://theaggie.org/category/features/. This could be one part of a larger scraping project where we go on to scrape individual articles.

First for Step 1, let’s download the features page so we can extract the links:

url = "https://theaggie.org/category/features/"
doc = read_html(url)

We know that links are in a tags, but we only want links to articles. Looking at the features page with the web developer tools, the links to feature articles are all inside of a div tag with class td_block_inner. So for Step 2, let’s get that tag:

xml_find_all(doc, "//div[contains(@class, 'td_block_inner')]")
{xml_nodeset (9)}
[1] <div id="tdi_50" class="td_block_inner td-fix-index">\n<div class="tdb-ma ...
[2] <div id="tdi_51" class="td_block_inner">\n        <div class="tdb_module_ ...
[3] <div id="tdi_52" class="td_block_inner">\n        <div class="tdb_module_ ...
[4] <div id="tdi_53" class="td_block_inner">\n        <div class="tdb_module_ ...
[5] <div id="tdi_54" class="td_block_inner">\n        <div class="tdb_module_ ...
[6] <div id="tdi_55" class="td_block_inner">\n        <div class="tdb_module_ ...
[7] <div id="tdi_56" class="td_block_inner">\n        <div class="tdb_module_ ...
[8] <div id="tdi_57" class="td_block_inner">\n        <div class="tdb_module_ ...
[9] <div id="tdi_74" class="td_block_inner tdb-block-inner td-fix-index">\n   ...
# OR html_nodes(doc, "div.td_block_inner")

That returns a lot of results, so let’s try using the id attribute instead; when we inspected the page, the article list’s div had id "tdi_113". Usually the id of an element is unique, so this ensures that we get the right section.

We can also add in a part about getting links now:

div = xml_find_all(doc, "//div[@id = 'tdi_113']")
# OR html_nodes(doc, "div#tdi_113")

links = xml_find_all(div, ".//a")
# OR html_nodes(div, "a")

length(links)
[1] 45

That gives us 45 links, but there are only 15 articles on the page, so something’s still not right. Inspecting the page again, there are actually three links to each article: on the image, on the title, and on “Continue Reading”.

Let’s focus on the title link. All of the title links are inside of an h3 tag. Generally it’s more robust to rely on tags (structure) than to rely on attributes (other than id and class). So let’s use the h3 tag here:

links = xml_find_all(div, ".//h3/a")
# OR html_nodes(div, "h3 > a")

length(links)
[1] 15

Now we’ve got the 15 links, so let’s get the URLs from the href attribute.

feature_urls = xml_attr(links, "href")

The other article listings (Sports, Science, etc.) on The Aggie have a similar structure, so we can potentially reuse our code to scrape those.

So let’s turn our code into a function. The input will be a downloaded page, and the output will be the article links.

parse_article_links = function(page) {
  div = xml_find_all(page, "//div[@id = 'tdi_113']")
  links = xml_find_all(div, ".//h3/a")
  xml_attr(links, "href")
}

We can test this out on the Sports page. First we download the page:

sports = read_html("https://theaggie.org/category/sports")

Then we call the function on the document:

sports_urls = parse_article_links(sports)
head(sports_urls)

Assuming the sports page has the same structure as the features page, this returns its article URLs, so the function works on other pages too! We can also set up the function to extract the link to the next page, in case we want to scrape multiple pages of links.

The link to the next page of features (an arrow at the bottom) is an a tag whose aria-label attribute contains 'next-page', inside a div with class page-nav. Let’s see if that’s specific enough to isolate the tag:

nav = xml_find_all(doc, "//div[contains(@class, 'page-nav')]")
# OR html_nodes(doc, "div.page-nav")
next_page = xml_find_all(nav, ".//a[contains(@aria-label, 'next-page')]")
# OR html_nodes(nav, "a[aria-label *= 'next-page']")

It looks like it is. We used contains rather than = for the class attribute because it’s common for class to have several space-separated parts; contains matches as long as 'page-nav' appears among them, which also makes our code robust against small changes in the future.

We can now modify our parser function to return the link to the next page:

parse_article_links = function(page) {
  # Get article URLs
  div = xml_find_all(page, "//div[@id = 'tdi_113']")
  links = xml_find_all(div, ".//h3/a")
  urls = xml_attr(links, "href")

  # Get next page URL
  nav = xml_find_all(page, "//div[contains(@class, 'page-nav')]")
  next_page = xml_find_all(nav, ".//a[contains(@aria-label, 'next-page')]")
  next_url = xml_attr(next_page, "href")

  # Using a list allows us to return two objects
  list(urls = urls, next_url = next_url)
}

Since our function gets the URL for the next page, what happens on the last page?

Looking at the last page in the browser, there is no link to the next page. Let’s see what our scraper function does:

last_page = read_html("https://theaggie.org/category/features/page/187/")
parse_article_links(last_page)
$urls
character(0)

$next_url
character(0)

We get an empty character vector as the URL for the next page. This is because the xml_find_all function returns an empty node set for the next page URL, so there aren’t any href fields for xml_attr to extract. It’s convenient that the xml2 functions behave this way, but we could also add an if-statement to the function to check (and possibly return NA as the next URL in this case).

Then the code becomes:

parse_article_links = function(page) {
  # Get article URLs
  div = xml_find_all(page, "//div[@id = 'tdi_113']")
  links = xml_find_all(div, ".//h3/a")
  urls = xml_attr(links, "href")

  # Get next page URL
  nav = xml_find_all(page, "//div[contains(@class, 'page-nav')]")
  next_page = xml_find_all(nav, ".//a[contains(@aria-label, 'next-page')]")
  if (length(next_page) == 0) {
    next_url = NA
  } else {
    next_url = xml_attr(next_page, "href")
  }

  # Using a list allows us to return two objects
  list(urls = urls, next_url = next_url)
}

Now our function should work well even on the last page.

If we want to scrape links to all of the articles in the features section, we can use our function in a loop:

# NOTE: This code is likely to take a while to run, and is meant more for
# reading than for you to run and try out.

url = "https://theaggie.org/category/features/"
article_urls = list()
i = 1

# On the last page, the next URL will be `NA`.
while (!is.na(url)) {
  # Download and parse the page.
  page = read_html(url)
  result = parse_article_links(page)

  # Save the article URLs in the `article_urls` list. The variable `i` is the
  # page number.
  article_urls[[i]] = result$urls
  i = i + 1

  # Set the URL to the next URL.
  url = result$next_url

  # Sleep for 1/30th of a second so that we never make more than 30 requests
  # per second.
  Sys.sleep(1/30)
}
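After the loop finishes, each element of article_urls is a character vector of one page’s article URLs. We can combine them into a single character vector with unlist:

all_urls = unlist(article_urls)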

Now we’ve got the basis for a simple scraper.

25.5 CSS Selectors

Cascading style sheets (CSS) is a language used to control the formatting of an XML or HTML document.

CSS selectors are the CSS way to write paths to elements. CSS selectors are more concise than XPath, so many people prefer them. Even if you prefer CSS selectors, it’s good to know XPath because CSS selectors are less flexible.

Here’s the basic syntax of CSS selectors:

CSS           Description
a             tags a
a > b         tags b directly beneath a
a b           tags b anywhere beneath a
a, b          tags a or b
#hi           tags with attribute id="hi"
.hi           tags with attribute class that contains "hi"
[foo="hi"]    tags with attribute foo="hi"
[foo*="hi"]   tags with attribute foo that contains "hi"

If you want to learn more, CSS Diner is an interactive tutorial that covers the entire CSS selector language.

In Firefox, you can get CSS selectors from the web developer tools. Right-click the tag you want a selector for and choose “Copy Unique Selector”. Beware that the selectors Firefox generates are often too specific to be useful for anything beyond the simplest web scrapers.

The rvest package uses CSS selectors by default. Behind the scenes, the package translates these into XPath and passes them to xml2.

Here are a few examples of CSS selectors, using rvest’s html_nodes function:

html = r"(
<html>
  <head>
    <title>This is the page title!</title>
  </head>
  <body>
    <h1>This is a header!</h1>
    <p>This is a paragraph.
      <a href="http://www.r-project.org/">Here's a website!</a>
    </p>
    <p id="hello">This is another paragraph.</p>
  </body>
</html> )"

doc = read_html(html)

# Get all p elements
html_nodes(doc, "p")
{xml_nodeset (2)}
[1] <p>This is a paragraph.\n      <a href="http://www.r-project.org/">Here's ...
[2] <p id="hello">This is another paragraph.</p>
# Get all links
html_nodes(doc, "a")
{xml_nodeset (1)}
[1] <a href="http://www.r-project.org/">Here's a website!</a>
# Get all tags with id="hello"
html_nodes(doc, "#hello")
{xml_nodeset (1)}
[1] <p id="hello">This is another paragraph.</p>
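The other selector forms work the same way. For instance, the descendant selector p a finds a tags anywhere beneath a p tag, which here is the same link as before:

# Get all a elements anywhere beneath a p element
html_nodes(doc, "p a")
{xml_nodeset (1)}
[1] <a href="http://www.r-project.org/">Here's a website!</a>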