After completing this chapter, learners should be able to:
R is a program for statistical computing. It provides a rich set of built-in tools for cleaning, exploring, modeling, and visualizing data.
The main way you’ll interact with R is by writing code or expressions in the R programming language. Most people use “R” as a blanket term to refer to both the program and the programming language. Usually, the distinction doesn’t matter, but in cases where it does, we’ll point it out and be specific.
By writing code, you create an unambiguous record of every step taken in an analysis. This is one of the major advantages of R (and other programming languages) over point-and-click software like Tableau and Microsoft Excel. Code you write in R is reproducible: you can share it with someone else, and if they run it with the same inputs, they’ll get the same results.
Another advantage of writing code is that it’s often reusable. This can mean automating a repetitive task within a single analysis, recycling code from one analysis into another, or packaging useful code for distribution to the general public. At the time of writing, there were over 17,000 user-contributed packages available for R, spanning a broad range of disciplines.
R is one of many programming languages used in data science. Compared to other programming languages, R’s particular strengths are its interactivity, built-in support for handling missing data, the ease with which you can produce high-quality data visualizations, and its broad base of user-contributed packages (due to both its age and growing popularity).
You can download R for free here, and can find an install guide here.
In addition to R, you’ll need RStudio. RStudio is an integrated development environment (IDE), which means it’s a comprehensive program for writing, editing, searching, and running code. You can do all of these things without RStudio, but RStudio makes the process easier. You can download RStudio Desktop Open-Source Edition for free here, and can find an install guide here.
The first time you open RStudio, you’ll see a window divided into several panes, like this:
The console pane, on the left, is the main interface to R. If you type R code into the console and press the Enter key on your keyboard, R will run your code and return the result.
On the right are the environment pane and the plots pane. The environment pane shows data in your R workspace. The plots pane shows any plots you make, and also has tabs to browse your file system and to view R’s built-in help files. You’ll learn more about these gradually, but for now, focus on the console pane.
Let’s start by using R to do some arithmetic. In the console, you’ll see that the cursor is on a line that begins with `>`, called the prompt. You can make R compute the sum \(2 + 2\) by typing the code `2 + 2` after the prompt and then pressing the Enter key. Your code and the result from R should look like this:
2 + 2
[1] 4
R always puts the result on a separate line (or lines) from your code. In this case, the result begins with the tag `[1]`, which is a hint from R that the result is a vector and that this line starts with the element at position 1. You’ll learn more about vectors in Section 2.1, and eventually learn about other data types that are displayed differently. The result of the sum, `4`, is displayed after the tag. In this reader, results from R will usually be typeset in monospace and further prefixed with `##` to indicate that they aren’t code.
If you enter an incomplete expression, R will change the prompt to `+`, then wait for you to type the rest of the expression and press the Enter key. Here’s what it looks like if you only enter `2 +`:
You can finish entering the expression, or you can cancel it by pressing the Esc key (or Ctrl-c if you’re using R without RStudio). R can only tell an expression is incomplete if it’s missing something, like the second operand in `2 +`. So if you mean to enter `2 + 2` but accidentally enter `2`, which is a complete expression by itself, don’t expect R to read your mind and wait for more input!
Try out some other arithmetic in the R console. Besides `+` for addition, the other arithmetic operators are:
- `-` for subtraction
- `*` for multiplication
- `/` for division
- `%%` for remainder division (modulo)
- `^` or `**` for exponentiation
You can combine these and use parentheses to make more complicated expressions, just as you would when writing a mathematical expression. When R computes a result, it follows the standard order of operations: parentheses first, then exponentiation, then multiplication and division, and finally addition and subtraction. For example, to estimate the area of a circle with radius 3, you can write:
3.14 * 3^2
[1] 28.26
You can write R expressions with any number of spaces (including none) around the operators and R will still compute the result. Nevertheless, putting spaces in your code makes it easier for you and others to read, so it’s good to make it a habit. Put spaces around most operators, after commas, and after keywords.
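As a quick check of the order of operations described above, the following sketch compares a few expressions with and without parentheses. The numbers are arbitrary and chosen only for illustration.

```r
# Exponentiation happens before multiplication.
2 * 3^2      # same as 2 * (3^2), which is 18
(2 * 3)^2    # parentheses force the multiplication first, giving 36

# Multiplication happens before addition.
1 + 2 * 3    # same as 1 + (2 * 3), which is 7
(1 + 2) * 3  # parentheses force the addition first, giving 9

# Remainder division: 7 divided by 3 leaves a remainder of 1.
7 %% 3
```

When in doubt about how R will group an expression, adding parentheses costs nothing and makes the intent explicit.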
Since R is designed for mathematics and statistics, you might expect that it provides a better approximation for \(\pi\) than `3.14`. R and most other programming languages allow you to create named values, or variables. R provides a built-in variable called `pi` for the value of \(\pi\). You can display a variable’s value by entering its name in the console:
pi
[1] 3.141593
You can also use variables in expressions. For instance, here’s a more precise expression for the area of a circle with radius 3:
pi * 3^2
[1] 28.27433
You can define your own variables with the assignment operator `=` or `<-`. In most circumstances these two operators are interchangeable. For clarity, it’s best to choose one you like and use it consistently in all of your R code. In this reader, we use `=` for assignment because this is the assignment operator in most programming languages.
The main reason to use variables is to save results so that you can use them in other expressions later. For example, to save the area of the circle in a variable called `area`, you can write:
area = pi * 3^2
In R, variable names can contain any combination of letters, numbers, dots `.`, and underscores `_`, but must always start with a letter or a dot. Spaces and other symbols are not allowed in variable names.
Now you can use the `area` variable anywhere you want the computed area. Notice that when you assign a result to a variable, R doesn’t automatically display that result. If you want to see the result as well, you have to enter the variable’s name as a separate expression:
area
[1] 28.27433
Another reason to use variables is to make an expression more general. For instance, you might want to compute the area of several circles with different radii. Then the expression `pi * 3^2` is too specific. You can rewrite it as `pi * r^2`, and then assign a value to the variable `r` just before you compute the area. Here’s the code to compute and display the area of a circle with radius 1 this way:
r = 1
area = pi * r^2
area
[1] 3.141593
Now if you want to compute the area for a different radius, all you have to do is change `r` and run the code again (R will not change `area` until you do this). Writing code that’s general enough to reuse across multiple problems can be a big time-saver in the long run. Later on, you’ll learn ways to make this code even easier to reuse.
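The point that `area` does not update on its own can be seen in a short sketch. The radii 2 and 5 are arbitrary values picked for illustration.

```r
# Compute the area of a circle with radius 2.
r = 2
area = pi * r^2
area

# Changing r does not update area automatically...
r = 5
area    # still the area for radius 2

# ...until the expression is run again.
area = pi * r^2
area
```

R stores the computed number in `area`, not the formula, which is why the assignment has to be rerun after `r` changes.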
R treats anything inside single or double quotes as literal text rather than as an expression to evaluate. In programming jargon, a piece of literal text is called a string. You can use whichever kind of quotes you prefer, but the quote at the beginning of the string must match the quote at the end.
'Hi'
[1] "Hi"
"Hello!"
[1] "Hello!"
Numbers and strings are not the same thing, so for example R considers `1` different from `"1"`. As a result, you can’t use strings with most of R’s arithmetic operators. For instance, this code causes an error:
"1" + 3
Error in "1" + 3: non-numeric argument to binary operator
The error message notes that `+` is not defined for non-numeric values.
Besides arithmetic, you can also use R to compare values. The comparison operators are:
- `<` for “less than”
- `>` for “greater than”
- `<=` for “less than or equal to”
- `>=` for “greater than or equal to”
- `==` for “equal to”
- `!=` for “not equal to”
The “equal to” operator uses two equal signs so that R can distinguish it from `=`, the assignment operator.
Let’s look at a few examples:
1.5 < 3
[1] TRUE
"a" > "b"
[1] FALSE
pi == 3.14
[1] FALSE
"hi" == 'hi'
[1] TRUE
When you make a comparison, R returns a logical value, `TRUE` or `FALSE`, to indicate the result. Logical values are not the same as strings, so they are not quoted.
Logical values are values, so you can use them in other computations. For example:
TRUE
[1] TRUE
TRUE == FALSE
[1] FALSE
Section 2.4.5 describes more ways to use and combine logical values.
Beware that the comparison operators don’t always return `FALSE` when you compare two different types of data:
"1" == 1
[1] TRUE
"TRUE" <= TRUE
[1] TRUE
"FALSE" <= TRUE
[1] TRUE
Section 2.2.2 explains why this happens, and Section 5.1 explains several other ways to compare values.
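As a small illustration of the surprise above, the sketch below contrasts `==` with `identical`, a built-in R function that this section has not introduced. It also uses the built-in `class` function to show that the two values really are different kinds of data. Treat it as a preview under those assumptions rather than the approach the later sections necessarily take.

```r
# The string "1" and the number 1 are different kinds of values.
class("1")
class(1)

# == converts the number to a string before comparing, so this is TRUE.
"1" == 1

# identical() makes no such conversion, so this is FALSE.
identical("1", 1)

# Two values of the same kind and value are identical.
identical(1, 1)
```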
Most of R’s features are provided through functions, pieces of reusable code. You can think of a function as a machine that takes some inputs and uses them to produce some output. In programming jargon, the inputs to a function are called arguments, the output is called the return value, and when you use a function, you’re calling the function.
To call a function, write its name followed by parentheses. Put any arguments to the function inside the parentheses. For example, in R, the sine function is named `sin` (there are also `cos` and `tan`). So you can compute the sine of \(\pi / 4\) with this code:
sin(pi / 4)
[1] 0.7071068
There are many functions that accept more than one argument. For instance, the `sum` function accepts any number of arguments and adds them all together. When you call a function with multiple arguments, separate the arguments with commas. So another way to compute \(2 + 2\) in R is:
sum(2, 2)
[1] 4
When you call a function, R assigns each argument to a parameter. Parameters are special variables that represent the inputs to a function and only exist while that function runs. For example, the `log` function, which computes a logarithm, has parameters `x` and `base` for the operand and base of the logarithm, respectively. The next section, Section 1.3, explains how to look up the parameters for a function.
By default, R assigns arguments to parameters based on their order. The first argument is assigned to the function’s first parameter, the second to the second, and so on. So you can compute the logarithm of 64, base 2, with this code:
log(64, 2)
[1] 6
The argument 64 is assigned to the parameter `x`, and the argument 2 is assigned to the parameter `base`. You can also assign arguments to parameters by name with `=` (not `<-`), overriding their positions. So some other ways you could write the call above are:
log(64, base = 2)
[1] 6
log(x = 64, base = 2)
[1] 6
log(base = 2, x = 64)
[1] 6
log(base = 2, 64)
[1] 6
All of these are equivalent. When you write code, choose whatever seems the clearest to you. Leaving parameter names out of calls saves typing, but including some or all of them can make the code easier to understand.
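The matching rules above can be tried on a different input. The sketch below uses 8 and base 2, chosen arbitrarily, and also shows what happens when `base` is omitted: `log` then falls back to its default, the natural logarithm.

```r
# Positional matching: the first argument goes to x, the second to base.
log(8, 2)

# Named matching: the order no longer matters.
log(base = 2, x = 8)

# Mixed: named arguments are matched first, then the rest by position.
log(base = 2, 8)

# If base is left out, log uses its default (the natural logarithm),
# so the logarithm of e is 1.
log(exp(1))
```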
Parameters are not regular variables, and only exist while their associated function runs. You can’t set them before a call, nor can you access them after a call. So this code causes an error:
x = 64
log(base = 2)
Error in eval(expr, envir, enclos): argument "x" is missing, with no default
In the error message, R says that you forgot to assign an argument to the parameter `x`. You can keep the variable `x` and correct the call by making `x` an argument (for the parameter `x`):
log(x, base = 2)
[1] 6
Or, written more explicitly:
log(x = x, base = 2)
[1] 6
In summary, variables and parameters are distinct, even if they happen to have the same name. The variable `x` is not the same thing as the parameter `x`.
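One way to convince yourself of this distinction is to pass a different value to the parameter `x` and then check the variable `x` afterwards. The values 64, 100, and 10 in this sketch are arbitrary, and the variable name `result` is made up for the example.

```r
# A variable named x in the workspace.
x = 64

# Inside this call, the parameter x of log is 100, not 64.
result = log(x = 100, base = 10)
result

# The call did not change the variable x.
x
```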
Learning and using a language is hard, so it’s important to know how to get help. The first place to look for help is R’s built-in documentation. In the console, you can access a specific help page by name with `?` followed by the name of the page.
There are help pages for all of R’s built-in functions, usually with the same name as the function itself. So the code to open the help page for the `log` function is:
?log
For functions, help pages usually include a brief description, a list of parameters, a description of the return value, and some examples.
There are also help pages for other topics, such as built-in mathematical constants (such as `?pi`), data sets (such as `?iris`), and operators. To look up the help page for an operator, put the operator’s name in single or double quotes. For example, this code opens the help page for the arithmetic operators:
?"+"
It’s always okay to put quotes around the name of the page when you use `?`, but they’re only required if it contains non-alphabetic characters. So `?sqrt`, `?'sqrt'`, and `?"sqrt"` all open the documentation for `sqrt`, the square root function.
Sometimes you might not know the name of the help page you want to look up. You can do a general search of R’s help pages with `??` followed by a string of search terms. For example, to get a list of all help pages related to linear models:
??"linear model"
This search function doesn’t always work well, and it’s often more efficient to use an online search engine. When you search for help with R online, include “R” as a search term. Alternatively, you can use RSeek, which restricts the search to a selection of R-related websites.
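The `?` and `??` shortcuts have function equivalents, `help` and `help.search`, which some people find easier to remember because they follow the usual rules for calling functions. The sketch below also uses the built-in `args` function, not mentioned above, as a quick way to see a function's parameters without opening its help page. How the help pages are displayed depends on how you run R.

```r
# Same as ?log
help("log")

# Same as ?"+"
help("+")

# Same as ??"linear model"
help.search("linear model")

# Show just the parameters of log, without opening the help page.
args(log)
```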
As a programmer, sooner or later you’ll run some code and get an error message or result you didn’t expect. Don’t panic! Even experienced programmers make mistakes regularly, so learning how to diagnose and fix problems is vital.
Try going through these steps:
If none of these steps help, try asking online. Stack Overflow is a popular question and answer website for programmers. Before posting, make sure to read about how to ask a good question.
Most of the time, you won’t just write code directly into the R console. Reproducibility and reusability are important benefits of R over point-and-click software, and in order to realize these, you have to save your code to your computer’s hard drive. Let’s start by reviewing how files on a computer work. You’ll need to understand that before you can save your code, and it will also be important later on for loading data sets.
Your computer’s file system is a collection of files (chunks of data) and directories (or “folders”) that organize those files. For instance, the file system on a computer shared by Ada and Charles, two pioneers of computing, might look like this:
File systems have a tree-like structure, with a top-level directory called the root directory. On Ada and Charles’ computer, the root is called `/`, which is also what it’s called on all macOS and Linux computers. On Windows, the root is usually called `C:/`, but sometimes other letters, like `D:/`, are also used depending on the computer’s hardware.
A path is a list of directories that leads to a specific file or directory on a file system (imagine giving directions to someone as they walk through the file system). Use forward slashes `/` to separate the directories in a path. The root directory includes a forward slash as part of its name, and doesn’t need an extra one.
For example, suppose Ada wants to write a path to the file `cats.csv`. She can write the path like this:
/Users/ada/cats.csv
You can read this path from left-to-right as, “Starting from the root directory, go to the `Users` directory, then from there go to the `ada` directory, and from there go to the file `cats.csv`.” Alternatively, you can read the path from right-to-left as, “The file `cats.csv` inside of the `ada` directory, which is inside of the `Users` directory, which is in the root directory.”
As another example, suppose Charles wants a path to the `Programs` directory. He can write:
/Programs/
The `/` at the end of this path is a reminder that `Programs` is a directory, not a file. Charles could also write the path like this:
/Programs
This is still correct, but it’s not as obvious that `Programs` is a directory. In other words, when a path leads to a directory, including a trailing slash is optional, but makes the meaning of the path clearer. Paths that lead to files never have a trailing slash.
On Windows computers, paths are usually written with backslashes `\` to separate directories instead of forward slashes. Fortunately, R uses forward slashes `/` for all paths, regardless of the operating system. So when you’re working in R, use forward slashes and don’t worry about the operating system. This is especially convenient when you want to share code with someone who uses a different operating system than you.
A path that starts from the root directory, like all of the ones you’ve seen so far, is called an absolute path. The path is “absolute” because it unambiguously describes where a file or directory is located. The downside is that absolute paths usually don’t work well if you share your code.
For example, suppose Ada uses the path `/Users/ada/cats.csv` to load the `cats.csv` file in her code. If she shares her code with another pioneer of computing, say Gladys, who also has a copy of `cats.csv`, it might not work. Even though Gladys has the file, she might not have it in a directory called `ada`, and might not even have a directory called `ada` on her computer. Because Ada used an absolute path, her code works on her own computer, but isn’t portable to others.
On the other hand, a relative path is one that doesn’t start from the root directory. The path is “relative” to an unspecified starting point, which usually depends on the context.
For instance, suppose Ada’s code is saved in the file `analysis.R` (more about `.R` files in Section 1.4.2), which is in the same directory as `cats.csv` on her computer. Then instead of an absolute path, she can use a relative path in her code:
cats.csv
The context is the location of `analysis.R`, the file that contains the code. In other words, the starting point on Ada’s computer is the `ada` directory. On other computers, the starting point will be different, depending on where the code is stored.
Now suppose Ada sends her corrected code in `analysis.R` to Gladys, and tells Gladys to put it in the same directory as `cats.csv`. Since the path `cats.csv` is relative, the code will still work on Gladys’ computer, as long as the two files are in the same directory. The name of that directory and its location in the file system don’t matter, and don’t have to be the same as on Ada’s computer. Gladys can put the files in a directory `/Users/gladys/from_ada/` and the path (and code) will still work.
Relative paths can include directories. For example, suppose that Charles wants to write a relative path from the `Users` directory to a cool selfie he took. Then he can write:
charles/cool_hair_selfie.jpg
You can read this path as, “Starting from wherever you are, go to the `charles` directory, and from there go to the `cool_hair_selfie.jpg` file.” In other words, the relative path depends on the context of the code or program that uses it.
When you use paths in R code, they should almost always be relative paths. This ensures that the code is portable to other computers, which is an important aspect of reproducibility. Another benefit is that relative paths tend to be shorter, making your code easier to read (and write).
When you write paths, there are three shortcuts you can use. These are most useful in relative paths, but also work in absolute paths:
- `.` means the current directory.
- `..` means the directory above the current directory.
- `~` means the home directory. Each user has their own home directory, whose location depends on the operating system and their username. Home directories are typically found inside `C:/Users/` on Windows, `/Users/` on macOS, and `/home/` on Linux.
As an example, suppose Ada wants to write a (relative) path from the `ada` directory to Charles’ cool selfie. Using these shortcuts, she can write:
../charles/cool_hair_selfie.jpg
Read this as, “Starting from wherever you are, go up one directory, then go to the `charles` directory, and then go to the `cool_hair_selfie.jpg` file.” Since `/Users/ada` is Ada’s home directory, she could also write the path as:
~/../charles/cool_hair_selfie.jpg
This path has the same effect, but the meaning is slightly different. You can read it as “Starting from your home directory, go up one directory, then go to the `charles` directory, and then go to the `cool_hair_selfie.jpg` file.”
The `..` and `~` shortcuts are frequently used and worth remembering. The `.` shortcut is included here in case you see it in someone else’s code. Since it means the current directory, a path like `./cats.csv` is identical to `cats.csv`, and the latter is preferable for being simpler. There are a few specific situations where `.` is necessary, but they fall outside the scope of this text.
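R has a few built-in functions for taking paths apart and putting them together, which can help when experimenting with the ideas above. The sketch below uses `dirname`, `basename`, `file.path`, and `path.expand`; none of them require the files to exist. The paths are the ones from Ada and Charles’ example computer, and the variable names `path` and `selfie` are made up for the example.

```r
# An absolute path from the running example.
path = "/Users/ada/cats.csv"

# Split the path into the directory part and the file part.
dirname(path)
basename(path)

# Build a relative path from pieces; file.path joins them with "/".
selfie = file.path("..", "charles", "cool_hair_selfie.jpg")
selfie

# Expand the ~ shortcut to see where R thinks the home directory is.
path.expand("~")
```

The result of the last line depends on your operating system and username.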
Now that you know how file systems and paths work, you’re ready to learn how to save your R code to a file. R code is usually saved into an R script (extension `.R`) or an R Markdown file (extension `.Rmd`). R scripts are slightly simpler, so let’s focus on those.
In RStudio, you can create a new R script with this menu option:
File -> New File -> R Script
This will open a new pane in RStudio, like this:
The new pane is the scripts pane, which displays all of the R scripts you’re editing. Each script appears in a separate tab. In the screenshot, only one script, the new script, is open.
Editing a script is similar to editing any other text document. You can write, delete, copy, cut, and paste text. You can also save the file to your computer’s file system. When you do, pay attention to where you save the file, as you might need it later.
The contents of an R script should be R code. Anything else you want to write in the script (notes, documentation, etc.) should be in a comment. In R, comments begin with `#` and extend to the end of the line:
# This is a comment.
R will ignore comments when you run your code.
When you start a new project, it’s a good idea to create a specific directory for all of the project’s files. If you’re using R, you should also create one or more R scripts in that directory. As you work, write your code directly into a script. Arrange your code in the order of the steps to solve the problem, even if you write some parts before others. Comment out or delete any lines of code that you try but ultimately decide you don’t need. Make sure to save the file periodically so that you don’t lose your work. Following these guidelines will help you stay organized and make it easier to share your code with others later.
While editing, you can run the current line in the R console by pressing Ctrl+Enter on Windows and Linux, or Cmd+Enter on macOS. This way you can test and correct your code as you write it.
If you want, you can instead run (or source) the entire R script, by calling the `source` function with the path to the script as an argument. This is also what the “Source on Save” check box refers to in RStudio. The code runs in order, only stopping if an error occurs.
For instance, if you save the script as `my_cool_script.R`, then you can run `source("my_cool_script.R")` in the console to run the entire script (pay attention to the path, which may be different on your computer).
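To see `source` in action without creating a file by hand, the sketch below writes a tiny script to a temporary location and then runs it. The use of `tempfile` and `writeLines` is only a convenience for this demonstration, and the name `script_path` is made up; normally you would write the script in the RStudio scripts pane and save it yourself.

```r
# Create a small script in a temporary location (a stand-in for a script
# you would normally write and save in RStudio).
script_path = tempfile(fileext = ".R")
writeLines(c(
  "# Compute the area of a circle with radius 2.",
  "r = 2",
  "area = pi * r^2"
), script_path)

# Run the whole script, top to bottom.
source(script_path)

# Variables defined by the script are now available.
area
```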
R Markdown files are an alternative format for storing R code. They provide a richer set of formatting options, and are usually a better choice than R scripts if you’re writing a report that contains code. You can learn more about R Markdown files here.
Section 1.4.1 explained that relative paths have a starting point that depends on the context where the path is used. Let’s make that idea more concrete for R. The working directory is the starting point R uses for relative paths. Think of the working directory as the directory R is currently “at” or watching.
The function `getwd` returns the absolute path for the current working directory, as a string. It doesn’t require any arguments:
getwd()
[1] "/home/nick/mill/datalab/teaching/r_basics"
On your computer, the output from `getwd` will likely be different. This is a very useful function for getting your bearings when you write relative paths. If you write a relative path and it doesn’t work as expected, the first thing to do is check the working directory.
The related `setwd` function changes the working directory. It takes one argument: a path to the new working directory. Here’s an example:
setwd("..")
# Now check the working directory.
getwd()
Generally, you should avoid using calls to `setwd` in your R scripts and R Markdown files. Calling `setwd` makes your code more difficult to understand, and can always be avoided by using appropriate relative paths. If you call `setwd` with an absolute path, it also makes your code less portable to other computers. It’s fine to use `setwd` interactively (in the R console), but avoid making your saved code dependent on it.
Another function that’s useful for dealing with the working directory and file system is `list.files`. The `list.files` function returns the names of all of the files and directories inside of a directory. It accepts a path to a directory as an argument, or assumes the working directory if you don’t pass a path. For instance:
# List files and directories in /home/.
list.files("/home/")
[1] "lost+found" "nick"
# List files and directories in the working directory.
list.files()
[1] "_freeze" "_quarto.yml" "assessment" "chapters" "cover.png"
[6] "data" "docs" "images" "index.html" "index.qmd"
[11] "LICENSE" "notes" "R" "raw" "README.md"
[16] "site_libs"
As usual, since you have a different computer, you’re likely to see different output if you run this code. If you call `list.files` with an invalid path or an empty directory, the output is `character(0)`:
list.files("/this/path/is/fake/")
character(0)
Later on, you’ll learn about what `character(0)` means more generally.
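Because the output of `list.files` depends on your computer, it can help to try it on a directory whose contents you control. The sketch below creates a temporary directory with the built-in `tempfile`, `dir.create`, and `file.create` functions, which this section has not covered, purely to set up a predictable example. The name `demo_dir` and the file names are made up.

```r
# Make an empty temporary directory to experiment in.
demo_dir = tempfile("demo_")
dir.create(demo_dir)

# An empty directory has no files to list, so this prints character(0).
list.files(demo_dir)

# Create two empty files, then list the directory again.
file.create(file.path(demo_dir, "analysis.R"))
file.create(file.path(demo_dir, "cats.csv"))
list.files(demo_dir)
```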
Analyzing data sets is one of the most common things to do in R. The first step is to get R to read your data. Data sets come in a variety of file formats, and you need to identify the format in order to tell R how to read the data.
Most of the time, you can guess the format of a file by looking at its extension, the characters (usually three) after the last dot `.` in the filename. For example, the extension `.jpg` or `.jpeg` indicates a JPEG image file. Some operating systems hide extensions by default, but you can find instructions to change this setting online by searching for “show file extensions” and your operating system’s name. The extension is just part of the file’s name, so it should be taken as a hint about the file’s format rather than a guarantee.
R has built-in functions for reading a variety of formats. The R community also provides packages, shareable and reusable pieces of code, to read even more formats. You’ll learn more about packages later, in Section 3.2. For now, let’s focus on data sets that can be read with R’s built-in functions.
Here are several formats that are frequently used to distribute data, along with the name of a built-in function or contributed package that can read the format:
Name | Extension | Function or Package | Tabular? | Text? |
---|---|---|---|---|
Comma-separated Values | `.csv` | `read.csv` | Yes | Yes |
Tab-separated Values | `.tsv` | `read.delim` | Yes | Yes |
Fixed-width File | `.fwf` | `read.fwf` | Yes | Yes |
Microsoft Excel | `.xlsx` | readxl package | Yes | No |
Microsoft Excel 1993-2007 | `.xls` | readxl package | Yes | No |
Apache Arrow | `.feather` | arrow package | Yes | No |
R Data | `.rds` | `readRDS` | Sometimes | No |
R Data | `.rda` | `load` | Sometimes | No |
Plaintext | `.txt` | `readLines` | Sometimes | Yes |
Extensible Markup Language | `.xml` | xml2 package | No | Yes |
JavaScript Object Notation | `.json` | jsonlite package | No | Yes |
A tabular data set is one that’s structured as a table, with rows and columns. This reader focuses on tabular data sets, since they’re common in practice and present the fewest programming challenges. Here’s an example of a tabular data set:
Fruit | Quantity | Price |
---|---|---|
apple | 32 | 1.49 |
banana | 541 | 0.79 |
pear | 10 | 1.99 |
A text file is a file that contains human-readable lines of text. You can check this by opening the file with a text editor such as Microsoft Notepad or macOS TextEdit. Many file formats use text in order to make the format easier to work with.
For instance, a comma-separated values (CSV) file records tabular data using one line per row, with commas separating columns. If you store the table above in a CSV file and open the file in a text editor, here’s what you’ll see:
Fruit,Quantity,Price
apple,32,1.49
banana,541,0.79
pear,10,1.99
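The same idea can be checked from within R. The sketch below writes the fruit table to a temporary CSV file, then reads it back two ways: `readLines` returns the raw lines of text, while `read.csv` interprets those lines as a table. The temporary file and the names `csv_path` and `fruit` are made up for this demonstration.

```r
# Write the example table to a temporary CSV file.
csv_path = tempfile(fileext = ".csv")
writeLines(c(
  "Fruit,Quantity,Price",
  "apple,32,1.49",
  "banana,541,0.79",
  "pear,10,1.99"
), csv_path)

# readLines shows the file as plain lines of text.
readLines(csv_path)

# read.csv interprets the same text as a table with rows and columns.
fruit = read.csv(csv_path)
fruit
```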
A binary file is one that’s not human-readable. You can’t just read off the data if you open a binary file in a text editor, but binary files have a number of other advantages. Compared to text files, they are often faster to read and take up less storage space (bytes).
As an example, R’s built-in binary format is called RDS (which may stand for “R data serialized”). RDS files are extremely useful for backing up work, since they can store any kind of R object, even ones that are not tabular. You can learn more about how to create an RDS file on the `?saveRDS` help page, and how to read one on the `?readRDS` help page.
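A minimal round trip through an RDS file might look like the sketch below. It saves a single number to a temporary file and reads it back; the file location and the names `rds_path` and `restored` are assumptions made for the example, and the same pattern should apply to larger objects.

```r
# Any R object can be saved; here, the area of a circle with radius 3.
area = pi * 3^2

# Save the object to a temporary RDS file.
rds_path = tempfile(fileext = ".rds")
saveRDS(area, rds_path)

# Read it back and confirm that nothing changed.
restored = readRDS(rds_path)
identical(area, restored)
```

Note that `readRDS` returns the object rather than recreating the variable, so you choose the name to assign it to.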
Let’s read our first data set! Over the next few sections, you’re going to explore a data set about the people who are depicted on banknotes (paper money) from around the world. Click here to download the data set (you’ll need to click the “Download raw file” button).
The banknotes data set is derived from one created by The Pudding, a digital publication that makes awesome stories and visualizations with data. Check out their article about banknotes around the world.
The data set is in a file called `banknotes.csv`, which suggests it’s a CSV file. In this case, the extension is correct, so you can read the file into R with the built-in `read.csv` function. The first argument is the path to where you saved the file, which may be different on your computer. The `read.csv` function returns the data set, but R won’t keep the data in memory unless you assign the returned result to a variable:
banknotes = read.csv("data/banknotes.csv")
The variable name `banknotes` here is arbitrary; you can choose something different if you want. However, in general, it’s a good habit to choose variable names that describe the contents of the variable somehow.
If you tried running the line of code above and got an error message, pay attention to what the error message says, and remember the strategies to get help in Section 1.3. The most common mistake when reading a file is incorrectly specifying the path, so first check that you got the path right.
If you ran the line of code and there was no error message, congratulations, you’ve read your first data set into R!
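If the path is the problem, one way to investigate is to check the working directory and ask R whether the file exists before reading it. The sketch below uses the built-in `file.exists` function and an `if` statement, neither of which has been introduced yet, so treat it as a preview. The path `data/banknotes.csv` is the one used above and may need to change on your computer.

```r
# The path used above; adjust it to match where you saved the file.
path = "data/banknotes.csv"

# Relative paths start from the working directory.
getwd()

# Check whether the path leads to a file before trying to read it.
if (file.exists(path)) {
  banknotes = read.csv(path)
} else {
  message("No file at '", path, "' relative to ", getwd())
}
```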
Now that you’ve loaded the data, let’s take a look at it. When you’re working with a new data set, it’s usually not a good idea to print it out directly (by typing `banknotes`, in this case) until you know how big it is. Big data sets can take a long time to print, and the output can be difficult to read.
Instead, you can use the `head` function to print only the beginning, or head, of a data set. Let’s take a peek:
head(banknotes)
currency_code country currency_name name gender
1 ARS Argentina Argentinian Peso Eva Perón F
2 ARS Argentina Argentinian Peso Julio Argentino Roca M
3 ARS Argentina Argentinian Peso Domingo Faustino Sarmiento M
4 ARS Argentina Argentinian Peso Juan Manuel de Rosas M
5 ARS Argentina Argentinian Peso Manuel Belgrano M
6 AUD Australia Australian Dollar David Unaipon M
bill_count profession known_for_being_first current_bill_value
1 1.0 Activist No 100
2 1.0 Head of Gov't No 100
3 1.0 Head of Gov't No 50
4 1.0 Politician No 20
5 1.0 Founder Yes 10
6 0.5 STEM Yes 50
prop_total_bills first_appearance_year death_year
1 NA 2012 1952
2 NA 1988 1914
3 NA 1999 1888
4 NA 1992 1877
5 NA 1970 1820
6 0.48 1995 1967
comments
1
2
3
4
5 Came up with the first Argentine flag.
6 Shares with another person. In 1927 when his book of Aboriginal legends, Hurgarrda was published, he became the first Australian Aboriginal writer to be published.
hover_text has_portrait id
1 true ARS_Evita
2 true ARS_Argentino
3 true ARS_Domingo
4 true ARS_Rosas
5 Designed first Argentine flag true ARS_Belgrano
6 First Australian Aboriginal writer to be published true AUD_Unaipon
scaled_bill_value
1 1.0000000
2 1.0000000
3 0.4444444
4 0.1111111
5 0.0000000
6 0.4736842
This data set is tabular, as you might have already guessed, since it came from a CSV file. In R, it’s represented by a data frame, a table with rows and columns. R uses data frames to represent most (but not all) kinds of tabular data. The read.csv function, which you used to read this data, always returns a data frame.
For a data frame, the head function prints only the first six rows by default. If there are lots of columns or the columns are wide, as is the case here, R wraps the output across lines.
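Six rows is only the default. As a minimal sketch, the example below uses a small made-up data frame called toy (a stand-in, not part of the banknotes data) to show the n parameter of head, along with the companion function tail, which prints the end of a data set.

```r
# A made-up data frame with 10 rows, standing in for a larger data set.
toy = data.frame(id = 1:10, value = (1:10)^2)

# Print only the first 3 rows.
head(toy, n = 3)

# Print only the last 2 rows.
tail(toy, n = 2)
```

The same calls work on the banknotes data, for example head(banknotes, n = 3).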
When you first read an object into R, you might not know whether it’s a data frame. One way to check is visually, by printing it (as you just did with head). A better way to check is with the class function, which returns information about what an object is. For a data frame, the result will always contain data.frame:
class(banknotes)
[1] "data.frame"
You’ll learn more about classes in Section 2.2, but for now you can use this function to identify data frames.
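To see how class distinguishes data frames from other objects, here is a short sketch using a made-up data frame called toy. It also shows two built-in alternatives, is.data.frame and inherits, which answer the yes-or-no question directly; these are offered as one possible approach, not as the only way to check.

```r
# A made-up data frame for illustration.
toy = data.frame(x = 1:3)

class(toy)       # "data.frame"
class(1:3)       # "integer"
class("hello")   # "character"

# Yes-or-no checks that also work when an object has several classes.
is.data.frame(toy)            # TRUE
inherits(toy, "data.frame")   # TRUE
is.data.frame(1:3)            # FALSE
```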
By counting the columns in the output from head(banknotes), you can see that this data set has 17 columns. A more convenient way to check the number of columns in a data set is with the ncol function:
ncol(banknotes)
[1] 17
The similarly named nrow function returns the number of rows:
nrow(banknotes)
[1] 279
Alternatively, you can get both numbers at the same time with the dim (short for “dimensions”) function.
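As a brief sketch, here is dim applied to a small made-up data frame called toy. The result lists the number of rows first, then the number of columns. Given the nrow and ncol results above, dim(banknotes) should report 279 and 17.

```r
# A made-up data frame with 4 rows and 3 columns.
toy = data.frame(
  id = 1:4,
  value = c(2.5, 3.1, 4.8, 1.2),
  label = c("a", "b", "c", "d")
)

dim(toy)      # 4 3 (rows, then columns)
dim(toy)[1]   # 4, the same as nrow(toy)
dim(toy)[2]   # 3, the same as ncol(toy)
```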
Since the columns have names, you might also want to get just these. You can do that with the names or colnames functions. Both return the same result:
names(banknotes)
[1] "currency_code" "country" "currency_name"
[4] "name" "gender" "bill_count"
[7] "profession" "known_for_being_first" "current_bill_value"
[10] "prop_total_bills" "first_appearance_year" "death_year"
[13] "comments" "hover_text" "has_portrait"
[16] "id" "scaled_bill_value"
colnames(banknotes)
[1] "currency_code" "country" "currency_name"
[4] "name" "gender" "bill_count"
[7] "profession" "known_for_being_first" "current_bill_value"
[10] "prop_total_bills" "first_appearance_year" "death_year"
[13] "comments" "hover_text" "has_portrait"
[16] "id" "scaled_bill_value"
If the rows have names, you can get those with the rownames function. For this particular data set, the rows don’t have names.
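The sketch below contrasts a data set that has row names with one that doesn’t. It uses mtcars, an example data set that ships with R, and a made-up data frame called toy. When a data frame has no row names, rownames falls back to the row numbers, returned as text.

```r
# mtcars is built into R; its rows are named after car models.
rownames(mtcars)[1:3]   # "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"

# A made-up data frame without row names.
toy = data.frame(x = c(10, 20, 30))
rownames(toy)           # "1" "2" "3"
```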
An efficient way to get a sense of what’s actually in a data set is to have R compute summary information. This works especially well for data frames, but also applies to other data. R provides two different functions to get summaries: str and summary.
The str function returns a structural summary of an object. This kind of summary tells us about the structure of the data: the number of rows, the number and names of columns, what kind of data is in each column, and some sample values. Here’s the structural summary for the banknotes data:
str(banknotes)
'data.frame': 279 obs. of 17 variables:
$ currency_code : chr "ARS" "ARS" "ARS" "ARS" ...
$ country : chr "Argentina" "Argentina" "Argentina" "Argentina" ...
$ currency_name : chr "Argentinian Peso" "Argentinian Peso" "Argentinian Peso" "Argentinian Peso" ...
$ name : chr "Eva Perón" "Julio Argentino Roca" "Domingo Faustino Sarmiento" "Juan Manuel de Rosas" ...
$ gender : chr "F" "M" "M" "M" ...
$ bill_count : num 1 1 1 1 1 0.5 0.5 0.5 0.5 0.5 ...
$ profession : chr "Activist" "Head of Gov't" "Head of Gov't" "Politician" ...
$ known_for_being_first: chr "No" "No" "No" "No" ...
$ current_bill_value : int 100 100 50 20 10 50 10 20 10 50 ...
$ prop_total_bills : num NA NA NA NA NA 0.48 0.08 0.1 0.08 0.48 ...
$ first_appearance_year: int 2012 1988 1999 1992 1970 1995 1993 1994 1993 1995 ...
$ death_year : chr "1952" "1914" "1888" "1877" ...
$ comments : chr "" "" "" "" ...
$ hover_text : chr "" "" "" "" ...
$ has_portrait : chr "true" "true" "true" "true" ...
$ id : chr "ARS_Evita" "ARS_Argentino" "ARS_Domingo" "ARS_Rosas" ...
$ scaled_bill_value : num 1 1 0.444 0.111 0 ...
This summary lists information about each column, and includes most of what you found earlier by using several different functions separately. The summary uses chr to indicate columns of text (“characters”), int to indicate columns of integers, and num to indicate columns of other numbers, such as decimal values.
In contrast to str, the summary function returns a statistical summary of an object. This summary includes summary statistics for each column, choosing appropriate statistics based on the kind of data in the column. For numbers, this is generally the mean, median, and quantiles. For categories, this is the frequency of each category. Other kinds of statistics are shown for other kinds of data. Here’s the statistical summary for the banknotes data:
summary(banknotes)
currency_code country currency_name name
Length:279 Length:279 Length:279 Length:279
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
gender bill_count profession known_for_being_first
Length:279 Min. :0.2500 Length:279 Length:279
Class :character 1st Qu.:0.5000 Class :character Class :character
Mode :character Median :1.0000 Mode :character Mode :character
Mean :0.8456
3rd Qu.:1.0000
Max. :1.0000
current_bill_value prop_total_bills first_appearance_year death_year
Min. : 1 Min. :0.0100 Min. :1869 Length:279
1st Qu.: 20 1st Qu.:0.0550 1st Qu.:1980 Class :character
Median : 100 Median :0.1000 Median :1996 Mode :character
Mean : 4039 Mean :0.1669 Mean :1992
3rd Qu.: 1000 3rd Qu.:0.2300 3rd Qu.:2012
Max. :100000 Max. :0.7500 Max. :2021
NA's :220
comments hover_text has_portrait id
Length:279 Length:279 Length:279 Length:279
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
scaled_bill_value
Min. :0.00000
1st Qu.:0.01828
Median :0.11111
Mean :0.30606
3rd Qu.:0.48914
Max. :1.00000
NA's :1
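In the output above, text columns such as gender show only Length, Class, and Mode, because R treats them as plain text rather than as categories. As a hedged sketch, the example below uses a short made-up vector of values to show two built-in ways to get frequencies: the table function, and summary applied to a factor (R’s type for categorical data).

```r
# Made-up values standing in for a text column such as gender.
gender = c("F", "M", "M", "M", "M", "F")

# For plain text, summary reports only Length, Class, and Mode.
summary(gender)

# table counts how many times each value appears: F is 2, M is 4.
table(gender)

# Converting to a factor makes summary report the same counts.
summary(factor(gender))
```

For the banknotes data, the equivalent call would be table(banknotes$gender).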
You can select an individual column from a data frame by name with $, the dollar sign operator. The syntax is:
VARIABLE$COLUMN_NAME
For instance, for the banknotes data, banknotes$country selects the country column, and banknotes$first_appearance_year selects the first_appearance_year column. So one way to compute the mean of the first_appearance_year column is:
mean(banknotes$first_appearance_year)
[1] 1992.319
Similarly, to compute the range of the current_bill_value column:
range(banknotes$current_bill_value)
[1] 1 100000
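Both columns above have no missing values. A column that does, such as prop_total_bills (220 NA values in the summary output), behaves differently: by default, mean and range return NA if any value is missing. The sketch below uses a made-up data frame called toy to show the na.rm parameter, which tells these functions to drop missing values first.

```r
# A made-up column with missing values, standing in for prop_total_bills.
toy = data.frame(prop = c(NA, 0.48, 0.08, NA, 0.10))

mean(toy$prop)                  # NA, because some values are missing
mean(toy$prop, na.rm = TRUE)    # 0.22
range(toy$prop, na.rm = TRUE)   # 0.08 0.48
```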
You can also use the dollar sign operator to assign values to columns. For instance, to assign "USD" to the entire currency_code column:
banknotes$currency_code = "USD"
Be careful when you do this, as there is no undo. Fortunately, you haven’t saved any transformations to the banknotes data to your computer’s hard drive yet, so you can reset the banknotes variable back to what it was by reloading the data set:
banknotes = read.csv("data/banknotes.csv")
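Reloading works, but it can be slow for large files. Another option, sketched here with a made-up data frame called toy, is to keep a copy in a second variable before making changes. The name toy_backup is hypothetical; any variable name will do.

```r
# A made-up data frame for illustration.
toy = data.frame(code = c("ARS", "AUD"), value = c(100, 50))

# Keep a copy before modifying anything.
toy_backup = toy

# Overwrite the entire column; the single value is repeated for every row.
toy$code = "USD"
toy$code          # "USD" "USD"

# Restore the original from the copy.
toy = toy_backup
toy$code          # "ARS" "AUD"
```

Changing toy does not change toy_backup, so the copy stays intact until you need it.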
In Section 2.4, you’ll learn how to select rows and individual elements from a data frame, as well as other ways to select columns.
In a string, an escape sequence or escape code consists of a backslash followed by one or more characters. Escape sequences make it possible to write characters that are otherwise awkward to type inside a string, such as quotes, backslashes, and newlines. For example, the escape sequence \n corresponds to the newline character. There’s a complete list of escape sequences for R in the ?Quotes help file. Other programming languages also use escape sequences, and many of them are the same as in R.
Create a variable named newline that contains a string with a newline character. Then make R display the value of the variable by entering newline at the R prompt.
The message function prints output to the R console, so it’s one way you can make your R code report information as it runs. Use the message function to print newline.
Choose a directory on your computer that you’re familiar with, such as one you created. Determine the path to the directory, then use list.files to display its contents. Do the files displayed match what you see in your system’s file browser?
What does the all.files parameter of list.files do? Give an example.
The read.table function is another function for reading tabular data. Take a look at the help file for read.table. Recall that read.csv reads tabular data where the values are separated by commas, and read.delim reads tabular data where the values are separated by tabs.
What separator does read.table expect by default?
Is it possible to use read.table to read a CSV? Explain. If your answer is yes, show how to use read.table to load the banknotes data from Section 1.5.1.