1 Concepts

1.1 What is a Relational Database?

A relational database is a collection of tables (organized in rows and columns of data) that are related to each other in some way.

Database tables would be the equivalent to spreadsheets in Excel and data frames in programming languages like R or Python.

We’ll go more in-depth on the relational part when we cover the workshop dataset!

1.2 What is SQL?

SQL stands for structured query language. SQL is a programming language that allows you to answer questions about the data in a database using a standard set of keywords. You can pronounce SQL as “ess cue ell” or “sequel”.

  • A query is a request for information.
  • A SQL query is a set of instructions made up of keywords sent to the database to retrieve rows of table(s).

1.2.1 What kinds of questions can SQL answer?

If you already work with data, you probably think about answering specific questions such as:

  • How many of X or Y are there?
  • What are the unique values of X?
  • Which of X, Y, and Z that occurred in a given time period?

You can use SQL to answer these kinds of questions.

1.3 What is a Relational Database Management System?

A relational database management system (RDBMS) is a software system that manages the users of and tables within a database. There are many different systems available. For instance, you may have heard of MySQL, Postgres, and Microsoft SQL Server.

For this workshop, we’ll use SQLite, which is likely the simplest and most widely-used RDBMS. It runs on Windows, MacOS X, and Linux with no setup necessary!

Every RDBMS has its own implementation or “dialect” of SQL. In other words, the set of SQL keywords supported differs from one RDBMS to another, and sometimes queries have to be written differently. Details about the supported keywords for a given RDBMS can be found in that system’s documentation. The keywords covered in this workshop are supported by almost all popular systems.

1.4 Advantages & Disadvantages of SQL

SQL queries are declarative: you describe the data that you want. Then it’s up to SQL to determine the most efficient way to get that data. This approach to programming has several advantages:

  • You can work with very large data sets. Databases are typically stored on your computer’s hard disk, and only the results from queries need to fit in memory (and there are ways to work around even this limitation).
  • Queries are relatively fast compared to other data programming tools (such as R or Python). This is because the database system is free to choose the most efficient way to run the query.
  • Queries are reproducible. You can save, edit, and run queries at a later date.
  • SQL is less expressive and flexible than other data programming tools, so there are not too many different keywords to remember.

SQL also has a few disadvantages:

  • SQL is less expressive and flexible than other data programming tools. For instance, you can’t use SQL to make a plot.
  • Each RDBMS has its own dialect of SQL, so each time you work with a new database, you must familiarize yourself with which keywords are supported.
  • Setting up most database systems (other than SQLite) is a relatively complicated process.