1 Concepts
1.1 What is a Relational Database?
A relational database is a collection of tables (organized in rows and columns of data) that are related to each other in some way.
Database tables are analogous to CSV files, spreadsheets in Excel, or data frames in programming languages like R or Python.
Ideally each table can be connected to another table by a column that is present in both tables. That column may have different numbers of observations in each table, but the values will match up. This column is called a key. For example, a key commonly used on campus is your student or employee ID number.
Let’s look at an example dataset of fictional student data with data about courses, grades, and employment. Can we say anything about the relationship between course grades and employment based on this data?
Table: Student
ID | Name |
---|---|
123 | Jane Smith |
456 | Maria Martinez |
789 | Paul Jones |
Table: Courses
ID | Course | Grade |
---|---|---|
123 | Calculus | A- |
456 | Calculus | A |
789 | Calculus | C+ |
123 | Data Science | A- |
456 | Data Science | B |
789 | Data Science | B- |
Table: Employment
ID | Position | Employer | HoursPerWeek |
---|---|---|---|
123 | Student Assistant | University Research Lab | 5 |
456 | Customer Service | Alumni Center | 5 |
456 | Research Assistant | University Research Lab | 15 |
789 | Student Assistant | University Research Lab | 10 |
789 | Stock Room | Medical Supplier | 20 |
789 | Customer Service | Alumi Center | 15 |
1.2 What is SQL?
SQL stands for structured query language. SQL is a programming language that allows you to request (query) information from a database using a standard set of keywords. You can pronounce SQL as “ess cue ell” or “sequel”.
1.2.1 What kinds of questions can SQL answer?
SQL excels at extracting and combining information from large datasets. Some questions you might ask with SQL include:
- How many items are there in my data with a specific label?
- What are the unique values in a given column?
- Which records (rows) relate to a specific time period in my data?
1.3 What is a Relational Database Management System?
A relational database management system (RDBMS) is a software system that creates, updates, and manages relational databases as well as managing user’s access to the database. There are many different systems available. For instance, you may have heard of MySQL, Postgres, and Microsoft SQL Server.
For this workshop, we’ll use SQLite, which is a simple and widely-used RDBMS. It runs on Windows, MacOS X, and Linux with no setup necessary!
Every RDBMS has its own implementation or “dialect” of SQL. In other words, the set of SQL keywords supported differs slightly from one RDBMS to another, and sometimes queries have to be written differently, but the basics are the same. Details about the supported keywords for a given RDBMS can be found in that system’s documentation. How does this impact you working with any given SQL RDBMS? When you’re searching for help with syntax, include the version of SQL you’re working with. The keywords covered in this workshop are supported by most systems.
Some RDBMS allow you to add functions with extensions. For example, the PostGIS extension adds keywords to PostgreSQL to all you to work with location information to do spatial analysis.
1.4 Advantages & Disadvantages of SQL
SQL has major advantages in several areas important to researchers:
- Efficiency
- Write a few lines of code rather than lots of manual data manipulation
- SQL is meant for data manipulation
- Reproducibility
- save queries as a record of your workflow
- re-run code with updates
- Work with large amounts of data
- Typically faster to run a process in a database than in a spreadsheet
- Store lots of data (compare with Excel’s row limits)
- Data management
- One database file is the equivalent to many, many spreadsheet files (like csvs or xlsx files)
- Write a query instead of making a new files or tabs
What does SQL not do well?
- Most RDBMS do not visualize data, however, you can connect your database to visualization tools to perform these kinds of tasks seamlessly.
- The SQL language is designed for data querying not data analysis. If you want to run statistics on your data you can connect to your database from a programming language like R or python, or from statistical software.
- SQL assumes you work with tabular data. If your data is another type - for example graph data or tree data - you might want to explore other database types.