Post: https://towardsdatascience.com/find-your-first-job-as-a-data-scientist-81e4401fe5bf
Much material in this lecture is adpated from http://www2.stat.duke.edu/~cr173/Sta523_Fa17/sql.html and http://www2.stat.duke.edu/~cr173/Sta523_Fa17/bigish_data.html.
Small data: those can fit into computer memory.
Bigish data: those can fit into disk(s) of a single machine.
Big data: those cannot fit into disk(s) of a single machine.
Key to high performance is effective use of memory hierarchy. True on all architectures.
Operation | Time |
---|---|
L1 cache reference | 0.5 ns |
L2 cache reference | 7 ns |
Main memory reference | 100 ns |
Read 1 MB sequentially from memory | 250,000 ns |
Read 1 MB sequentially from SSD | 1,000,000 ns |
Read 1 MB sequentially from disk | 20,000,000 ns |
Suppose we have a 10 GB flat data file and that we want to select certain rows based on a given criteria. This requires a sequential read across the entire data set.
If we can store the file in memory:
10 GB × (250 μs/1 MB) = 2.5 seconds
If we have to access the file from SSD (~1GB/sec):
10 GB × (1 ms/1 MB) = 10 seconds
If we have to access the file from disk:
10 GB × (20 ms/1 MB) = 200 seconds
This is just for reading data, if we make any modifications (writing) things are much worse.
Cost: Disk << Memory
Speed: Disk <<< Memory
So usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, and what if we can’t fit everything into memory?
Create blocks - group rows based on similar attributes and read in multiple rows at a time. Optimal size will depend on the task and the properties of the disk.
Even with blocks, any kind of subsetting of rows requires a linear search, which requires \(O(N)\) accesses where \(N\) is the number of blocks.
We can do much better if we are careful about how we structure our data, specifically sorting some or all of the columns.
Sorting is expensive, \(O(N \log N)\), but it only needs to be done once.
After sorting, we can use a binary search for any subsetting tasks \(O(\log N)\).
These sorted columns are known as indexes.
Indexes require additional storage, but usually small enough to be kept in memory while blocks stay on disk.
Structured Query Language (SQL) is a special purpose language for interacting with (querying and modifying) these indexed tabular data structures.
ANSI Standard but with some dialect divergence.
SQL functionalities map very closely (but not exactly) with the data manipulation verbs present in dplyr.
We will see this mapping in more detail in a bit.
DBI package provides a common interface for connecting to databases.
dbplyr package is the backend that translates dplyr verbs to database SQL queries.
To install database drivers, follow instructions at https://db.rstudio.com/best-practices/drivers/.
Create a SQLite database employee.sqlite
on disk (or in memory) for learning purpose.
library("DBI")
library("RSQLite")
con = dbConnect(RSQLite::SQLite(), "employee.sqlite")
# con = dbConnect(RSQLite::SQLite(), ":memory:")
str(con)
## Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
## ..@ ptr :<externalptr>
## ..@ dbname : chr "/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/employee.sqlite"
## ..@ loadable.extensions: logi TRUE
## ..@ flags : int 70
## ..@ vfs : chr ""
## ..@ ref :<environment: 0x7fc7efe74e68>
## ..@ bigint : chr "integer64"
First table:
library("tidyverse")
## ── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
(employees <- tibble(name = c("Alice", "Bob", "Carol", "Dave", "Eve", "Frank"),
email = c("alice@company.com", "bob@company.com",
"carol@company.com", "dave@company.com",
"eve@company.com", "frank@comany.com"),
salary = c(52000, 40000, 30000, 33000, 44000, 37000),
dept = c("Accounting", "Accounting", "Sales",
"Accounting", "Sales", "Sales")))
## # A tibble: 6 x 4
## name email salary dept
## <chr> <chr> <dbl> <chr>
## 1 Alice alice@company.com 52000 Accounting
## 2 Bob bob@company.com 40000 Accounting
## 3 Carol carol@company.com 30000 Sales
## 4 Dave dave@company.com 33000 Accounting
## 5 Eve eve@company.com 44000 Sales
## 6 Frank frank@comany.com 37000 Sales
Second table:
(phone <- tibble(name = c("Bob", "Carol", "Eve", "Frank"),
phone = c("919 555-1111", "919 555-2222", "919 555-3333", "919 555-4444")))
## # A tibble: 4 x 2
## name phone
## <chr> <chr>
## 1 Bob 919 555-1111
## 2 Carol 919 555-2222
## 3 Eve 919 555-3333
## 4 Frank 919 555-4444
Write tables to the database:
dbWriteTable(con, "employees", employees, overwrite = TRUE)
dbWriteTable(con, "phone", phone, overwrite = TRUE)
dbListTables(con)
## [1] "employees" "phone"
dbWriteTable(con, "employs", employees)
dbListTables(con)
## [1] "employees" "employs" "phone"
dbRemoveTable(con, "employs")
dbListTables(con)
## [1] "employees" "phone"
# select all columns from table employees
res <- dbSendQuery(con, "SELECT * FROM employees")
# execute the query
dbFetch(res)
## name email salary dept
## 1 Alice alice@company.com 52000 Accounting
## 2 Bob bob@company.com 40000 Accounting
## 3 Carol carol@company.com 30000 Sales
## 4 Dave dave@company.com 33000 Accounting
## 5 Eve eve@company.com 44000 Sales
## 6 Frank frank@comany.com 37000 Sales
dbClearResult(res)
dbDisconnect(con)
Following we demonstrate some common SQL commands, although all task can be achieved by using dplyr as well.
con <- dbConnect(RSQLite::SQLite(), dbname = "employee.sqlite")
dbListTables(con)
## [1] "employees" "phone"
knitr::opts_chunk$set(connection = "con")
Below we demonstrate some common SQL queries.
SELECT * FROM employees;
name | salary | dept | |
---|---|---|---|
Alice | alice@company.com | 52000 | Accounting |
Bob | bob@company.com | 40000 | Accounting |
Carol | carol@company.com | 30000 | Sales |
Dave | dave@company.com | 33000 | Accounting |
Eve | eve@company.com | 44000 | Sales |
Frank | frank@comany.com | 37000 | Sales |
SELECT * FROM phone;
name | phone |
---|---|
Bob | 919 555-1111 |
Carol | 919 555-2222 |
Eve | 919 555-3333 |
Frank | 919 555-4444 |
SELECT name AS first_name, salary FROM employees;
first_name | salary |
---|---|
Alice | 52000 |
Bob | 40000 |
Carol | 30000 |
Dave | 33000 |
Eve | 44000 |
Frank | 37000 |
SELECT name AS first_name, salary FROM employees ORDER BY salary;
first_name | salary |
---|---|
Carol | 30000 |
Dave | 33000 |
Frank | 37000 |
Bob | 40000 |
Eve | 44000 |
Alice | 52000 |
Descending order:
SELECT name AS first_name, salary FROM employees ORDER BY salary DESC;
first_name | salary |
---|---|
Alice | 52000 |
Eve | 44000 |
Bob | 40000 |
Frank | 37000 |
Dave | 33000 |
Carol | 30000 |
SELECT * FROM employees WHERE salary < 40000;
name | salary | dept | |
---|---|---|---|
Carol | carol@company.com | 30000 | Sales |
Dave | dave@company.com | 33000 | Accounting |
Frank | frank@comany.com | 37000 | Sales |
First record in each group.
SELECT * FROM employees GROUP BY dept;
name | salary | dept | |
---|---|---|---|
Alice | alice@company.com | 52000 | Accounting |
Carol | carol@company.com | 30000 | Sales |
First 3 records:
SELECT * FROM employees LIMIT 3;
name | salary | dept | |
---|---|---|---|
Alice | alice@company.com | 52000 | Accounting |
Bob | bob@company.com | 40000 | Accounting |
Carol | carol@company.com | 30000 | Sales |
Last 3 records:
SELECT * FROM employees ORDER BY name DESC LIMIT 3;
name | salary | dept | |
---|---|---|---|
Frank | frank@comany.com | 37000 | Sales |
Eve | eve@company.com | 44000 | Sales |
Dave | dave@company.com | 33000 | Accounting |
By default SQLite uses a CROSS JOIN
(all row combinations) which is not terribly useful. Note only first 10 rows of results are shown.
SELECT * FROM employees JOIN phone;
name | salary | dept | name | phone | |
---|---|---|---|---|---|
Alice | alice@company.com | 52000 | Accounting | Bob | 919 555-1111 |
Alice | alice@company.com | 52000 | Accounting | Carol | 919 555-2222 |
Alice | alice@company.com | 52000 | Accounting | Eve | 919 555-3333 |
Alice | alice@company.com | 52000 | Accounting | Frank | 919 555-4444 |
Bob | bob@company.com | 40000 | Accounting | Bob | 919 555-1111 |
Bob | bob@company.com | 40000 | Accounting | Carol | 919 555-2222 |
Bob | bob@company.com | 40000 | Accounting | Eve | 919 555-3333 |
Bob | bob@company.com | 40000 | Accounting | Frank | 919 555-4444 |
Carol | carol@company.com | 30000 | Sales | Bob | 919 555-1111 |
Carol | carol@company.com | 30000 | Sales | Carol | 919 555-2222 |
By default, the common variable name
is used as key.
SELECT * FROM employees NATURAL JOIN phone;
name | salary | dept | phone | |
---|---|---|---|---|
Bob | bob@company.com | 40000 | Accounting | 919 555-1111 |
Carol | carol@company.com | 30000 | Sales | 919 555-2222 |
Eve | eve@company.com | 44000 | Sales | 919 555-3333 |
Frank | frank@comany.com | 37000 | Sales | 919 555-4444 |
Explicitly specify key.
SELECT * FROM employees JOIN phone ON employees.name = phone.name;
name | salary | dept | name | phone | |
---|---|---|---|---|---|
Bob | bob@company.com | 40000 | Accounting | Bob | 919 555-1111 |
Carol | carol@company.com | 30000 | Sales | Carol | 919 555-2222 |
Eve | eve@company.com | 44000 | Sales | Eve | 919 555-3333 |
Frank | frank@comany.com | 37000 | Sales | Frank | 919 555-4444 |
SELECT * FROM employees NATURAL LEFT JOIN phone;
name | salary | dept | phone | |
---|---|---|---|---|
Alice | alice@company.com | 52000 | Accounting | NA |
Bob | bob@company.com | 40000 | Accounting | 919 555-1111 |
Carol | carol@company.com | 30000 | Sales | 919 555-2222 |
Dave | dave@company.com | 33000 | Accounting | NA |
Eve | eve@company.com | 44000 | Sales | 919 555-3333 |
Frank | frank@comany.com | 37000 | Sales | 919 555-4444 |
SELECT * FROM employees LEFT JOIN phone ON employees.name = phone.name;
name | salary | dept | name | phone | |
---|---|---|---|---|---|
Alice | alice@company.com | 52000 | Accounting | NA | NA |
Bob | bob@company.com | 40000 | Accounting | Bob | 919 555-1111 |
Carol | carol@company.com | 30000 | Sales | Carol | 919 555-2222 |
Dave | dave@company.com | 33000 | Accounting | NA | NA |
Eve | eve@company.com | 44000 | Sales | Eve | 919 555-3333 |
Frank | frank@comany.com | 37000 | Sales | Frank | 919 555-4444 |
SQLite does not support directly an OUTER JOIN
or a RIGHT JOIN
.
CREATE INDEX index_name ON employees (name);
CREATE INDEX index_name_email ON employees (name, email);
sqlite3 employee.sqlite .indices
## index_name index_name_email
dbDisconnect(con)