Data scientist

Post: https://towardsdatascience.com/find-your-first-job-as-a-data-scientist-81e4401fe5bf

A typical data science project:

Why databases?

Size of data

  • Small data: those can fit into computer memory.

  • Bigish data: those can fit into disk(s) of a single machine.

  • Big data: those cannot fit into disk(s) of a single machine.

Computer architecture

Key to high performance is effective use of memory hierarchy. True on all architectures.

Numbers everyone should know

Operation Time
L1 cache reference 0.5 ns
L2 cache reference 7 ns
Main memory reference 100 ns
Read 1 MB sequentially from memory 250,000 ns
Read 1 MB sequentially from SSD 1,000,000 ns
Read 1 MB sequentially from disk 20,000,000 ns

Source: https://gist.github.com/jboner/2841832

Implications for bigish data

Suppose we have a 10 GB flat data file and that we want to select certain rows based on a given criteria. This requires a sequential read across the entire data set.

If we can store the file in memory:
10 GB × (250 μs/1 MB) = 2.5 seconds

If we have to access the file from SSD (~1GB/sec):
10 GB × (1 ms/1 MB) = 10 seconds

If we have to access the file from disk:
10 GB × (20 ms/1 MB) = 200 seconds

This is just for reading data, if we make any modifications (writing) things are much worse.

Blocks

Cost: Disk << Memory

Speed: Disk <<< Memory

So usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, and what if we can’t fit everything into memory?

Create blocks - group rows based on similar attributes and read in multiple rows at a time. Optimal size will depend on the task and the properties of the disk.

Databases

SQL

Structured Query Language (SQL) is a special purpose language for interacting with (querying and modifying) these indexed tabular data structures.

  • ANSI Standard but with some dialect divergence.

  • SQL functionalities map very closely (but not exactly) with the data manipulation verbs present in dplyr.

  • We will see this mapping in more detail in a bit.

Access databases from R

  • dplyr package supports a variety of databases.
    • Open source databases: SQLite, MySQL, PostgreSQL, BigQuery.
    • Commercial databases: Oracle, Microsoft SQL Server.
    • See link for a complete list.
  • DBI package provides a common interface for connecting to databases.

  • dbplyr package is the backend that translates dplyr verbs to database SQL queries.

  • To install database drivers, follow instructions at https://db.rstudio.com/best-practices/drivers/.

A sample session using SQLite

Create a SQLite database

Create a SQLite database employee.sqlite on disk (or in memory) for learning purpose.

library("DBI")
library("RSQLite")
con = dbConnect(RSQLite::SQLite(), "employee.sqlite")
# con = dbConnect(RSQLite::SQLite(), ":memory:")
str(con)
## Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
##   ..@ ptr                :<externalptr> 
##   ..@ dbname             : chr "/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/employee.sqlite"
##   ..@ loadable.extensions: logi TRUE
##   ..@ flags              : int 70
##   ..@ vfs                : chr ""
##   ..@ ref                :<environment: 0x7fc7efe74e68> 
##   ..@ bigint             : chr "integer64"

Add a table into database

First table:

library("tidyverse")
## ── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
(employees <- tibble(name   = c("Alice", "Bob", "Carol", "Dave", "Eve", "Frank"),
                     email  = c("alice@company.com", "bob@company.com",
                                "carol@company.com", "dave@company.com",
                                "eve@company.com",   "frank@comany.com"),
                     salary = c(52000, 40000, 30000, 33000, 44000, 37000),
                     dept   = c("Accounting", "Accounting", "Sales",
                                "Accounting", "Sales", "Sales")))
## # A tibble: 6 x 4
##   name  email             salary dept      
##   <chr> <chr>              <dbl> <chr>     
## 1 Alice alice@company.com  52000 Accounting
## 2 Bob   bob@company.com    40000 Accounting
## 3 Carol carol@company.com  30000 Sales     
## 4 Dave  dave@company.com   33000 Accounting
## 5 Eve   eve@company.com    44000 Sales     
## 6 Frank frank@comany.com   37000 Sales

Second table:

(phone <- tibble(name  = c("Bob", "Carol", "Eve", "Frank"),
                 phone = c("919 555-1111", "919 555-2222", "919 555-3333", "919 555-4444")))
## # A tibble: 4 x 2
##   name  phone       
##   <chr> <chr>       
## 1 Bob   919 555-1111
## 2 Carol 919 555-2222
## 3 Eve   919 555-3333
## 4 Frank 919 555-4444

Write tables to the database:

dbWriteTable(con, "employees", employees, overwrite = TRUE)
dbWriteTable(con, "phone", phone, overwrite = TRUE)
dbListTables(con)
## [1] "employees" "phone"

Add another table

dbWriteTable(con, "employs", employees)
dbListTables(con)
## [1] "employees" "employs"   "phone"

Remove a table from database

dbRemoveTable(con, "employs")
dbListTables(con)
## [1] "employees" "phone"

Querying tables

# select all columns from table employees
res <- dbSendQuery(con, "SELECT * FROM employees")
# execute the query
dbFetch(res)
##    name             email salary       dept
## 1 Alice alice@company.com  52000 Accounting
## 2   Bob   bob@company.com  40000 Accounting
## 3 Carol carol@company.com  30000      Sales
## 4  Dave  dave@company.com  33000 Accounting
## 5   Eve   eve@company.com  44000      Sales
## 6 Frank  frank@comany.com  37000      Sales
dbClearResult(res)

Closing the connection

dbDisconnect(con)

SQL Queries

Following we demonstrate some common SQL commands, although all task can be achieved by using dplyr as well.

Connecting

con <- dbConnect(RSQLite::SQLite(), dbname = "employee.sqlite")
dbListTables(con)
## [1] "employees" "phone"
knitr::opts_chunk$set(connection = "con")

Below we demonstrate some common SQL queries.

SELECT statements

SELECT * FROM employees;
6 records
name email salary dept
Alice 52000 Accounting
Bob 40000 Accounting
Carol 30000 Sales
Dave 33000 Accounting
Eve 44000 Sales
Frank 37000 Sales
SELECT * FROM phone;
4 records
name phone
Bob 919 555-1111
Carol 919 555-2222
Eve 919 555-3333
Frank 919 555-4444

Select using SELECT

SELECT name AS first_name, salary FROM employees;
6 records
first_name salary
Alice 52000
Bob 40000
Carol 30000
Dave 33000
Eve 44000
Frank 37000

Arrange using ORDER BY

SELECT name AS first_name, salary FROM employees ORDER BY salary;
6 records
first_name salary
Carol 30000
Dave 33000
Frank 37000
Bob 40000
Eve 44000
Alice 52000

Descending order:

SELECT name AS first_name, salary FROM employees ORDER BY salary DESC;
6 records
first_name salary
Alice 52000
Eve 44000
Bob 40000
Frank 37000
Dave 33000
Carol 30000

Filter via WHERE

SELECT * FROM employees WHERE salary < 40000;
3 records
name email salary dept
Carol 30000 Sales
Dave 33000 Accounting
Frank 37000 Sales

Group_by via GROUP BY

First record in each group.

SELECT * FROM employees GROUP BY dept;
2 records
name email salary dept
Alice 52000 Accounting
Carol 30000 Sales

Head via LIMIT

First 3 records:

SELECT * FROM employees LIMIT 3;
3 records
name email salary dept
Alice 52000 Accounting
Bob 40000 Accounting
Carol 30000 Sales

Last 3 records:

SELECT * FROM employees ORDER BY name DESC LIMIT 3;
3 records
name email salary dept
Frank 37000 Sales
Eve 44000 Sales
Dave 33000 Accounting

Join two tables (default)

By default SQLite uses a CROSS JOIN (all row combinations) which is not terribly useful. Note only first 10 rows of results are shown.

SELECT * FROM employees JOIN phone;
Displaying records 1 - 10
name email salary dept name phone
Alice 52000 Accounting Bob 919 555-1111
Alice 52000 Accounting Carol 919 555-2222
Alice 52000 Accounting Eve 919 555-3333
Alice 52000 Accounting Frank 919 555-4444
Bob 40000 Accounting Bob 919 555-1111
Bob 40000 Accounting Carol 919 555-2222
Bob 40000 Accounting Eve 919 555-3333
Bob 40000 Accounting Frank 919 555-4444
Carol 30000 Sales Bob 919 555-1111
Carol 30000 Sales Carol 919 555-2222

Inner join by NATURAL

By default, the common variable name is used as key.

SELECT * FROM employees NATURAL JOIN phone;
4 records
name email salary dept phone
Bob 40000 Accounting 919 555-1111
Carol 30000 Sales 919 555-2222
Eve 44000 Sales 919 555-3333
Frank 37000 Sales 919 555-4444

Inner join - explicit

Explicitly specify key.

SELECT * FROM employees JOIN phone ON employees.name = phone.name;
4 records
name email salary dept name phone
Bob 40000 Accounting Bob 919 555-1111
Carol 30000 Sales Carol 919 555-2222
Eve 44000 Sales Eve 919 555-3333
Frank 37000 Sales Frank 919 555-4444

Left join - natural

SELECT * FROM employees NATURAL LEFT JOIN phone;
6 records
name email salary dept phone
Alice 52000 Accounting NA
Bob 40000 Accounting 919 555-1111
Carol 30000 Sales 919 555-2222
Dave 33000 Accounting NA
Eve 44000 Sales 919 555-3333
Frank 37000 Sales 919 555-4444

Left join - explicit

SELECT * FROM employees LEFT JOIN phone ON employees.name = phone.name;
6 records
name email salary dept name phone
Alice 52000 Accounting NA NA
Bob 40000 Accounting Bob 919 555-1111
Carol 30000 Sales Carol 919 555-2222
Dave 33000 Accounting NA NA
Eve 44000 Sales Eve 919 555-3333
Frank 37000 Sales Frank 919 555-4444

Other joins

SQLite does not support directly an OUTER JOIN or a RIGHT JOIN.

Creating indices

CREATE INDEX index_name ON employees (name);
CREATE INDEX index_name_email ON employees (name, email);
sqlite3 employee.sqlite .indices
## index_name        index_name_email

Close connection

dbDisconnect(con)