Data scientist

Post: https://towardsdatascience.com/find-your-first-job-as-a-data-scientist-81e4401fe5bf

A typical data science project:

Acknowledgement

Much material in this lecture is adpated from http://www2.stat.duke.edu/~cr173/Sta523_Fa17/sql.html and http://www2.stat.duke.edu/~cr173/Sta523_Fa17/bigish_data.html.

Why databases?

Size of data

Small data: those can fit into computer memory.
Bigish data: those can fit into disk(s) of a single machine.
Big data: those cannot fit into disk(s) of a single machine.

Computer architecture

Key to high performance is effective use of memory hierarchy. True on all architectures.

Numbers everyone should know

Operation	Time
L1 cache reference	0.5 ns
L2 cache reference	7 ns
Main memory reference	100 ns
Read 1 MB sequentially from memory	250,000 ns
Read 1 MB sequentially from SSD	1,000,000 ns
Read 1 MB sequentially from disk	20,000,000 ns

Source: https://gist.github.com/jboner/2841832

Implications for bigish data

Suppose we have a 10 GB flat data file and that we want to select certain rows based on a given criteria. This requires a sequential read across the entire data set.

If we can store the file in memory:
10 GB × (250 μs/1 MB) = 2.5 seconds

If we have to access the file from SSD (~1GB/sec):
10 GB × (1 ms/1 MB) = 10 seconds

If we have to access the file from disk:
10 GB × (20 ms/1 MB) = 200 seconds

This is just for reading data, if we make any modifications (writing) things are much worse.

Blocks

Cost: Disk << Memory

Speed: Disk <<< Memory

So usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, and what if we can’t fit everything into memory?

Create blocks - group rows based on similar attributes and read in multiple rows at a time. Optimal size will depend on the task and the properties of the disk.

Linear vs binary Search

Even with blocks, any kind of subsetting of rows requires a linear search, which requires \(O(N)\) accesses where \(N\) is the number of blocks.

We can do much better if we are careful about how we structure our data, specifically sorting some or all of the columns.

Sorting is expensive, \(O(N \log N)\), but it only needs to be done once.
After sorting, we can use a binary search for any subsetting tasks \(O(\log N)\).
These sorted columns are known as indexes.
Indexes require additional storage, but usually small enough to be kept in memory while blocks stay on disk.

Databases

SQL

Structured Query Language (SQL) is a special purpose language for interacting with (querying and modifying) these indexed tabular data structures.

ANSI Standard but with some dialect divergence.
SQL functionalities map very closely (but not exactly) with the data manipulation verbs present in dplyr.
We will see this mapping in more detail in a bit.

Access databases from R

dplyr package supports a variety of databases.
- Open source databases: SQLite, MySQL, PostgreSQL, BigQuery.
- Commercial databases: Oracle, Microsoft SQL Server.
- See link for a complete list.
DBI package provides a common interface for connecting to databases.
dbplyr package is the backend that translates dplyr verbs to database SQL queries.
To install database drivers, follow instructions at https://db.rstudio.com/best-practices/drivers/.

A sample session using SQLite

Create a SQLite database

Create a SQLite database employee.sqlite on disk (or in memory) for learning purpose.

library("DBI")
library("RSQLite")
con = dbConnect(RSQLite::SQLite(), "employee.sqlite")
# con = dbConnect(RSQLite::SQLite(), ":memory:")
str(con)

## Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
##   ..@ ptr                :<externalptr> 
##   ..@ dbname             : chr "/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/employee.sqlite"
##   ..@ loadable.extensions: logi TRUE
##   ..@ flags              : int 70
##   ..@ vfs                : chr ""
##   ..@ ref                :<environment: 0x7fc7efe74e68> 
##   ..@ bigint             : chr "integer64"

Add a table into database

First table:

library("tidyverse")

## ── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

(employees <- tibble(name   = c("Alice", "Bob", "Carol", "Dave", "Eve", "Frank"),
                     email  = c("alice@company.com", "bob@company.com",
                                "carol@company.com", "dave@company.com",
                                "eve@company.com",   "frank@comany.com"),
                     salary = c(52000, 40000, 30000, 33000, 44000, 37000),
                     dept   = c("Accounting", "Accounting", "Sales",
                                "Accounting", "Sales", "Sales")))

## # A tibble: 6 x 4
##   name  email             salary dept      
##   <chr> <chr>              <dbl> <chr>     
## 1 Alice alice@company.com  52000 Accounting
## 2 Bob   bob@company.com    40000 Accounting
## 3 Carol carol@company.com  30000 Sales     
## 4 Dave  dave@company.com   33000 Accounting
## 5 Eve   eve@company.com    44000 Sales     
## 6 Frank frank@comany.com   37000 Sales

Second table:

(phone <- tibble(name  = c("Bob", "Carol", "Eve", "Frank"),
                 phone = c("919 555-1111", "919 555-2222", "919 555-3333", "919 555-4444")))

## # A tibble: 4 x 2
##   name  phone       
##   <chr> <chr>       
## 1 Bob   919 555-1111
## 2 Carol 919 555-2222
## 3 Eve   919 555-3333
## 4 Frank 919 555-4444

Write tables to the database:

dbWriteTable(con, "employees", employees, overwrite = TRUE)
dbWriteTable(con, "phone", phone, overwrite = TRUE)
dbListTables(con)

## [1] "employees" "phone"

Add another table

dbWriteTable(con, "employs", employees)
dbListTables(con)

## [1] "employees" "employs"   "phone"

Remove a table from database

dbRemoveTable(con, "employs")
dbListTables(con)

## [1] "employees" "phone"

Querying tables

# select all columns from table employees
res <- dbSendQuery(con, "SELECT * FROM employees")
# execute the query
dbFetch(res)

##    name             email salary       dept
## 1 Alice alice@company.com  52000 Accounting
## 2   Bob   bob@company.com  40000 Accounting
## 3 Carol carol@company.com  30000      Sales
## 4  Dave  dave@company.com  33000 Accounting
## 5   Eve   eve@company.com  44000      Sales
## 6 Frank  frank@comany.com  37000      Sales

dbClearResult(res)

Closing the connection

dbDisconnect(con)

SQL Queries

Following we demonstrate some common SQL commands, although all task can be achieved by using dplyr as well.

Connecting

con <- dbConnect(RSQLite::SQLite(), dbname = "employee.sqlite")
dbListTables(con)

## [1] "employees" "phone"

knitr::opts_chunk$set(connection = "con")

Below we demonstrate some common SQL queries.

SELECT statements

SELECT * FROM employees;

6 records
name	email	salary	dept
Alice	alice@company.com	52000	Accounting
Bob	bob@company.com	40000	Accounting
Carol	carol@company.com	30000	Sales
Dave	dave@company.com	33000	Accounting
Eve	eve@company.com	44000	Sales
Frank	frank@comany.com	37000	Sales

SELECT * FROM phone;

4 records
name	phone
Bob	919 555-1111
Carol	919 555-2222
Eve	919 555-3333
Frank	919 555-4444

Select using SELECT

SELECT name AS first_name, salary FROM employees;

6 records
first_name	salary
Alice	52000
Bob	40000
Carol	30000
Dave	33000
Eve	44000
Frank	37000

Arrange using ORDER BY

SELECT name AS first_name, salary FROM employees ORDER BY salary;

6 records
first_name	salary
Carol	30000
Dave	33000
Frank	37000
Bob	40000
Eve	44000
Alice	52000

Descending order:

SELECT name AS first_name, salary FROM employees ORDER BY salary DESC;

6 records
first_name	salary
Alice	52000
Eve	44000
Bob	40000
Frank	37000
Dave	33000
Carol	30000

Filter via WHERE

SELECT * FROM employees WHERE salary < 40000;

3 records
name	email	salary	dept
Carol	carol@company.com	30000	Sales
Dave	dave@company.com	33000	Accounting
Frank	frank@comany.com	37000	Sales

Group_by via GROUP BY

First record in each group.

SELECT * FROM employees GROUP BY dept;

2 records
name	email	salary	dept
Alice	alice@company.com	52000	Accounting
Carol	carol@company.com	30000	Sales

Head via LIMIT

First 3 records:

SELECT * FROM employees LIMIT 3;

3 records
name	email	salary	dept
Alice	alice@company.com	52000	Accounting
Bob	bob@company.com	40000	Accounting
Carol	carol@company.com	30000	Sales

Last 3 records:

SELECT * FROM employees ORDER BY name DESC LIMIT 3;

3 records
name	email	salary	dept
Frank	frank@comany.com	37000	Sales
Eve	eve@company.com	44000	Sales
Dave	dave@company.com	33000	Accounting

Join two tables (default)

By default SQLite uses a CROSS JOIN (all row combinations) which is not terribly useful. Note only first 10 rows of results are shown.

SELECT * FROM employees JOIN phone;

Displaying records 1 - 10
name	email	salary	dept	name	phone
Alice	alice@company.com	52000	Accounting	Bob	919 555-1111
Alice	alice@company.com	52000	Accounting	Carol	919 555-2222
Alice	alice@company.com	52000	Accounting	Eve	919 555-3333
Alice	alice@company.com	52000	Accounting	Frank	919 555-4444
Bob	bob@company.com	40000	Accounting	Bob	919 555-1111
Bob	bob@company.com	40000	Accounting	Carol	919 555-2222
Bob	bob@company.com	40000	Accounting	Eve	919 555-3333
Bob	bob@company.com	40000	Accounting	Frank	919 555-4444
Carol	carol@company.com	30000	Sales	Bob	919 555-1111
Carol	carol@company.com	30000	Sales	Carol	919 555-2222

Inner join by NATURAL

By default, the common variable name is used as key.

SELECT * FROM employees NATURAL JOIN phone;

4 records
name	email	salary	dept	phone
Bob	bob@company.com	40000	Accounting	919 555-1111
Carol	carol@company.com	30000	Sales	919 555-2222
Eve	eve@company.com	44000	Sales	919 555-3333
Frank	frank@comany.com	37000	Sales	919 555-4444

Inner join - explicit

Explicitly specify key.

SELECT * FROM employees JOIN phone ON employees.name = phone.name;

4 records
name	email	salary	dept	name	phone
Bob	bob@company.com	40000	Accounting	Bob	919 555-1111
Carol	carol@company.com	30000	Sales	Carol	919 555-2222
Eve	eve@company.com	44000	Sales	Eve	919 555-3333
Frank	frank@comany.com	37000	Sales	Frank	919 555-4444

Left join - natural

SELECT * FROM employees NATURAL LEFT JOIN phone;

6 records
name	email	salary	dept	phone
Alice	alice@company.com	52000	Accounting	NA
Bob	bob@company.com	40000	Accounting	919 555-1111
Carol	carol@company.com	30000	Sales	919 555-2222
Dave	dave@company.com	33000	Accounting	NA
Eve	eve@company.com	44000	Sales	919 555-3333
Frank	frank@comany.com	37000	Sales	919 555-4444

Left join - explicit

SELECT * FROM employees LEFT JOIN phone ON employees.name = phone.name;

6 records
name	email	salary	dept	name	phone
Alice	alice@company.com	52000	Accounting	NA	NA
Bob	bob@company.com	40000	Accounting	Bob	919 555-1111
Carol	carol@company.com	30000	Sales	Carol	919 555-2222
Dave	dave@company.com	33000	Accounting	NA	NA
Eve	eve@company.com	44000	Sales	Eve	919 555-3333
Frank	frank@comany.com	37000	Sales	Frank	919 555-4444

Other joins

SQLite does not support directly an OUTER JOIN or a RIGHT JOIN.

Creating indices

CREATE INDEX index_name ON employees (name);

CREATE INDEX index_name_email ON employees (name, email);

sqlite3 employee.sqlite .indices

## index_name        index_name_email

Close connection

dbDisconnect(con)

Working with Databases - Part I

Biostat 203B

Dr. Hua Zhou @ UCLA

Feb 18, 2020