Introduction

In this lecture we will learn to:

- deposit the MIMIC-III CSV files into a SQLite database using DBI, RSQLite, and data.table
- query the database with dplyr/dbplyr and inspect the SQL it generates
- understand lazy evaluation and use collect() to bring results into R
- plot summaries computed in the database

Machine information

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.2  magrittr_1.5    tools_3.6.2     htmltools_0.4.0
##  [5] yaml_2.2.1      Rcpp_1.0.3      stringi_1.4.6   rmarkdown_2.1  
##  [9] knitr_1.28      stringr_1.4.0   xfun_0.12       digest_0.6.24  
## [13] rlang_0.4.4     evaluate_0.14
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

CSV files

The /home/203bdata/mimic-iii folder on the teaching server contains the MIMIC-III data. On my Mac, it’s at /Users/huazhou/Google Drive/mimic/mimic-iii.

# On teaching server
ls -l /home/203bdata/mimic-iii
# On my MacBook
ls -l /Users/huazhou/Google\ Drive/mimic/mimic-iii
## total 22333688
## -rw-------  1 huazhou  staff    12548562 Sep  2  2016 ADMISSIONS.csv
## -rw-------  1 huazhou  staff     6339185 Sep  2  2016 CALLOUT.csv
## -rw-------  1 huazhou  staff      203492 Sep  2  2016 CAREGIVERS.csv
## -rw-------  1 huazhou  staff    85204883 Nov 30  2018 CHARTEVENTS.csv
## -rw-------  1 huazhou  staff    58150883 Sep  2  2016 CPTEVENTS.csv
## -rw-------  1 huazhou  staff   525785298 Sep  2  2016 DATETIMEEVENTS.csv
## -rw-------  1 huazhou  staff    19137527 Oct 31  2018 DIAGNOSES_ICD.csv
## -rw-------  1 huazhou  staff    10487132 Sep  2  2016 DRGCODES.csv
## -rw-------  1 huazhou  staff       13807 Sep  2  2016 D_CPT.csv
## -rw-------  1 huazhou  staff     1387562 Sep  2  2016 D_ICD_DIAGNOSES.csv
## -rw-------  1 huazhou  staff      311466 Sep  2  2016 D_ICD_PROCEDURES.csv
## -rw-------  1 huazhou  staff      954420 Sep  2  2016 D_ITEMS.csv
## -rw-------  1 huazhou  staff       43118 Sep  2  2016 D_LABITEMS.csv
## -rw-------  1 huazhou  staff     6357077 Sep  2  2016 ICUSTAYS.csv
## -rw-------  1 huazhou  staff  2464296511 Sep  2  2016 INPUTEVENTS_CV.csv
## -rw-------  1 huazhou  staff   975255812 Sep  2  2016 INPUTEVENTS_MV.csv
## -rw-r--r--@ 1 huazhou  staff           0 Jan 13 14:43 Icon
## -rw-------  1 huazhou  staff  1854245647 Nov 11  2016 LABEVENTS.csv
## -rw-------  1 huazhou  staff    72507810 Oct 31  2018 MICROBIOLOGYEVENTS.csv
## -rw-------  1 huazhou  staff  4007717810 Sep 30  2016 NOTEEVENTS.csv
## -rw-------  1 huazhou  staff   396406750 Sep  2  2016 OUTPUTEVENTS.csv
## -rw-------  1 huazhou  staff     2628900 Sep  2  2016 PATIENTS.csv
## -rw-------  1 huazhou  staff   770336136 Sep  2  2016 PRESCRIPTIONS.csv
## -rw-------  1 huazhou  staff    48770424 Sep  2  2016 PROCEDUREEVENTS_MV.csv
## -rw-------  1 huazhou  staff     6798492 Sep  2  2016 PROCEDURES_ICD.csv
## -rw-------  1 huazhou  staff     3481645 Sep  2  2016 SERVICES.csv
## -rw-------  1 huazhou  staff    25057095 Sep  2  2016 TRANSFERS.csv
## -rw-------  1 huazhou  staff       36517 Oct 31  2018 postgres_add_indexes.sql
## -rw-------  1 huazhou  staff        4195 Oct 31  2018 postgres_checks.sql
## -rw-------  1 huazhou  staff       20688 Oct  4  2018 postgres_create_tables.sql
## -rw-------  1 huazhou  staff        6897 Oct  4  2018 postgres_load_data.sql
datafolder <- ifelse(Sys.info()[["sysname"]] == "Linux",
                     "/home/203bdata/mimic-iii",
                     "/Users/huazhou/Google Drive/mimic/mimic-iii")
list.files(datafolder, full.names = TRUE, pattern = "\\.csv$")
##  [1] "/Users/huazhou/Google Drive/mimic/mimic-iii/ADMISSIONS.csv"        
##  [2] "/Users/huazhou/Google Drive/mimic/mimic-iii/CALLOUT.csv"           
##  [3] "/Users/huazhou/Google Drive/mimic/mimic-iii/CAREGIVERS.csv"        
##  [4] "/Users/huazhou/Google Drive/mimic/mimic-iii/CHARTEVENTS.csv"       
##  [5] "/Users/huazhou/Google Drive/mimic/mimic-iii/CPTEVENTS.csv"         
##  [6] "/Users/huazhou/Google Drive/mimic/mimic-iii/D_CPT.csv"             
##  [7] "/Users/huazhou/Google Drive/mimic/mimic-iii/D_ICD_DIAGNOSES.csv"   
##  [8] "/Users/huazhou/Google Drive/mimic/mimic-iii/D_ICD_PROCEDURES.csv"  
##  [9] "/Users/huazhou/Google Drive/mimic/mimic-iii/D_ITEMS.csv"           
## [10] "/Users/huazhou/Google Drive/mimic/mimic-iii/D_LABITEMS.csv"        
## [11] "/Users/huazhou/Google Drive/mimic/mimic-iii/DATETIMEEVENTS.csv"    
## [12] "/Users/huazhou/Google Drive/mimic/mimic-iii/DIAGNOSES_ICD.csv"     
## [13] "/Users/huazhou/Google Drive/mimic/mimic-iii/DRGCODES.csv"          
## [14] "/Users/huazhou/Google Drive/mimic/mimic-iii/ICUSTAYS.csv"          
## [15] "/Users/huazhou/Google Drive/mimic/mimic-iii/INPUTEVENTS_CV.csv"    
## [16] "/Users/huazhou/Google Drive/mimic/mimic-iii/INPUTEVENTS_MV.csv"    
## [17] "/Users/huazhou/Google Drive/mimic/mimic-iii/LABEVENTS.csv"         
## [18] "/Users/huazhou/Google Drive/mimic/mimic-iii/MICROBIOLOGYEVENTS.csv"
## [19] "/Users/huazhou/Google Drive/mimic/mimic-iii/NOTEEVENTS.csv"        
## [20] "/Users/huazhou/Google Drive/mimic/mimic-iii/OUTPUTEVENTS.csv"      
## [21] "/Users/huazhou/Google Drive/mimic/mimic-iii/PATIENTS.csv"          
## [22] "/Users/huazhou/Google Drive/mimic/mimic-iii/PRESCRIPTIONS.csv"     
## [23] "/Users/huazhou/Google Drive/mimic/mimic-iii/PROCEDUREEVENTS_MV.csv"
## [24] "/Users/huazhou/Google Drive/mimic/mimic-iii/PROCEDURES_ICD.csv"    
## [25] "/Users/huazhou/Google Drive/mimic/mimic-iii/SERVICES.csv"          
## [26] "/Users/huazhou/Google Drive/mimic/mimic-iii/TRANSFERS.csv"

Read csv file: read.csv() vs read_csv() (tidyverse) vs data.table

First, let's compare different approaches for reading CSV files: base R's read.csv(), readr's read_csv() (tidyverse), and data.table's fread().
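
A minimal timing sketch (not evaluated in these notes) comparing the three readers on one of the smaller tables, ADMISSIONS.csv; the result variable names are just placeholders, and exact timings depend on the machine, but fread() is typically the fastest:

library(data.table)
csvfile <- str_c(datafolder, "/ADMISSIONS.csv")
system.time(admissions_base <- read.csv(csvfile))   # base R
system.time(admissions_tbl  <- read_csv(csvfile))   # readr (tidyverse)
system.time(admissions_dt   <- fread(csvfile))      # data.table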

Create a SQLite database file called mimiciii.sqlite.

library("DBI")
library("RSQLite")
con <- dbConnect(RSQLite::SQLite(), "mimiciii.sqlite")
str(con)
## Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
##   ..@ ptr                :<externalptr> 
##   ..@ dbname             : chr "/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/mimiciii.sqlite"
##   ..@ loadable.extensions: logi TRUE
##   ..@ flags              : int 70
##   ..@ vfs                : chr ""
##   ..@ ref                :<environment: 0x7ff315e4fd18> 
##   ..@ bigint             : chr "integer64"

Because of the excellent efficiency of the data.table package, we use its fread() to read all the CSV files and deposit them into mimiciii.sqlite. Do not run this chunk yourself on the teaching server; it takes a couple of minutes.

library(data.table)   # provides fread()
datafolder
csv_names <- list.files(datafolder)
csv_names <- csv_names[str_detect(csv_names, "\\.csv$")]
csv_names
system.time({
  for (table_name in csv_names) {
    csv_name <- str_c(datafolder, "/", table_name)
    tmp <- fread(csv_name)                                # fast read of one CSV
    table_name <- str_remove(table_name, "\\.csv$")       # table name without extension
    dbWriteTable(con, table_name, tmp, overwrite = TRUE)  # write the table into SQLite
  }
})
# list tables in database
dbListTables(con)
##  [1] "ADMISSIONS"         "CALLOUT"            "CAREGIVERS"        
##  [4] "CHARTEVENTS"        "CPTEVENTS"          "DATETIMEEVENTS"    
##  [7] "DIAGNOSES_ICD"      "DRGCODES"           "D_CPT"             
## [10] "D_ICD_DIAGNOSES"    "D_ICD_PROCEDURES"   "D_ITEMS"           
## [13] "D_LABITEMS"         "ICUSTAYS"           "INPUTEVENTS_CV"    
## [16] "INPUTEVENTS_MV"     "LABEVENTS"          "MICROBIOLOGYEVENTS"
## [19] "NOTEEVENTS"         "OUTPUTEVENTS"       "PATIENTS"          
## [22] "PRESCRIPTIONS"      "PROCEDUREEVENTS_MV" "PROCEDURES_ICD"    
## [25] "SERVICES"           "TRANSFERS"
# disconnect from database
dbDisconnect(con)

The resulting database file mimiciii.sqlite is about 10.4 GB.

ls -l mimiciii.sqlite
## -rw-r--r--  1 huazhou  staff  11160363008 Feb 18 11:20 mimiciii.sqlite
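
Equivalently, we can check the size from within R; base R's file.size() returns the size in bytes:

file.size("mimiciii.sqlite")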

Read data from database

Connect to the database mimiciii.sqlite and list the tables:

con <- dbConnect(RSQLite::SQLite(), dbname = "./mimiciii.sqlite")
dbListTables(con)
##  [1] "ADMISSIONS"         "CALLOUT"            "CAREGIVERS"        
##  [4] "CHARTEVENTS"        "CPTEVENTS"          "DATETIMEEVENTS"    
##  [7] "DIAGNOSES_ICD"      "DRGCODES"           "D_CPT"             
## [10] "D_ICD_DIAGNOSES"    "D_ICD_PROCEDURES"   "D_ITEMS"           
## [13] "D_LABITEMS"         "ICUSTAYS"           "INPUTEVENTS_CV"    
## [16] "INPUTEVENTS_MV"     "LABEVENTS"          "MICROBIOLOGYEVENTS"
## [19] "NOTEEVENTS"         "OUTPUTEVENTS"       "PATIENTS"          
## [22] "PRESCRIPTIONS"      "PROCEDUREEVENTS_MV" "PROCEDURES_ICD"    
## [25] "SERVICES"           "TRANSFERS"

Create a reference to the table CHARTEVENTS (this does not read the data into memory):

chartevents <- tbl(con, "CHARTEVENTS")
class(chartevents)
## [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
## [4] "tbl_lazy"             "tbl"
chartevents %>% print(width = Inf)
## # Source:   table<CHARTEVENTS> [?? x 15]
## # Database: sqlite 3.30.1
## #   [/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/mimiciii.sqlite]
##    ROW_ID SUBJECT_ID HADM_ID ICUSTAY_ID ITEMID CHARTTIME     STORETIME      CGID
##     <int>      <int>   <int>      <int>  <int> <chr>         <chr>         <int>
##  1    788         36  165660     241249 223834 5/12/34 12:00 5/12/34 13:56 17525
##  2    789         36  165660     241249 223835 5/12/34 12:00 5/12/34 13:56 17525
##  3    790         36  165660     241249 224328 5/12/34 12:00 5/12/34 12:18 20823
##  4    791         36  165660     241249 224329 5/12/34 12:00 5/12/34 12:19 20823
##  5    792         36  165660     241249 224330 5/12/34 12:00 5/12/34 12:19 20823
##  6    793         36  165660     241249 224331 5/12/34 12:00 5/12/34 12:19 20823
##  7    794         36  165660     241249 224332 5/12/34 12:00 5/12/34 14:44 17525
##  8    795         36  165660     241249 224663 5/12/34 12:00 5/12/34 14:44 17525
##  9    796         36  165660     241249 224665 5/12/34 12:00 5/12/34 14:44 17525
## 10    797         36  165660     241249 220224 5/12/34 12:35 5/12/34 12:38 20889
##     VALUE VALUENUM VALUEUOM WARNING ERROR RESULTSTATUS STOPPED
##     <dbl>    <dbl> <chr>      <int> <int>        <int>   <int>
##  1  15       15    "L/min"        0     0           NA      NA
##  2 100      100    ""             0     0           NA      NA
##  3   0.37     0.37 ""             0     0           NA      NA
##  4   6        6    "min"          0     0           NA      NA
##  5   2.5      2.5  ""             0     0           NA      NA
##  6   0        0    "ml/hr"        0     0           NA      NA
##  7   3        3    ""             0     0           NA      NA
##  8   8        8    ""             0     0           NA      NA
##  9   1.11     1.11 ""             0     0           NA      NA
## 10  58       58    "mmHg"         1     0           NA      NA
## # … with more rows

Use dplyr with SQLite

Retrieve the chart events with SUBJECT_ID between 23 and 165:

chartevents_subset <- chartevents %>%
  select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
  filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165)
chartevents_subset
## # Source:   lazy query [?? x 4]
## # Database: sqlite 3.30.1
## #   [/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/mimiciii.sqlite]
##    ROW_ID SUBJECT_ID HADM_ID ITEMID
##     <int>      <int>   <int>  <int>
##  1    788         36  165660 223834
##  2    789         36  165660 223835
##  3    790         36  165660 224328
##  4    791         36  165660 224329
##  5    792         36  165660 224330
##  6    793         36  165660 224331
##  7    794         36  165660 224332
##  8    795         36  165660 224663
##  9    796         36  165660 224665
## 10    797         36  165660 220224
## # … with more rows

SQL query

class(chartevents_subset)
## [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
## [4] "tbl_lazy"             "tbl"
show_query(chartevents_subset)
## <SQL>
## SELECT *
## FROM (SELECT `ROW_ID`, `SUBJECT_ID`, `HADM_ID`, `ITEMID`
## FROM `CHARTEVENTS`)
## WHERE ((`SUBJECT_ID` >= 23.0) AND (`SUBJECT_ID` <= 165.0))
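
The same SQL can also be sent directly with DBI's dbGetQuery(), which executes the query and returns an ordinary data frame (a sketch; the LIMIT 10 is added here only to keep the result small):

dbGetQuery(con, "
  SELECT ROW_ID, SUBJECT_ID, HADM_ID, ITEMID
  FROM CHARTEVENTS
  WHERE SUBJECT_ID >= 23 AND SUBJECT_ID <= 165
  LIMIT 10
  ")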

SQL grouping

chartevents %>%
  group_by(SUBJECT_ID) %>%
  summarise(n = n())
## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.30.1
## #   [/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/mimiciii.sqlite]
##    SUBJECT_ID     n
##         <int> <int>
##  1         23   320
##  2         34   346
##  3         36  1342
##  4         85   396
##  5        107   763
##  6        109 12471
##  7        111   351
##  8        124 10712
##  9        154   422
## 10        165   217
## # … with more rows

SQL query

chartevents %>%
  group_by(SUBJECT_ID) %>%
  summarise(n = n()) %>%
  show_query()
## <SQL>
## SELECT `SUBJECT_ID`, COUNT() AS `n`
## FROM `CHARTEVENTS`
## GROUP BY `SUBJECT_ID`

SQL translation

The dbplyr package (the database backend for dplyr) has a function, translate_sql(), that lets you experiment with how R expressions are translated to SQL:

library("dbplyr")
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
translate_sql(x == 1 & (y < 2 | z > 3))
## <SQL> `x` = 1.0 AND (`y` < 2.0 OR `z` > 3.0)
translate_sql(x ^ 2 < 10)
## <SQL> POWER(`x`, 2.0) < 10.0
translate_sql(x %% 2 == 10)
## <SQL> `x` % 2.0 = 10.0
translate_sql(paste(x, y))
## <SQL> CONCAT_WS(' ', `x`, `y`)
translate_sql(mean(x))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## <SQL> AVG(`x`) OVER ()
translate_sql(mean(x, na.rm = TRUE))
## <SQL> AVG(`x`) OVER ()

Timings

Let's compare the timings of dplyr (on an in-memory tibble) and dbplyr (on the on-disk SQLite database).

csvfile <- str_c(datafolder, "/DATETIMEEVENTS.csv")
dtevents_tibble <- read_csv(csvfile)
## Parsed with column specification:
## cols(
##   ROW_ID = col_double(),
##   SUBJECT_ID = col_double(),
##   HADM_ID = col_double(),
##   ICUSTAY_ID = col_double(),
##   ITEMID = col_double(),
##   CHARTTIME = col_datetime(format = ""),
##   STORETIME = col_datetime(format = ""),
##   CGID = col_double(),
##   VALUE = col_datetime(format = ""),
##   VALUEUOM = col_character(),
##   WARNING = col_logical(),
##   ERROR = col_logical(),
##   RESULTSTATUS = col_logical(),
##   STOPPED = col_character()
## )
system.time(
  dtevents_tibble %>%
  select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
  # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
  group_by(SUBJECT_ID) %>%
  summarize(n = n())
)
##    user  system elapsed 
##   0.202   0.011   0.214
dtevents_sql <- tbl(con, "DATETIMEEVENTS")
system.time(
  dtevents_sql %>%
  select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
  # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
  group_by(SUBJECT_ID) %>%
  summarize(n = n())
)
##    user  system elapsed 
##   0.006   0.000   0.006

According to these timings, SQLite (about 0.006 seconds elapsed) appears much faster than the in-memory tibble (about 0.21 seconds elapsed). But SQLite is disk-based, while the tibble sits entirely in memory. Why the discrepancy?

Laziness

dplyr/dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends. The timing above therefore only measures how long it takes to construct the lazy query; the database is not actually queried until we ask for the results.

dtevents_sql %>%
  select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
  # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
  group_by(SUBJECT_ID) %>%
  summarize(n = n())
## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.30.1
## #   [/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/mimiciii.sqlite]
##    SUBJECT_ID     n
##         <int> <int>
##  1          6    60
##  2          8     5
##  3         17    69
##  4         21   259
##  5         23    18
##  6         25     2
##  7         34    27
##  8         35   141
##  9         36   201
## 10         38   877
## # … with more rows

Full query

To force the full query and return the complete result into R as a tibble, use the collect() function.

system.time(
  dtevents_sql %>%
  select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
  # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
  group_by(SUBJECT_ID) %>%
  summarize(n = n()) %>%
  collect()
)
##    user  system elapsed 
##   1.581   0.232   1.835

Plotting

Histogram

Suppose we want to visualize the distribution of the number of chart events per SUBJECT_ID.

chartevents %>%
  count(SUBJECT_ID) %>%
  ggplot() + 
  geom_histogram(mapping = aes(x = n))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Apparently the current version of ggplot2 (v3.2.1) now works directly with tbl_dbi/tbl_sql/tbl_lazy objects.

Older versions of ggplot2, e.g., v2.2.1, would throw an error. This is because ggplot2 needs to scan all the rows to compute the count in each bin, but chartevents is just a pointer to the SQLite table, not an in-memory data frame. In that case we had to use the "transform in database, plot in R" strategy: summarize in the database, collect() the result, then plot.

chartevents %>%
  count(SUBJECT_ID) %>%
  collect() %>%
  ggplot() + 
  geom_histogram(mapping = aes(x = n))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Other plots

For examples of making histograms and raster plots with databases, see the tutorial at https://db.rstudio.com/best-practices/visualization/.
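
As a small sketch of the same "transform in database, plot in R" pattern (the top-10 cutoff is arbitrary; geom_col() draws the bars): count events per ITEMID inside the database, pull only the summarized rows into R with collect(), then plot:

chartevents %>%
  count(ITEMID, sort = TRUE) %>%   # aggregation done by SQLite
  head(10) %>%                     # translated to LIMIT 10
  collect() %>%                    # bring only 10 rows into R
  ggplot() +
  geom_col(mapping = aes(x = factor(ITEMID), y = n))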

Close connection to database

dbDisconnect(con)