Introduction

In this lecture we will learn to:

- deposit the MIMIC-III CSV files into a SQLite database using DBI, RSQLite, and data.table
- query the database with dplyr/dbplyr and inspect the SQL it generates
- understand lazy evaluation and use collect() to bring results into R
- plot summaries computed in the database

Machine information

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.2  magrittr_1.5    tools_3.6.2     htmltools_0.4.0
##  [5] yaml_2.2.1      Rcpp_1.0.3      stringi_1.4.6   rmarkdown_2.1  
##  [9] knitr_1.28      stringr_1.4.0   xfun_0.12       digest_0.6.24  
## [13] rlang_0.4.4     evaluate_0.14
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

CSV files

The /home/203bdata/mimic-iii folder on the teaching server contains the MIMIC-III data. On my Mac, it’s at /Users/huazhou/Google Drive/mimic/mimic-iii.

# On teaching server
ls -l /home/203bdata/mimic-iii
# On my MacBook
ls -l /Users/huazhou/Google\ Drive/mimic/mimic-iii
## total 22333688
## -rw-------  1 huazhou  staff    12548562 Sep  2  2016 ADMISSIONS.csv
## -rw-------  1 huazhou  staff     6339185 Sep  2  2016 CALLOUT.csv
## -rw-------  1 huazhou  staff      203492 Sep  2  2016 CAREGIVERS.csv
## -rw-------  1 huazhou  staff    85204883 Nov 30  2018 CHARTEVENTS.csv
## -rw-------  1 huazhou  staff    58150883 Sep  2  2016 CPTEVENTS.csv
## -rw-------  1 huazhou  staff   525785298 Sep  2  2016 DATETIMEEVENTS.csv
## -rw-------  1 huazhou  staff    19137527 Oct 31  2018 DIAGNOSES_ICD.csv
## -rw-------  1 huazhou  staff    10487132 Sep  2  2016 DRGCODES.csv
## -rw-------  1 huazhou  staff       13807 Sep  2  2016 D_CPT.csv
## -rw-------  1 huazhou  staff     1387562 Sep  2  2016 D_ICD_DIAGNOSES.csv
## -rw-------  1 huazhou  staff      311466 Sep  2  2016 D_ICD_PROCEDURES.csv
## -rw-------  1 huazhou  staff      954420 Sep  2  2016 D_ITEMS.csv
## -rw-------  1 huazhou  staff       43118 Sep  2  2016 D_LABITEMS.csv
## -rw-------  1 huazhou  staff     6357077 Sep  2  2016 ICUSTAYS.csv
## -rw-------  1 huazhou  staff  2464296511 Sep  2  2016 INPUTEVENTS_CV.csv
## -rw-------  1 huazhou  staff   975255812 Sep  2  2016 INPUTEVENTS_MV.csv
## -rw-r--r--@ 1 huazhou  staff           0 Jan 13 14:43 Icon
## -rw-------  1 huazhou  staff  1854245647 Nov 11  2016 LABEVENTS.csv
## -rw-------  1 huazhou  staff    72507810 Oct 31  2018 MICROBIOLOGYEVENTS.csv
## -rw-------  1 huazhou  staff  4007717810 Sep 30  2016 NOTEEVENTS.csv
## -rw-------  1 huazhou  staff   396406750 Sep  2  2016 OUTPUTEVENTS.csv
## -rw-------  1 huazhou  staff     2628900 Sep  2  2016 PATIENTS.csv
## -rw-------  1 huazhou  staff   770336136 Sep  2  2016 PRESCRIPTIONS.csv
## -rw-------  1 huazhou  staff    48770424 Sep  2  2016 PROCEDUREEVENTS_MV.csv
## -rw-------  1 huazhou  staff     6798492 Sep  2  2016 PROCEDURES_ICD.csv
## -rw-------  1 huazhou  staff     3481645 Sep  2  2016 SERVICES.csv
## -rw-------  1 huazhou  staff    25057095 Sep  2  2016 TRANSFERS.csv
## -rw-------  1 huazhou  staff       36517 Oct 31  2018 postgres_add_indexes.sql
## -rw-------  1 huazhou  staff        4195 Oct 31  2018 postgres_checks.sql
## -rw-------  1 huazhou  staff       20688 Oct  4  2018 postgres_create_tables.sql
## -rw-------  1 huazhou  staff        6897 Oct  4  2018 postgres_load_data.sql
datafolder <- ifelse(Sys.info()[["sysname"]] == "Linux",
                     "/home/203bdata/mimic-iii",
                     "/Users/huazhou/Google Drive/mimic/mimic-iii")
list.files(datafolder, full.names = TRUE, pattern = "\\.csv$")
##  [1] "/Users/huazhou/Google Drive/mimic/mimic-iii/ADMISSIONS.csv"        
##  [2] "/Users/huazhou/Google Drive/mimic/mimic-iii/CALLOUT.csv"           
##  [3] "/Users/huazhou/Google Drive/mimic/mimic-iii/CAREGIVERS.csv"        
##  [4] "/Users/huazhou/Google Drive/mimic/mimic-iii/CHARTEVENTS.csv"       
##  [5] "/Users/huazhou/Google Drive/mimic/mimic-iii/CPTEVENTS.csv"         
##  [6] "/Users/huazhou/Google Drive/mimic/mimic-iii/D_CPT.csv"             
##  [7] "/Users/huazhou/Google Drive/mimic/mimic-iii/D_ICD_DIAGNOSES.csv"   
##  [8] "/Users/huazhou/Google Drive/mimic/mimic-iii/D_ICD_PROCEDURES.csv"  
##  [9] "/Users/huazhou/Google Drive/mimic/mimic-iii/D_ITEMS.csv"           
## [10] "/Users/huazhou/Google Drive/mimic/mimic-iii/D_LABITEMS.csv"        
## [11] "/Users/huazhou/Google Drive/mimic/mimic-iii/DATETIMEEVENTS.csv"    
## [12] "/Users/huazhou/Google Drive/mimic/mimic-iii/DIAGNOSES_ICD.csv"     
## [13] "/Users/huazhou/Google Drive/mimic/mimic-iii/DRGCODES.csv"          
## [14] "/Users/huazhou/Google Drive/mimic/mimic-iii/ICUSTAYS.csv"          
## [15] "/Users/huazhou/Google Drive/mimic/mimic-iii/INPUTEVENTS_CV.csv"    
## [16] "/Users/huazhou/Google Drive/mimic/mimic-iii/INPUTEVENTS_MV.csv"    
## [17] "/Users/huazhou/Google Drive/mimic/mimic-iii/LABEVENTS.csv"         
## [18] "/Users/huazhou/Google Drive/mimic/mimic-iii/MICROBIOLOGYEVENTS.csv"
## [19] "/Users/huazhou/Google Drive/mimic/mimic-iii/NOTEEVENTS.csv"        
## [20] "/Users/huazhou/Google Drive/mimic/mimic-iii/OUTPUTEVENTS.csv"      
## [21] "/Users/huazhou/Google Drive/mimic/mimic-iii/PATIENTS.csv"          
## [22] "/Users/huazhou/Google Drive/mimic/mimic-iii/PRESCRIPTIONS.csv"     
## [23] "/Users/huazhou/Google Drive/mimic/mimic-iii/PROCEDUREEVENTS_MV.csv"
## [24] "/Users/huazhou/Google Drive/mimic/mimic-iii/PROCEDURES_ICD.csv"    
## [25] "/Users/huazhou/Google Drive/mimic/mimic-iii/SERVICES.csv"          
## [26] "/Users/huazhou/Google Drive/mimic/mimic-iii/TRANSFERS.csv"

Read csv file: read.csv() vs read_csv() (tidyverse) vs data.table

First, let's compare different approaches for reading CSV files: base R's read.csv(), readr's read_csv() (tidyverse), and data.table's fread().
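
A minimal timing sketch (not evaluated in these notes) comparing the three readers on one of the smaller tables, ADMISSIONS.csv; the result variable names are just placeholders, and exact timings depend on the machine, but fread() is typically the fastest:

library(data.table)
csvfile <- str_c(datafolder, "/ADMISSIONS.csv")
system.time(admissions_base <- read.csv(csvfile))   # base R
system.time(admissions_tbl  <- read_csv(csvfile))   # readr (tidyverse)
system.time(admissions_dt   <- fread(csvfile))      # data.table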

Create a SQLite database file called mimiciii.sqlite.

library("DBI")
library("RSQLite")
con <- dbConnect(RSQLite::SQLite(), "mimiciii.sqlite")
str(con)
## Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
##   ..@ ptr                :<externalptr> 
##   ..@ dbname             : chr "/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/mimiciii.sqlite"
##   ..@ loadable.extensions: logi TRUE
##   ..@ flags              : int 70
##   ..@ vfs                : chr ""
##   ..@ ref                :<environment: 0x7ff315e4fd18> 
##   ..@ bigint             : chr "integer64"

Because of the excellent efficiency of the data.table package, we use its fread() to read all the CSV files and deposit them into mimiciii.sqlite. Do not run this chunk yourself on the teaching server; it takes a couple of minutes.

library(data.table)   # provides fread()
datafolder
csv_names <- list.files(datafolder)
csv_names <- csv_names[str_detect(csv_names, "\\.csv$")]
csv_names
system.time({
  for (table_name in csv_names) {
    csv_name <- str_c(datafolder, "/", table_name)
    tmp <- fread(csv_name)                                # fast read of one CSV
    table_name <- str_remove(table_name, "\\.csv$")       # table name without extension
    dbWriteTable(con, table_name, tmp, overwrite = TRUE)  # write the table into SQLite
  }
})
# list tables in database
dbListTables(con)
##  [1] "ADMISSIONS"         "CALLOUT"            "CAREGIVERS"        
##  [4] "CHARTEVENTS"        "CPTEVENTS"          "DATETIMEEVENTS"    
##  [7] "DIAGNOSES_ICD"      "DRGCODES"           "D_CPT"             
## [10] "D_ICD_DIAGNOSES"    "D_ICD_PROCEDURES"   "D_ITEMS"           
## [13] "D_LABITEMS"         "ICUSTAYS"           "INPUTEVENTS_CV"    
## [16] "INPUTEVENTS_MV"     "LABEVENTS"          "MICROBIOLOGYEVENTS"
## [19] "NOTEEVENTS"         "OUTPUTEVENTS"       "PATIENTS"          
## [22] "PRESCRIPTIONS"      "PROCEDUREEVENTS_MV" "PROCEDURES_ICD"    
## [25] "SERVICES"           "TRANSFERS"
# disconnect from database
dbDisconnect(con)

The resulting database file mimiciii.sqlite is about 10.4 GB.

ls -l mimiciii.sqlite
## -rw-r--r--  1 huazhou  staff  11160363008 Feb 18 11:20 mimiciii.sqlite
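
Equivalently, we can check the size from within R; base R's file.size() returns the size in bytes:

file.size("mimiciii.sqlite")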

Read data from database

Connect to the database mimiciii.sqlite and list the tables:

con <- dbConnect(RSQLite::SQLite(), dbname = "./mimiciii.sqlite")
dbListTables(con)
##  [1] "ADMISSIONS"         "CALLOUT"            "CAREGIVERS"        
##  [4] "CHARTEVENTS"        "CPTEVENTS"          "DATETIMEEVENTS"    
##  [7] "DIAGNOSES_ICD"      "DRGCODES"           "D_CPT"             
## [10] "D_ICD_DIAGNOSES"    "D_ICD_PROCEDURES"   "D_ITEMS"           
## [13] "D_LABITEMS"         "ICUSTAYS"           "INPUTEVENTS_CV"    
## [16] "INPUTEVENTS_MV"     "LABEVENTS"          "MICROBIOLOGYEVENTS"
## [19] "NOTEEVENTS"         "OUTPUTEVENTS"       "PATIENTS"          
## [22] "PRESCRIPTIONS"      "PROCEDUREEVENTS_MV" "PROCEDURES_ICD"    
## [25] "SERVICES"           "TRANSFERS"

Create a reference to the table CHARTEVENTS (this does not read the data into memory):

chartevents <- tbl(con, "CHARTEVENTS")
class(chartevents)
## [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
## [4] "tbl_lazy"             "tbl"
chartevents %>% print(width = Inf)
## # Source:   table<CHARTEVENTS> [?? x 15]
## # Database: sqlite 3.30.1
## #   [/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/mimiciii.sqlite]
##    ROW_ID SUBJECT_ID HADM_ID ICUSTAY_ID ITEMID CHARTTIME     STORETIME      CGID
##     <int>      <int>   <int>      <int>  <int> <chr>         <chr>         <int>
##  1    788         36  165660     241249 223834 5/12/34 12:00 5/12/34 13:56 17525
##  2    789         36  165660     241249 223835 5/12/34 12:00 5/12/34 13:56 17525
##  3    790         36  165660     241249 224328 5/12/34 12:00 5/12/34 12:18 20823
##  4    791         36  165660     241249 224329 5/12/34 12:00 5/12/34 12:19 20823
##  5    792         36  165660     241249 224330 5/12/34 12:00 5/12/34 12:19 20823
##  6    793         36  165660     241249 224331 5/12/34 12:00 5/12/34 12:19 20823
##  7    794         36  165660     241249 224332 5/12/34 12:00 5/12/34 14:44 17525
##  8    795         36  165660     241249 224663 5/12/34 12:00 5/12/34 14:44 17525
##  9    796         36  165660     241249 224665 5/12/34 12:00 5/12/34 14:44 17525
## 10    797         36  165660     241249 220224 5/12/34 12:35 5/12/34 12:38 20889
##     VALUE VALUENUM VALUEUOM WARNING ERROR RESULTSTATUS STOPPED
##     <dbl>    <dbl> <chr>      <int> <int>        <int>   <int>
##  1  15       15    "L/min"        0     0           NA      NA
##  2 100      100    ""             0     0           NA      NA
##  3   0.37     0.37 ""             0     0           NA      NA
##  4   6        6    "min"          0     0           NA      NA
##  5   2.5      2.5  ""             0     0           NA      NA
##  6   0        0    "ml/hr"        0     0           NA      NA
##  7   3        3    ""             0     0           NA      NA
##  8   8        8    ""             0     0           NA      NA
##  9   1.11     1.11 ""             0     0           NA      NA
## 10  58       58    "mmHg"         1     0           NA      NA
## # … with more rows

Use dplyr with SQLite

Retrieve the chart events with SUBJECT_ID between 23 and 165:

chartevents_subset <- chartevents %>%
  select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
  filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165)
chartevents_subset
## # Source:   lazy query [?? x 4]
## # Database: sqlite 3.30.1
## #   [/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/mimiciii.sqlite]
##    ROW_ID SUBJECT_ID HADM_ID ITEMID
##     <int>      <int>   <int>  <int>
##  1    788         36  165660 223834
##  2    789         36  165660 223835
##  3    790         36  165660 224328
##  4    791         36  165660 224329
##  5    792         36  165660 224330
##  6    793         36  165660 224331
##  7    794         36  165660 224332
##  8    795         36  165660 224663
##  9    796         36  165660 224665
## 10    797         36  165660 220224
## # … with more rows

SQL query

class(chartevents_subset)
## [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
## [4] "tbl_lazy"             "tbl"
show_query(chartevents_subset)
## <SQL>
## SELECT *
## FROM (SELECT `ROW_ID`, `SUBJECT_ID`, `HADM_ID`, `ITEMID`
## FROM `CHARTEVENTS`)
## WHERE ((`SUBJECT_ID` >= 23.0) AND (`SUBJECT_ID` <= 165.0))
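
The same SQL can also be sent directly with DBI's dbGetQuery(), which executes the query and returns an ordinary data frame (a sketch; the LIMIT 10 is added here only to keep the result small):

dbGetQuery(con, "
  SELECT ROW_ID, SUBJECT_ID, HADM_ID, ITEMID
  FROM CHARTEVENTS
  WHERE SUBJECT_ID >= 23 AND SUBJECT_ID <= 165
  LIMIT 10
  ")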

SQL grouping

chartevents %>%
  group_by(SUBJECT_ID) %>%
  summarise(n = n())
## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.30.1
## #   [/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/mimiciii.sqlite]
##    SUBJECT_ID     n
##         <int> <int>
##  1         23   320
##  2         34   346
##  3         36  1342
##  4         85   396
##  5        107   763
##  6        109 12471
##  7        111   351
##  8        124 10712
##  9        154   422
## 10        165   217
## # … with more rows

SQL query

chartevents %>%
  group_by(SUBJECT_ID) %>%
  summarise(n = n()) %>%
  show_query()
## <SQL>
## SELECT `SUBJECT_ID`, COUNT() AS `n`
## FROM `CHARTEVENTS`
## GROUP BY `SUBJECT_ID`

SQL translation

The dbplyr package (the database backend for dplyr) has a function, translate_sql(), that lets you experiment with how R expressions are translated to SQL:

library("dbplyr")
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
translate_sql(x == 1 & (y < 2 | z > 3))
## <SQL> `x` = 1.0 AND (`y` < 2.0 OR `z` > 3.0)
translate_sql(x ^ 2 < 10)
## <SQL> POWER(`x`, 2.0) < 10.0
translate_sql(x %% 2 == 10)
## <SQL> `x` % 2.0 = 10.0
translate_sql(paste(x, y))
## <SQL> CONCAT_WS(' ', `x`, `y`)
translate_sql(mean(x))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## <SQL> AVG(`x`) OVER ()
translate_sql(mean(x, na.rm = TRUE))
## <SQL> AVG(`x`) OVER ()

Timings

Let's compare the timings of dplyr (on an in-memory tibble) and dbplyr (on the on-disk SQLite database).

csvfile <- str_c(datafolder, "/DATETIMEEVENTS.csv")
dtevents_tibble <- read_csv(csvfile)
## Parsed with column specification:
## cols(
##   ROW_ID = col_double(),
##   SUBJECT_ID = col_double(),
##   HADM_ID = col_double(),
##   ICUSTAY_ID = col_double(),
##   ITEMID = col_double(),
##   CHARTTIME = col_datetime(format = ""),
##   STORETIME = col_datetime(format = ""),
##   CGID = col_double(),
##   VALUE = col_datetime(format = ""),
##   VALUEUOM = col_character(),
##   WARNING = col_logical(),
##   ERROR = col_logical(),
##   RESULTSTATUS = col_logical(),
##   STOPPED = col_character()
## )
system.time(
  dtevents_tibble %>%
  select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
  # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
  group_by(SUBJECT_ID) %>%
  summarize(n = n())
)
##    user  system elapsed 
##   0.202   0.011   0.214
dtevents_sql <- tbl(con, "DATETIMEEVENTS")
system.time(
  dtevents_sql %>%
  select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
  # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
  group_by(SUBJECT_ID) %>%
  summarize(n = n())
)
##    user  system elapsed 
##   0.006   0.000   0.006

According to these timings, SQLite (about 0.006 seconds elapsed) appears much faster than the in-memory tibble (about 0.21 seconds elapsed). But SQLite is disk-based, while the tibble sits entirely in memory. Why the discrepancy?

Laziness

dplyr/dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends. The timing above therefore only measures how long it takes to construct the lazy query; the database is not actually queried until we ask for the results.

dtevents_sql %>%
  select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
  # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
  group_by(SUBJECT_ID) %>%
  summarize(n = n())
## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.30.1
## #   [/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/mimiciii.sqlite]
##    SUBJECT_ID     n
##         <int> <int>
##  1          6    60
##  2          8     5
##  3         17    69
##  4         21   259
##  5         23    18
##  6         25     2
##  7         34    27
##  8         35   141
##  9         36   201
## 10         38   877
## # … with more rows

Full query

To force the full query and return the complete result into R as a tibble, use the collect() function.

system.time(
  dtevents_sql %>%
  select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
  # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
  group_by(SUBJECT_ID) %>%
  summarize(n = n()) %>%
  collect()
)
##    user  system elapsed 
##   1.581   0.232   1.835

Plotting

Histogram

Suppose we want to visualize the distribution of the number of chart events per SUBJECT_ID.

chartevents %>%
  count(SUBJECT_ID) %>%
  ggplot() + 
  geom_histogram(mapping = aes(x = n))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Apparently the current version of ggplot2 (v3.2.1) now works directly with tbl_dbi/tbl_sql/tbl_lazy objects.

Older versions of ggplot2, e.g., v2.2.1, would throw an error. This is because ggplot2 needs to scan all the rows to compute the count in each bin, but chartevents is just a pointer to the SQLite table, not an in-memory data frame. In that case we had to use the "transform in database, plot in R" strategy: summarize in the database, collect() the result, then plot.

chartevents %>%
  count(SUBJECT_ID) %>%
  collect() %>%
  ggplot() + 
  geom_histogram(mapping = aes(x = n))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Other plots

For examples of making histograms and raster plots with databases, see the tutorial at https://db.rstudio.com/best-practices/visualization/.
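
As a small sketch of the same "transform in database, plot in R" pattern (the top-10 cutoff is arbitrary; geom_col() draws the bars): count events per ITEMID inside the database, pull only the summarized rows into R with collect(), then plot:

chartevents %>%
  count(ITEMID, sort = TRUE) %>%   # aggregation done by SQLite
  head(10) %>%                     # translated to LIMIT 10
  collect() %>%                    # bring only 10 rows into R
  ggplot() +
  geom_col(mapping = aes(x = factor(ITEMID), y = n))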

Close connection to database

dbDisconnect(con)