In this lecture we will learn:
Import data from biggish CSV files (MIMIC-III).
Tidy data (TODO).
Deposit data into an SQLite database.
Query SQLite database.
Transform in database and plot in R.
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.2 magrittr_1.5 tools_3.6.2 htmltools_0.4.0
## [5] yaml_2.2.1 Rcpp_1.0.3 stringi_1.4.6 rmarkdown_2.1
## [9] knitr_1.28 stringr_1.4.0 xfun_0.12 digest_0.6.24
## [13] rlang_0.4.4 evaluate_0.14
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
The /home/203bdata/mimic-iii folder on the teaching server contains the MIMIC-III data. On my Mac, it’s at /Users/huazhou/Google Drive/mimic/mimic-iii.
# On teaching server
ls -l /home/203bdata/mimic-iii
# On my MacBook
ls -l /Users/huazhou/Google\ Drive/mimic/mimic-iii
## total 22333688
## -rw------- 1 huazhou staff 12548562 Sep 2 2016 ADMISSIONS.csv
## -rw------- 1 huazhou staff 6339185 Sep 2 2016 CALLOUT.csv
## -rw------- 1 huazhou staff 203492 Sep 2 2016 CAREGIVERS.csv
## -rw------- 1 huazhou staff 85204883 Nov 30 2018 CHARTEVENTS.csv
## -rw------- 1 huazhou staff 58150883 Sep 2 2016 CPTEVENTS.csv
## -rw------- 1 huazhou staff 525785298 Sep 2 2016 DATETIMEEVENTS.csv
## -rw------- 1 huazhou staff 19137527 Oct 31 2018 DIAGNOSES_ICD.csv
## -rw------- 1 huazhou staff 10487132 Sep 2 2016 DRGCODES.csv
## -rw------- 1 huazhou staff 13807 Sep 2 2016 D_CPT.csv
## -rw------- 1 huazhou staff 1387562 Sep 2 2016 D_ICD_DIAGNOSES.csv
## -rw------- 1 huazhou staff 311466 Sep 2 2016 D_ICD_PROCEDURES.csv
## -rw------- 1 huazhou staff 954420 Sep 2 2016 D_ITEMS.csv
## -rw------- 1 huazhou staff 43118 Sep 2 2016 D_LABITEMS.csv
## -rw------- 1 huazhou staff 6357077 Sep 2 2016 ICUSTAYS.csv
## -rw------- 1 huazhou staff 2464296511 Sep 2 2016 INPUTEVENTS_CV.csv
## -rw------- 1 huazhou staff 975255812 Sep 2 2016 INPUTEVENTS_MV.csv
## -rw-r--r--@ 1 huazhou staff 0 Jan 13 14:43 Icon
## -rw------- 1 huazhou staff 1854245647 Nov 11 2016 LABEVENTS.csv
## -rw------- 1 huazhou staff 72507810 Oct 31 2018 MICROBIOLOGYEVENTS.csv
## -rw------- 1 huazhou staff 4007717810 Sep 30 2016 NOTEEVENTS.csv
## -rw------- 1 huazhou staff 396406750 Sep 2 2016 OUTPUTEVENTS.csv
## -rw------- 1 huazhou staff 2628900 Sep 2 2016 PATIENTS.csv
## -rw------- 1 huazhou staff 770336136 Sep 2 2016 PRESCRIPTIONS.csv
## -rw------- 1 huazhou staff 48770424 Sep 2 2016 PROCEDUREEVENTS_MV.csv
## -rw------- 1 huazhou staff 6798492 Sep 2 2016 PROCEDURES_ICD.csv
## -rw------- 1 huazhou staff 3481645 Sep 2 2016 SERVICES.csv
## -rw------- 1 huazhou staff 25057095 Sep 2 2016 TRANSFERS.csv
## -rw------- 1 huazhou staff 36517 Oct 31 2018 postgres_add_indexes.sql
## -rw------- 1 huazhou staff 4195 Oct 31 2018 postgres_checks.sql
## -rw------- 1 huazhou staff 20688 Oct 4 2018 postgres_create_tables.sql
## -rw------- 1 huazhou staff 6897 Oct 4 2018 postgres_load_data.sql
# data folder location depends on the platform
datafolder <- ifelse(Sys.info()[["sysname"]] == "Linux",
                     "/home/203bdata/mimic-iii",
                     "/Users/huazhou/Google Drive/mimic/mimic-iii")
list.files(datafolder, full.names = TRUE, pattern = "\\.csv$")
## [1] "/Users/huazhou/Google Drive/mimic/mimic-iii/ADMISSIONS.csv"
## [2] "/Users/huazhou/Google Drive/mimic/mimic-iii/CALLOUT.csv"
## [3] "/Users/huazhou/Google Drive/mimic/mimic-iii/CAREGIVERS.csv"
## [4] "/Users/huazhou/Google Drive/mimic/mimic-iii/CHARTEVENTS.csv"
## [5] "/Users/huazhou/Google Drive/mimic/mimic-iii/CPTEVENTS.csv"
## [6] "/Users/huazhou/Google Drive/mimic/mimic-iii/D_CPT.csv"
## [7] "/Users/huazhou/Google Drive/mimic/mimic-iii/D_ICD_DIAGNOSES.csv"
## [8] "/Users/huazhou/Google Drive/mimic/mimic-iii/D_ICD_PROCEDURES.csv"
## [9] "/Users/huazhou/Google Drive/mimic/mimic-iii/D_ITEMS.csv"
## [10] "/Users/huazhou/Google Drive/mimic/mimic-iii/D_LABITEMS.csv"
## [11] "/Users/huazhou/Google Drive/mimic/mimic-iii/DATETIMEEVENTS.csv"
## [12] "/Users/huazhou/Google Drive/mimic/mimic-iii/DIAGNOSES_ICD.csv"
## [13] "/Users/huazhou/Google Drive/mimic/mimic-iii/DRGCODES.csv"
## [14] "/Users/huazhou/Google Drive/mimic/mimic-iii/ICUSTAYS.csv"
## [15] "/Users/huazhou/Google Drive/mimic/mimic-iii/INPUTEVENTS_CV.csv"
## [16] "/Users/huazhou/Google Drive/mimic/mimic-iii/INPUTEVENTS_MV.csv"
## [17] "/Users/huazhou/Google Drive/mimic/mimic-iii/LABEVENTS.csv"
## [18] "/Users/huazhou/Google Drive/mimic/mimic-iii/MICROBIOLOGYEVENTS.csv"
## [19] "/Users/huazhou/Google Drive/mimic/mimic-iii/NOTEEVENTS.csv"
## [20] "/Users/huazhou/Google Drive/mimic/mimic-iii/OUTPUTEVENTS.csv"
## [21] "/Users/huazhou/Google Drive/mimic/mimic-iii/PATIENTS.csv"
## [22] "/Users/huazhou/Google Drive/mimic/mimic-iii/PRESCRIPTIONS.csv"
## [23] "/Users/huazhou/Google Drive/mimic/mimic-iii/PROCEDUREEVENTS_MV.csv"
## [24] "/Users/huazhou/Google Drive/mimic/mimic-iii/PROCEDURES_ICD.csv"
## [25] "/Users/huazhou/Google Drive/mimic/mimic-iii/SERVICES.csv"
## [26] "/Users/huazhou/Google Drive/mimic/mimic-iii/TRANSFERS.csv"
read.csv() vs read_csv() (tidyverse) vs data.table
First let’s compare different approaches for reading CSV files. The read.csv() function in base R takes over a minute to read the 500MB CSV file DATETIMEEVENTS.csv:
csvfile <- str_c(datafolder, "/DATETIMEEVENTS.csv")
system.time(read.csv(csvfile))
## user system elapsed
## 69.601 1.748 72.004
data.table is an R package for fast manipulation of large data sets. Its fread() function reads the same file much faster (a few seconds of elapsed time; the user time is higher because fread() is multi-threaded). The result is of class data.table, which also inherits from data.frame.
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
system.time({dtevents <- fread(csvfile)})
## user system elapsed
## 9.054 0.286 2.423
class(dtevents)
## [1] "data.table" "data.frame"
dtevents
## ROW_ID SUBJECT_ID HADM_ID ICUSTAY_ID ITEMID CHARTTIME
## 1: 711 7657 121183 297945 3411 2172-03-14 11:00:00
## 2: 712 7657 121183 297945 3411 2172-03-14 13:00:00
## 3: 713 7657 121183 297945 3411 2172-03-14 15:00:00
## 4: 714 7657 121183 297945 3411 2172-03-14 17:00:00
## 5: 715 7657 121183 297945 3411 2172-03-14 19:00:00
## ---
## 4485933: 4456093 99366 136021 218447 224279 2197-03-24 16:32:00
## 4485934: 4456094 99366 136021 218447 224280 2197-03-24 16:32:00
## 4485935: 4456095 99366 136021 218447 224282 2197-03-24 16:32:00
## 4485936: 4456096 99366 136021 218447 224284 2197-03-24 16:32:00
## 4485937: 4456097 99366 136021 218447 224287 2197-03-24 16:32:00
## STORETIME CGID VALUE VALUEUOM WARNING
## 1: 2172-03-14 11:52:00 16446 Date NA
## 2: 2172-03-14 12:36:00 16446 Date NA
## 3: 2172-03-14 15:10:00 14957 Date NA
## 4: 2172-03-14 17:01:00 16446 Date NA
## 5: 2172-03-14 19:29:00 14815 Date NA
## ---
## 4485933: 2197-03-24 16:32:00 18234 2197-03-24 13:03:00 Date and Time 0
## 4485934: 2197-03-24 16:32:00 18234 2197-03-24 00:00:00 Date 0
## 4485935: 2197-03-24 16:32:00 18234 2197-03-24 00:00:00 Date 0
## 4485936: 2197-03-24 16:32:00 18234 2197-03-24 00:00:00 Date 0
## 4485937: 2197-03-24 16:32:00 18234 2197-03-24 15:04:00 Date and Time 0
## ERROR RESULTSTATUS STOPPED
## 1: NA NA NotStopd
## 2: NA NA NotStopd
## 3: NA NA NotStopd
## 4: NA NA NotStopd
## 5: NA NA NotStopd
## ---
## 4485933: 0 NA
## 4485934: 0 NA
## 4485935: 0 NA
## 4485936: 0 NA
## 4485937: 0 NA
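As noted, fread() is multi-threaded, which is why its user time exceeds the elapsed time. A small sketch for inspecting or limiting the number of threads data.table uses (the choice of 2 threads is arbitrary):
# how many threads data.table is currently using
getDTthreads()
# limit data.table to 2 threads (arbitrary choice)
setDTthreads(2)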
The read_csv() function from the tidyverse (readr package) takes about 7 seconds:
system.time({dtevents <- read_csv(csvfile)})
## Parsed with column specification:
## cols(
## ROW_ID = col_double(),
## SUBJECT_ID = col_double(),
## HADM_ID = col_double(),
## ICUSTAY_ID = col_double(),
## ITEMID = col_double(),
## CHARTTIME = col_datetime(format = ""),
## STORETIME = col_datetime(format = ""),
## CGID = col_double(),
## VALUE = col_datetime(format = ""),
## VALUEUOM = col_character(),
## WARNING = col_logical(),
## ERROR = col_logical(),
## RESULTSTATUS = col_logical(),
## STOPPED = col_character()
## )
## user system elapsed
## 7.049 0.315 7.397
class(dtevents)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
dtevents %>% print(width = Inf)
## # A tibble: 4,485,937 x 14
## ROW_ID SUBJECT_ID HADM_ID ICUSTAY_ID ITEMID CHARTTIME
## <dbl> <dbl> <dbl> <dbl> <dbl> <dttm>
## 1 711 7657 121183 297945 3411 2172-03-14 11:00:00
## 2 712 7657 121183 297945 3411 2172-03-14 13:00:00
## 3 713 7657 121183 297945 3411 2172-03-14 15:00:00
## 4 714 7657 121183 297945 3411 2172-03-14 17:00:00
## 5 715 7657 121183 297945 3411 2172-03-14 19:00:00
## 6 716 7657 121183 297945 3411 2172-03-14 21:30:00
## 7 717 7657 121183 297945 3411 2172-03-15 00:00:00
## 8 718 7657 121183 297945 3411 2172-03-15 01:00:00
## 9 719 7657 121183 297945 3411 2172-03-15 03:30:00
## 10 720 7657 121183 297945 3411 2172-03-15 05:00:00
## STORETIME CGID VALUE VALUEUOM WARNING ERROR
## <dttm> <dbl> <dttm> <chr> <lgl> <lgl>
## 1 2172-03-14 11:52:00 16446 NA Date NA NA
## 2 2172-03-14 12:36:00 16446 NA Date NA NA
## 3 2172-03-14 15:10:00 14957 NA Date NA NA
## 4 2172-03-14 17:01:00 16446 NA Date NA NA
## 5 2172-03-14 19:29:00 14815 NA Date NA NA
## 6 2172-03-14 21:43:00 14815 NA Date NA NA
## 7 2172-03-15 00:27:00 14815 NA Date NA NA
## 8 2172-03-15 01:44:00 14815 NA Date NA NA
## 9 2172-03-15 03:48:00 14815 NA Date NA NA
## 10 2172-03-15 04:58:00 14815 NA Date NA NA
## RESULTSTATUS STOPPED
## <lgl> <chr>
## 1 NA NotStopd
## 2 NA NotStopd
## 3 NA NotStopd
## 4 NA NotStopd
## 5 NA NotStopd
## 6 NA NotStopd
## 7 NA NotStopd
## 8 NA NotStopd
## 9 NA NotStopd
## 10 NA NotStopd
## # … with 4,485,927 more rows
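If you don’t have MIMIC-III at hand, here is a self-contained sketch that reproduces the comparison on a synthetic CSV; the one million rows and the column layout are arbitrary choices of mine, and absolute timings will differ by machine:
# generate a synthetic CSV file to benchmark the three readers
n <- 1e6
tmpcsv <- tempfile(fileext = ".csv")
write.csv(data.frame(id = seq_len(n), x = rnorm(n),
                     grp = sample(letters, n, replace = TRUE)),
          tmpcsv, row.names = FALSE)
# time each reader on the same file
system.time(read.csv(tmpcsv))   # base R
system.time(fread(tmpcsv))      # data.table
system.time(read_csv(tmpcsv))   # readr (tidyverse)
unlink(tmpcsv)                  # clean up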
Create an SQLite database file called mimiciii.sqlite.
library("DBI")
library("RSQLite")
con <- dbConnect(RSQLite::SQLite(), "mimiciii.sqlite")
str(con)
## Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
## ..@ ptr :<externalptr>
## ..@ dbname : chr "/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/mimiciii.sqlite"
## ..@ loadable.extensions: logi TRUE
## ..@ flags : int 70
## ..@ vfs : chr ""
## ..@ ref :<environment: 0x7ff315e4fd18>
## ..@ bigint : chr "integer64"
Because of the excellent efficiency of the data.table package, we use fread() to read all the CSV files and deposit them into mimiciii.sqlite. Do not run this chunk yourself on the teaching server; it takes a couple of minutes.
datafolder
csv_names <- list.files(datafolder)
csv_names <- csv_names[str_detect(csv_names, "\\.csv$")]
csv_names
system.time({
  for (table_name in csv_names) {
    csv_name <- str_c(datafolder, "/", table_name)
    # read with data.table::fread, then write into SQLite
    tmp <- fread(csv_name)
    # table name = file name without the .csv extension
    table_name <- str_remove(table_name, "\\.csv$")
    dbWriteTable(con, table_name, tmp, overwrite = TRUE)
  }
})
# list tables in database
dbListTables(con)
## [1] "ADMISSIONS" "CALLOUT" "CAREGIVERS"
## [4] "CHARTEVENTS" "CPTEVENTS" "DATETIMEEVENTS"
## [7] "DIAGNOSES_ICD" "DRGCODES" "D_CPT"
## [10] "D_ICD_DIAGNOSES" "D_ICD_PROCEDURES" "D_ITEMS"
## [13] "D_LABITEMS" "ICUSTAYS" "INPUTEVENTS_CV"
## [16] "INPUTEVENTS_MV" "LABEVENTS" "MICROBIOLOGYEVENTS"
## [19] "NOTEEVENTS" "OUTPUTEVENTS" "PATIENTS"
## [22] "PRESCRIPTIONS" "PROCEDUREEVENTS_MV" "PROCEDURES_ICD"
## [25] "SERVICES" "TRANSFERS"
# disconnect from database
dbDisconnect(con)
The resulting database file mimiciii.sqlite is about 10.4 GB:
ls -l mimiciii.sqlite
## -rw-r--r-- 1 huazhou staff 11160363008 Feb 18 11:20 mimiciii.sqlite
Connect to the database mimiciii.sqlite and list the tables:
con <- dbConnect(RSQLite::SQLite(), dbname = "./mimiciii.sqlite")
dbListTables(con)
## [1] "ADMISSIONS" "CALLOUT" "CAREGIVERS"
## [4] "CHARTEVENTS" "CPTEVENTS" "DATETIMEEVENTS"
## [7] "DIAGNOSES_ICD" "DRGCODES" "D_CPT"
## [10] "D_ICD_DIAGNOSES" "D_ICD_PROCEDURES" "D_ITEMS"
## [13] "D_LABITEMS" "ICUSTAYS" "INPUTEVENTS_CV"
## [16] "INPUTEVENTS_MV" "LABEVENTS" "MICROBIOLOGYEVENTS"
## [19] "NOTEEVENTS" "OUTPUTEVENTS" "PATIENTS"
## [22] "PRESCRIPTIONS" "PROCEDUREEVENTS_MV" "PROCEDURES_ICD"
## [25] "SERVICES" "TRANSFERS"
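dplyr verbs are not the only way to query the database: DBI can send raw SQL directly. A minimal sketch, assuming the standard MIMIC-III PATIENTS schema (the LIMIT of 5 is an arbitrary choice):
# a raw SQL query via DBI; the result is an ordinary data frame
dbGetQuery(con, "SELECT SUBJECT_ID, GENDER, DOB FROM PATIENTS LIMIT 5")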
Create a lazy reference to the table CHARTEVENTS (no data is read into R yet):
chartevents <- tbl(con, "CHARTEVENTS")
class(chartevents)
## [1] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql"
## [4] "tbl_lazy" "tbl"
chartevents %>% print(width = Inf)
## # Source: table<CHARTEVENTS> [?? x 15]
## # Database: sqlite 3.30.1
## # [/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/mimiciii.sqlite]
## ROW_ID SUBJECT_ID HADM_ID ICUSTAY_ID ITEMID CHARTTIME STORETIME CGID
## <int> <int> <int> <int> <int> <chr> <chr> <int>
## 1 788 36 165660 241249 223834 5/12/34 12:00 5/12/34 13:56 17525
## 2 789 36 165660 241249 223835 5/12/34 12:00 5/12/34 13:56 17525
## 3 790 36 165660 241249 224328 5/12/34 12:00 5/12/34 12:18 20823
## 4 791 36 165660 241249 224329 5/12/34 12:00 5/12/34 12:19 20823
## 5 792 36 165660 241249 224330 5/12/34 12:00 5/12/34 12:19 20823
## 6 793 36 165660 241249 224331 5/12/34 12:00 5/12/34 12:19 20823
## 7 794 36 165660 241249 224332 5/12/34 12:00 5/12/34 14:44 17525
## 8 795 36 165660 241249 224663 5/12/34 12:00 5/12/34 14:44 17525
## 9 796 36 165660 241249 224665 5/12/34 12:00 5/12/34 14:44 17525
## 10 797 36 165660 241249 220224 5/12/34 12:35 5/12/34 12:38 20889
## VALUE VALUENUM VALUEUOM WARNING ERROR RESULTSTATUS STOPPED
## <dbl> <dbl> <chr> <int> <int> <int> <int>
## 1 15 15 "L/min" 0 0 NA NA
## 2 100 100 "" 0 0 NA NA
## 3 0.37 0.37 "" 0 0 NA NA
## 4 6 6 "min" 0 0 NA NA
## 5 2.5 2.5 "" 0 0 NA NA
## 6 0 0 "ml/hr" 0 0 NA NA
## 7 3 3 "" 0 0 NA NA
## 8 8 8 "" 0 0 NA NA
## 9 1.11 1.11 "" 0 0 NA NA
## 10 58 58 "mmHg" 1 0 NA NA
## # … with more rows
Retrieve chart events with SUBJECT_ID between 23 and 165:
chartevents_subset <- chartevents %>%
  select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
  filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165)
chartevents_subset
## # Source: lazy query [?? x 4]
## # Database: sqlite 3.30.1
## # [/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/mimiciii.sqlite]
## ROW_ID SUBJECT_ID HADM_ID ITEMID
## <int> <int> <int> <int>
## 1 788 36 165660 223834
## 2 789 36 165660 223835
## 3 790 36 165660 224328
## 4 791 36 165660 224329
## 5 792 36 165660 224330
## 6 793 36 165660 224331
## 7 794 36 165660 224332
## 8 795 36 165660 224663
## 9 796 36 165660 224665
## 10 797 36 165660 220224
## # … with more rows
class(chartevents_subset)
## [1] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql"
## [4] "tbl_lazy" "tbl"
show_query(chartevents_subset)
## <SQL>
## SELECT *
## FROM (SELECT `ROW_ID`, `SUBJECT_ID`, `HADM_ID`, `ITEMID`
## FROM `CHARTEVENTS`)
## WHERE ((`SUBJECT_ID` >= 23.0) AND (`SUBJECT_ID` <= 165.0))
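If filters on SUBJECT_ID are common, such queries can be sped up considerably by indexing that column, in the spirit of the postgres_add_indexes.sql script that ships with MIMIC-III. A sketch via DBI (the index name is my own choice):
# index SUBJECT_ID so the WHERE clause above can avoid a full table scan
dbExecute(con, "CREATE INDEX IF NOT EXISTS chartevents_subject_id_idx
                ON CHARTEVENTS (SUBJECT_ID)")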
chartevents %>%
  group_by(SUBJECT_ID) %>%
  summarise(n = n())
## # Source: lazy query [?? x 2]
## # Database: sqlite 3.30.1
## # [/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/mimiciii.sqlite]
## SUBJECT_ID n
## <int> <int>
## 1 23 320
## 2 34 346
## 3 36 1342
## 4 85 396
## 5 107 763
## 6 109 12471
## 7 111 351
## 8 124 10712
## 9 154 422
## 10 165 217
## # … with more rows
chartevents %>%
  group_by(SUBJECT_ID) %>%
  summarise(n = n()) %>%
  show_query()
## <SQL>
## SELECT `SUBJECT_ID`, COUNT() AS `n`
## FROM `CHARTEVENTS`
## GROUP BY `SUBJECT_ID`
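We can paste the generated SQL back into the database ourselves to confirm it gives the same counts; a quick sketch (I write COUNT(*) instead of the COUNT() shorthand above and add a LIMIT to keep the output small):
# run the translated query directly through DBI
dbGetQuery(con,
           "SELECT SUBJECT_ID, COUNT(*) AS n
            FROM CHARTEVENTS
            GROUP BY SUBJECT_ID
            LIMIT 10")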
The dbplyr package (the dplyr backend for databases) has a function, translate_sql(), that lets you experiment with how R expressions are translated to SQL:
library("dbplyr")
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
translate_sql(x == 1 & (y < 2 | z > 3))
## <SQL> `x` = 1.0 AND (`y` < 2.0 OR `z` > 3.0)
translate_sql(x ^ 2 < 10)
## <SQL> POWER(`x`, 2.0) < 10.0
translate_sql(x %% 2 == 10)
## <SQL> `x` % 2.0 = 10.0
translate_sql(paste(x, y))
## <SQL> CONCAT_WS(' ', `x`, `y`)
translate_sql(mean(x))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## <SQL> AVG(`x`) OVER ()
translate_sql(mean(x, na.rm = TRUE))
## <SQL> AVG(`x`) OVER ()
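translate_sql() can also target a specific backend’s dialect through its con argument; a sketch, assuming your dbplyr version exports the simulate_*() helpers:
# the same R expression rendered for two different backends
translate_sql(paste(x, y), con = simulate_sqlite())
translate_sql(paste(x, y), con = simulate_postgres())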
Let’s compare the timings of dplyr (in-memory tibble) and dbplyr (on-disk SQLite database).
csvfile <- str_c(datafolder, "/DATETIMEEVENTS.csv")
dtevents_tibble <- read_csv(csvfile)
## Parsed with column specification:
## cols(
## ROW_ID = col_double(),
## SUBJECT_ID = col_double(),
## HADM_ID = col_double(),
## ICUSTAY_ID = col_double(),
## ITEMID = col_double(),
## CHARTTIME = col_datetime(format = ""),
## STORETIME = col_datetime(format = ""),
## CGID = col_double(),
## VALUE = col_datetime(format = ""),
## VALUEUOM = col_character(),
## WARNING = col_logical(),
## ERROR = col_logical(),
## RESULTSTATUS = col_logical(),
## STOPPED = col_character()
## )
system.time(
  dtevents_tibble %>%
    select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
    # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
    group_by(SUBJECT_ID) %>%
    summarize(n = n())
)
## user system elapsed
## 0.202 0.011 0.214
dtevents_sql <- tbl(con, "DATETIMEEVENTS")
system.time(
  dtevents_sql %>%
    select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
    # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
    group_by(SUBJECT_ID) %>%
    summarize(n = n())
)
## user system elapsed
## 0.006 0.000 0.006
SQLite (about 0.006 seconds) appears much faster than the in-memory tibble (about 0.2 seconds). But SQLite is disk-based, while the tibble is in memory. Why the discrepancy?
dplyr/dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends:
When building a query, we don’t want the entire table; often we want just enough rows to check that the query is working.
Since we would rather run one complex query than many simple ones, laziness lets verbs be strung together before anything is executed.
Therefore, by default dplyr won’t connect to and query the database until absolutely necessary (e.g., to show output), and unless explicitly told to, will only retrieve a handful of rows to give a sense of what the result will look like:
dtevents_sql %>%
  select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
  # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
  group_by(SUBJECT_ID) %>%
  summarize(n = n())
## # Source: lazy query [?? x 2]
## # Database: sqlite 3.30.1
## # [/Users/huazhou/Documents/github.com/ucla-biostat203b-2020winter.github.io/slides/12-dbplyr/mimiciii.sqlite]
## SUBJECT_ID n
## <int> <int>
## 1 6 60
## 2 8 5
## 3 17 69
## 4 21 259
## 5 23 18
## 6 25 2
## 7 34 27
## 8 35 141
## 9 36 201
## 10 38 877
## # … with more rows
To force the full query and return the complete result to R, use the collect() function.
system.time(
  dtevents_sql %>%
    select(ROW_ID, SUBJECT_ID, HADM_ID, ITEMID) %>%
    # filter(SUBJECT_ID >= 23, SUBJECT_ID <= 165) %>%
    group_by(SUBJECT_ID) %>%
    summarize(n = n()) %>%
    collect()
)
## user system elapsed
## 1.581 0.232 1.835
Suppose we want a histogram of the number of chart events per SUBJECT_ID.
chartevents %>%
  count(SUBJECT_ID) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = n))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Apparently the current version of ggplot2 (v3.2.1) works with tbl_dbi/tbl_sql/tbl_lazy objects now.
Older versions of ggplot2, e.g., v2.2.1, would throw an error, because ggplot2 needs to go through all the rows to compute the count per bin, while chartevents is just a pointer to the SQLite table. In that case we had to use the transform-in-database, plot-in-R strategy:
chartevents %>%
  count(SUBJECT_ID) %>%
  collect() %>%
  ggplot() +
  geom_histogram(mapping = aes(x = n))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
For examples of making histograms and raster plots directly from database tables, read the tutorial at https://db.rstudio.com/best-practices/visualization/.
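As one more instance of the transform-in-database, plot-in-R strategy, we can even compute histogram bins inside SQLite and collect only the small summary table. A sketch (the bin width of 100 events is arbitrary; as.integer() is used because older SQLite builds lack a FLOOR function):
chartevents %>%
  count(SUBJECT_ID) %>%
  # 100-wide bins, computed in the database (CAST truncates toward zero)
  mutate(bin = as.integer(n / 100) * 100) %>%
  group_by(bin) %>%
  summarise(num_subjects = n()) %>%
  collect() %>%
  ggplot() +
  geom_col(mapping = aes(x = bin, y = num_subjects))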
dbDisconnect(con)