Task

In this assignment, you are to write a report analyzing the electronic health record (EHR) data MIMIC-III. You report will demostrate your knowledge of working with PostgreSQL database, data visualization, and commonly used analytical methods such as logistic regression and neural network. Your report should include at least following parts:

  1. An informative title. For example, 30-Day Mortality Rate of Myocardia Infarction Patients Admitted to CCU.

  2. Introduction. Describe the MIMIC-III data set and what research hypothesis/goal you are to address using this data.

  3. Data preparation. Create a study cohort from MIMIC-III corresponding to your research hypothesis/goal. See the examplary code below. Use a CONSORT flow diagram to summarize your steps to create the cohort.

  4. Data visualization. Use visualization to summarize the cohort you created.

  5. Analytics. Use at least two analytical approaches to address your research hypothesis/goal. For example, we can use (1) logistic regression and (2) neural network to build a predictive model for the 30-day mortality rate of patients admitted into CCU and compare their predictive performance. Summarize your results in graphs.

  6. Conclusions.

Learning resources about analyzing EHR data

Examplary code

Connect to PostgresSQL database

Load database libraries and the tidyverse frontend:

library(DBI)
library(RPostgreSQL)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date

Credentials for using PostgreSQL database. We are going to use username postgres with password postgres to access the mimic database in the schemee mimiciii.

# Load configuration settings
dbdriver <- 'PostgreSQL'
#host  <- '127.0.0.1'
#port  <- '5432'
user  <- 'postgres'
password <- 'postgres'
dbname <- 'mimic'
schema <- 'mimiciii'
# Connect to the database using the configuration settings
con <- dbConnect(RPostgreSQL::PostgreSQL(), 
                 dbname = dbname, 
                 #host = host, 
                 #port = port, 
                 user = user, 
                 password = password)
# Set the default schema
dbExecute(con, paste("SET search_path TO ", schema, sep=" "))
## [1] 0
con
## <PostgreSQLConnection>

List tables in the mimic database:

dbListTables(con)
##   [1] "admissions"         "callout"            "caregivers"        
##   [4] "chartevents_1"      "chartevents"        "chartevents_2"     
##   [7] "chartevents_3"      "chartevents_4"      "chartevents_5"     
##  [10] "chartevents_6"      "chartevents_7"      "chartevents_8"     
##  [13] "chartevents_9"      "chartevents_10"     "inputevents_mv"    
##  [16] "labevents"          "chartevents_11"     "chartevents_12"    
##  [19] "chartevents_13"     "chartevents_14"     "microbiologyevents"
##  [22] "chartevents_15"     "chartevents_16"     "chartevents_17"    
##  [25] "chartevents_18"     "chartevents_19"     "chartevents_20"    
##  [28] "chartevents_21"     "chartevents_22"     "chartevents_23"    
##  [31] "noteevents"         "outputevents"       "chartevents_24"    
##  [34] "chartevents_25"     "chartevents_26"     "chartevents_27"    
##  [37] "chartevents_28"     "patients"           "chartevents_29"    
##  [40] "chartevents_30"     "chartevents_31"     "chartevents_32"    
##  [43] "chartevents_33"     "chartevents_34"     "chartevents_35"    
##  [46] "chartevents_36"     "chartevents_37"     "chartevents_38"    
##  [49] "chartevents_39"     "chartevents_40"     "chartevents_41"    
##  [52] "chartevents_42"     "chartevents_43"     "chartevents_44"    
##  [55] "chartevents_45"     "chartevents_46"     "chartevents_47"    
##  [58] "chartevents_48"     "chartevents_49"     "chartevents_50"    
##  [61] "chartevents_51"     "chartevents_52"     "chartevents_53"    
##  [64] "chartevents_54"     "chartevents_55"     "chartevents_56"    
##  [67] "chartevents_57"     "chartevents_58"     "chartevents_59"    
##  [70] "chartevents_60"     "chartevents_61"     "chartevents_62"    
##  [73] "chartevents_63"     "chartevents_64"     "chartevents_65"    
##  [76] "chartevents_66"     "chartevents_67"     "chartevents_68"    
##  [79] "chartevents_69"     "chartevents_70"     "chartevents_71"    
##  [82] "chartevents_72"     "chartevents_73"     "chartevents_74"    
##  [85] "chartevents_75"     "chartevents_76"     "chartevents_77"    
##  [88] "chartevents_78"     "chartevents_79"     "chartevents_80"    
##  [91] "chartevents_81"     "chartevents_82"     "chartevents_83"    
##  [94] "chartevents_84"     "chartevents_85"     "chartevents_86"    
##  [97] "chartevents_87"     "chartevents_88"     "chartevents_89"    
## [100] "chartevents_90"     "chartevents_91"     "chartevents_92"    
## [103] "chartevents_93"     "chartevents_94"     "chartevents_95"    
## [106] "chartevents_96"     "chartevents_97"     "chartevents_98"    
## [109] "chartevents_99"     "chartevents_100"    "chartevents_101"   
## [112] "chartevents_102"    "chartevents_103"    "chartevents_104"   
## [115] "chartevents_105"    "chartevents_106"    "chartevents_107"   
## [118] "chartevents_108"    "chartevents_109"    "chartevents_110"   
## [121] "chartevents_111"    "chartevents_112"    "chartevents_113"   
## [124] "chartevents_114"    "chartevents_115"    "chartevents_116"   
## [127] "chartevents_117"    "chartevents_118"    "chartevents_119"   
## [130] "chartevents_120"    "chartevents_121"    "chartevents_122"   
## [133] "chartevents_123"    "chartevents_124"    "chartevents_125"   
## [136] "chartevents_126"    "chartevents_127"    "chartevents_128"   
## [139] "chartevents_129"    "chartevents_130"    "chartevents_131"   
## [142] "chartevents_132"    "chartevents_133"    "chartevents_134"   
## [145] "chartevents_135"    "chartevents_136"    "chartevents_137"   
## [148] "chartevents_138"    "chartevents_139"    "chartevents_140"   
## [151] "chartevents_141"    "chartevents_142"    "chartevents_143"   
## [154] "chartevents_144"    "chartevents_145"    "chartevents_146"   
## [157] "chartevents_147"    "chartevents_148"    "chartevents_149"   
## [160] "icd"                "chartevents_150"    "chartevents_151"   
## [163] "chartevents_152"    "chartevents_153"    "chartevents_154"   
## [166] "chartevents_155"    "chartevents_156"    "chartevents_157"   
## [169] "chartevents_158"    "chartevents_159"    "chartevents_160"   
## [172] "chartevents_161"    "chartevents_162"    "chartevents_163"   
## [175] "chartevents_164"    "chartevents_165"    "chartevents_166"   
## [178] "chartevents_167"    "chartevents_168"    "chartevents_169"   
## [181] "chartevents_170"    "chartevents_171"    "chartevents_172"   
## [184] "chartevents_173"    "chartevents_174"    "chartevents_175"   
## [187] "chartevents_176"    "chartevents_177"    "chartevents_178"   
## [190] "chartevents_179"    "chartevents_180"    "chartevents_181"   
## [193] "chartevents_182"    "chartevents_183"    "chartevents_184"   
## [196] "chartevents_185"    "chartevents_186"    "chartevents_187"   
## [199] "chartevents_188"    "chartevents_189"    "chartevents_190"   
## [202] "chartevents_191"    "chartevents_192"    "chartevents_193"   
## [205] "chartevents_194"    "chartevents_195"    "chartevents_196"   
## [208] "chartevents_197"    "chartevents_198"    "chartevents_199"   
## [211] "chartevents_200"    "chartevents_201"    "chartevents_202"   
## [214] "chartevents_203"    "chartevents_204"    "chartevents_205"   
## [217] "chartevents_206"    "chartevents_207"    "cptevents"         
## [220] "datetimeevents"     "diagnoses_icd"      "drgcodes"          
## [223] "d_cpt"              "d_icd_diagnoses"    "d_icd_procedures"  
## [226] "d_items"            "d_labitems"         "icustays"          
## [229] "inputevents_cv"     "prescriptions"      "procedureevents_mv"
## [232] "procedures_icd"     "services"           "transfers"         
## [235] "dbplyr_449"         "dbplyr_224"         "dbplyr_343"

An example SQL query:

sql_query <- "SELECT i.subject_id, i.hadm_id, i.los
              FROM icustays i;"
(data <- dbGetQuery(con, sql_query)) %>% as_tibble()
## # A tibble: 61,532 x 3
##    subject_id hadm_id   los
##         <int>   <int> <dbl>
##  1        268  110404 3.25 
##  2        269  106296 3.28 
##  3        270  188028 2.89 
##  4        271  173727 2.06 
##  5        272  164716 1.62 
##  6        273  158689 1.49 
##  7        274  130546 8.81 
##  8        275  129886 7.13 
##  9        276  135156 1.34 
## 10        277  171601 0.731
## # … with 61,522 more rows
#This document shows how RMarkdown can be used to create a reproducible analysis using MIMIC-III (version 1.4). Let's calculate the median length of stay in the ICU and then include this value in our document.
(avg_los <- median(data$los, na.rm=TRUE))
## [1] 2.09225
(rounded_avg_los <- round(avg_los, digits = 2))
## [1] 2.09

Connect to a table in database

To connect to the patients table in the database:

patients <- tbl(con, "patients")
patients %>% print(width = Inf)
## # Source:   table<patients> [?? x 8]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
##    row_id subject_id gender dob                 dod                
##     <int>      <int> <chr>  <dttm>              <dttm>             
##  1    234        249 F      2075-03-13 00:00:00 NA                 
##  2    235        250 F      2164-12-27 00:00:00 2188-11-22 00:00:00
##  3    236        251 M      2090-03-15 00:00:00 NA                 
##  4    237        252 M      2078-03-06 00:00:00 NA                 
##  5    238        253 F      2089-11-26 00:00:00 NA                 
##  6    239        255 M      2109-08-05 00:00:00 NA                 
##  7    240        256 M      2086-07-31 00:00:00 NA                 
##  8    241        257 F      2031-04-03 00:00:00 2121-07-08 00:00:00
##  9    242        258 F      2124-09-19 00:00:00 NA                 
## 10    243        260 F      2105-03-23 00:00:00 NA                 
##    dod_hosp            dod_ssn             expire_flag
##    <dttm>              <dttm>                    <int>
##  1 NA                  NA                            0
##  2 2188-11-22 00:00:00 NA                            1
##  3 NA                  NA                            0
##  4 NA                  NA                            0
##  5 NA                  NA                            0
##  6 NA                  NA                            0
##  7 NA                  NA                            0
##  8 2121-07-08 00:00:00 2121-07-08 00:00:00           1
##  9 NA                  NA                            0
## 10 NA                  NA                            0
## # … with more rows

We can see this connection is lazy (does not load the whole table into computer memory)

str(patients)
## List of 2
##  $ src:List of 2
##   ..$ con  :Formal class 'PostgreSQLConnection' [package "RPostgreSQL"] with 1 slot
##   .. .. ..@ Id: int [1:2] 26388 0
##   ..$ disco: NULL
##   ..- attr(*, "class")= chr [1:4] "src_PostgreSQLConnection" "src_dbi" "src_sql" "src"
##  $ ops:List of 2
##   ..$ x   : 'ident' chr "patients"
##   ..$ vars: chr [1:8] "row_id" "subject_id" "gender" "dob" ...
##   ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
##  - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...

To load the entire table into memory we may use

collect(patients)
## # A tibble: 46,520 x 8
##    row_id subject_id gender dob                 dod                
##  *  <int>      <int> <chr>  <dttm>              <dttm>             
##  1    234        249 F      2075-03-13 00:00:00 NA                 
##  2    235        250 F      2164-12-27 00:00:00 2188-11-22 00:00:00
##  3    236        251 M      2090-03-15 00:00:00 NA                 
##  4    237        252 M      2078-03-06 00:00:00 NA                 
##  5    238        253 F      2089-11-26 00:00:00 NA                 
##  6    239        255 M      2109-08-05 00:00:00 NA                 
##  7    240        256 M      2086-07-31 00:00:00 NA                 
##  8    241        257 F      2031-04-03 00:00:00 2121-07-08 00:00:00
##  9    242        258 F      2124-09-19 00:00:00 NA                 
## 10    243        260 F      2105-03-23 00:00:00 NA                 
## # … with 46,510 more rows, and 3 more variables: dod_hosp <dttm>,
## #   dod_ssn <dttm>, expire_flag <int>

But keep in mind that the point of using a database software is that the data tables are potentially large and we prefer to use database to do on disk computations as much as possible. So in this assignment we will avoid loading whole tables into memory as much as we can.

Query and subsetting

In this section, we demo how to create a cohort of patients who were directly admitted into CCU and were diagnosed with heart attack.

First we create a (query) table of patients who were directly admitted into CCU.

tbl(con, "transfers") %>%
  select(subject_id, hadm_id, prev_careunit, curr_careunit) %>%
  filter(is.na(prev_careunit) & curr_careunit == "CCU") %>%
  select(subject_id, hadm_id) %>%
  distinct() %>%
  print() -> ccu_admissions
## # Source:   lazy query [?? x 2]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
##    subject_id hadm_id
##         <int>   <int>
##  1      42525  142743
##  2      23028  190183
##  3      24357  111518
##  4      23872  120477
##  5       9768  128541
##  6      23474  128369
##  7      30292  113876
##  8      19848  167716
##  9      12138  131429
## 10      10616  122738
## # … with more rows

Now we want to restrict to heart attack patients. To find all possible ICD-9 codes related to heart attack, we search for string myocardial infarction in the long_title of table d_icd_diagnoses:

tbl(con, "d_icd_diagnoses") %>%
  filter(str_detect(tolower(long_title), "myocardial infarction")) %>%
  print() -> mi_codes
## # Source:   lazy query [?? x 4]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
##    row_id icd9_code short_title        long_title                               
##     <int> <chr>     <chr>              <chr>                                    
##  1   4335 41000     AMI anterolateral… Acute myocardial infarction of anterolat…
##  2   4336 41001     AMI anterolateral… Acute myocardial infarction of anterolat…
##  3   4337 41002     AMI anterolateral… Acute myocardial infarction of anterolat…
##  4   4338 41010     AMI anterior wall… Acute myocardial infarction of other ant…
##  5   4339 41011     AMI anterior wall… Acute myocardial infarction of other ant…
##  6   4340 41012     AMI anterior wall… Acute myocardial infarction of other ant…
##  7   4341 41020     AMI inferolateral… Acute myocardial infarction of inferolat…
##  8   4342 41021     AMI inferolateral… Acute myocardial infarction of inferolat…
##  9   4343 41022     AMI inferolateral… Acute myocardial infarction of inferolat…
## 10   4344 41030     AMI inferopost, u… Acute myocardial infarction of inferopos…
## # … with more rows

diagnoses_icd table stores the diagnosis of each admission. We use semi_join() to keep the rows in diagnoses_icd that match the ICD-9 codes related to heart attack:

tbl(con, "diagnoses_icd") %>%
  semi_join(mi_codes, by = "icd9_code") %>%
  print() -> mi_admissions
## # Source:   lazy query [?? x 5]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
##    row_id subject_id hadm_id seq_num icd9_code
##     <int>      <int>   <int>   <int> <chr>    
##  1 175785      15805  188616       1 41000    
##  2 281839      25208  167918       7 41000    
##  3    594         73  194730       1 41001    
##  4   6395        543  115307       1 41001    
##  5   6630        571  193189       1 41001    
##  6  11015        947  122379       1 41001    
##  7  12916       1114  164691       2 41001    
##  8  14945       1317  198886       1 41001    
##  9  18340       1626  117062       1 41001    
## 10  30343       2700  100335       1 41001    
## # … with more rows

MI may not be listed as the principal diagnosis; as explained in the documentation for the patients table, the seq_num field is a priority ranking for the diagnoses generated at the end of stay. In order to focus on patients for whom MI was central to their hospitalization, we will include records with MI in any of the first five diagnosis positions, according to the seq_num field. To avoid duplicate admissions, we use group_by() and top_n() to limit the query to the first MI diagnosis for each admission.

mi_admissions %>%
  filter(seq_num <= 5) %>%
  group_by(subject_id, hadm_id) %>%
  # top_n(1, wt = seq_num) %>% #  not working. bug? use following as workaround
  filter(min_rank(seq_num) <= 1) %>%
  ungroup() %>%
  select(subject_id, hadm_id, icd9_code, seq_num) %>%
  print() -> mi_admissions
## Warning: `lang_name()` is deprecated as of rlang 0.2.0.
## Please use `call_name()` instead.
## This warning is displayed once per session.
## Warning: `lang()` is deprecated as of rlang 0.2.0.
## Please use `call2()` instead.
## This warning is displayed once per session.
## # Source:   lazy query [?? x 4]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
##    subject_id hadm_id icd9_code seq_num
##         <int>   <int> <chr>       <int>
##  1         24  161859 41041           1
##  2         28  162569 412             4
##  3         42  119203 412             4
##  4         53  155385 41021           1
##  5         73  194730 41001           1
##  6         79  181542 41011           1
##  7        108  123552 412             5
##  8        111  192123 41081           5
##  9        123  195632 41011           1
## 10        149  154869 41011           1
## # … with more rows

Now we inner_join the table of admissions to CCU and the table of admissions that include MI diagnosis.

ccu_admissions %>%
  inner_join(mi_admissions, by = c("subject_id", "hadm_id")) %>%
  print() -> study_admissions
## # Source:   lazy query [?? x 4]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
##    subject_id hadm_id icd9_code seq_num
##         <int>   <int> <chr>       <int>
##  1         24  161859 41041           1
##  2         42  119203 412             4
##  3         53  155385 41021           1
##  4         73  194730 41001           1
##  5        111  192123 41081           5
##  6        123  195632 41011           1
##  7        149  154869 41011           1
##  8        154  111735 41041           1
##  9        158  169433 412             4
## 10        160  161672 41041           1
## # … with more rows

Transform and augment query tables

Now we create a logical variable indicating the MI is the principal diagonosis or not (according to seq_num).

study_admissions %>%
  mutate(principal_dx = seq_num == 1) %>%
  select(-seq_num) %>%
  print() -> study_admissions
## # Source:   lazy query [?? x 4]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
##    subject_id hadm_id icd9_code principal_dx
##         <int>   <int> <chr>     <lgl>       
##  1         24  161859 41041     TRUE        
##  2         42  119203 412       FALSE       
##  3         53  155385 41021     TRUE        
##  4         73  194730 41001     TRUE        
##  5        111  192123 41081     FALSE       
##  6        123  195632 41011     TRUE        
##  7        149  154869 41011     TRUE        
##  8        154  111735 41041     TRUE        
##  9        158  169433 412       FALSE       
## 10        160  161672 41041     TRUE        
## # … with more rows

We want to add information about the severity of patients’ ailments. The drgcodes table contains, for DRG codes from the All Payers Registry (APR), severity and mortality indicators. We pull the drug severity information and right-join it to our query table.

tbl(con, "drgcodes") %>%
  filter(str_detect(drg_type, "APR")) %>%
  select(subject_id, hadm_id, drg_severity) %>%
  right_join(study_admissions, by = c("subject_id", "hadm_id")) %>%
  mutate(drg_severity = ifelse(is.na(drg_severity), 1, drg_severity)) %>%
  print() -> study_admissions
## # Source:   lazy query [?? x 5]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
##    subject_id hadm_id drg_severity icd9_code principal_dx
##         <int>   <int>        <dbl> <chr>     <lgl>       
##  1         24  161859            1 41041     TRUE        
##  2         42  119203            1 412       FALSE       
##  3         53  155385            1 41021     TRUE        
##  4         73  194730            1 41001     TRUE        
##  5        111  192123            4 41081     FALSE       
##  6        123  195632            1 41011     TRUE        
##  7        149  154869            1 41011     TRUE        
##  8        154  111735            1 41041     TRUE        
##  9        158  169433            1 412       FALSE       
## 10        160  161672            1 41041     TRUE        
## # … with more rows

Pull the admission time admittime, discharge time dischtime, date of birth dob, and date of death dod. We are interested in the mortaility rate 30 days after discharge. So we only keep patients who didn’t die in hospital.

study_admissions %>%
  left_join(
    select(tbl(con, "admissions"),
           subject_id, hadm_id, admittime, dischtime, hospital_expire_flag
    ), by = c("subject_id", "hadm_id")
  ) %>%
  filter(hospital_expire_flag == 0) %>% # patients who did not die in hospital
  select(-hospital_expire_flag) %>%
  left_join(
    select(tbl(con, "patients"), subject_id, dob, dod),
    by = "subject_id"
  ) %>%
  print(width = Inf) -> study_admissions
## # Source:   lazy query [?? x 9]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
##    subject_id hadm_id drg_severity icd9_code principal_dx admittime          
##         <int>   <int>        <dbl> <chr>     <lgl>        <dttm>             
##  1         24  161859            1 41041     TRUE         2139-06-06 16:14:00
##  2         42  119203            1 412       FALSE        2116-04-26 18:58:00
##  3         53  155385            1 41021     TRUE         2170-03-20 20:20:00
##  4         73  194730            1 41001     TRUE         2170-12-22 00:12:00
##  5        111  192123            4 41081     FALSE        2142-04-24 06:55:00
##  6        123  195632            1 41011     TRUE         2161-10-29 02:27:00
##  7        149  154869            1 41011     TRUE         2135-02-18 19:16:00
##  8        154  111735            1 41041     TRUE         2117-12-29 21:36:00
##  9        158  169433            1 412       FALSE        2170-02-03 11:38:00
## 10        160  161672            1 41041     TRUE         2174-11-06 08:38:00
##    dischtime           dob                 dod                
##    <dttm>              <dttm>              <dttm>             
##  1 2139-06-09 12:48:00 2100-05-31 00:00:00 NA                 
##  2 2116-04-30 18:16:00 2055-02-25 00:00:00 2121-10-17 00:00:00
##  3 2170-03-23 18:00:00 2124-08-31 00:00:00 NA                 
##  4 2170-12-26 12:00:00 2113-05-22 00:00:00 NA                 
##  5 2142-05-05 11:45:00 2075-07-16 00:00:00 2144-07-01 00:00:00
##  6 2161-11-01 11:37:00 2105-08-31 00:00:00 NA                 
##  7 2135-02-26 17:00:00 1835-02-18 00:00:00 2135-03-02 00:00:00
##  8 2118-01-01 14:27:00 2073-07-26 00:00:00 NA                 
##  9 2170-02-06 16:49:00 2102-02-26 00:00:00 NA                 
## 10 2174-11-08 16:35:00 2124-11-22 00:00:00 NA                 
## # … with more rows

To add age (at admission) variable into the table. The documentation for the patients table explains that patients of 90 years and older had their ages artificially inflated, so we remove these patients from the analysis.

study_admissions %>%
  mutate(tt_death = date_part("day", dod) - date_part("day", dischtime)) %>%
  mutate(mortality = tt_death <= 30) %>%
  mutate(age = date_part("year", admittime) - date_part("year", dob)) %>%
  filter(age < 90) %>%
  mutate(age = age - ifelse(
    date_part("month", admittime) < date_part("month", dob) |
      (
        date_part("month", admittime) == date_part("month", dob) &
          date_part("day", admittime) < date_part("day", dob)
      ),
    1,
    0
  )) %>%
  select(-admittime, -dischtime, -dob, -dod, -tt_death) %>%
  select(subject_id, hadm_id, age, mortality, everything()) %>%
  print() -> study_admissions
## # Source:   lazy query [?? x 7]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
##    subject_id hadm_id   age mortality drg_severity icd9_code principal_dx
##         <int>   <int> <dbl> <lgl>            <dbl> <chr>     <lgl>       
##  1         24  161859    39 NA                   1 41041     TRUE        
##  2         42  119203    61 TRUE                 1 412       FALSE       
##  3         53  155385    45 NA                   1 41021     TRUE        
##  4         73  194730    57 NA                   1 41001     TRUE        
##  5        111  192123    66 TRUE                 4 41081     FALSE       
##  6        123  195632    56 NA                   1 41011     TRUE        
##  7        154  111735    44 NA                   1 41041     TRUE        
##  8        158  169433    67 NA                   1 412       FALSE       
##  9        160  161672    49 NA                   1 41041     TRUE        
## 10        194  124794    47 NA                   1 41041     TRUE        
## # … with more rows

Many mortality indicators are missing, due to neither the hospital database nor the social security database having a record of these patients’ deaths. We could convert these to FALSE values, but it may be helpful to retain in the analytic table this information on whether deaths were recorded at all, e.g. for validation or sensitivity testing.

Finally, let’s merge some demographic information (ethnicity, gender) into our study study_admissions.

tbl(con, "admissions") %>%
  select(subject_id, ethnicity) %>%
  distinct() %>%
  print() -> study_subjects
## # Source:   lazy query [?? x 2]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
##    subject_id ethnicity             
##         <int> <chr>                 
##  1       1421 BLACK/AFRICAN AMERICAN
##  2      68255 ASIAN                 
##  3       4901 WHITE                 
##  4      81232 WHITE                 
##  5      28292 WHITE                 
##  6      21741 WHITE                 
##  7      12175 BLACK/AFRICAN AMERICAN
##  8        341 BLACK/AFRICAN AMERICAN
##  9      48629 UNABLE TO OBTAIN      
## 10      18688 HISPANIC OR LATINO    
## # … with more rows
tbl(con, "patients") %>%
  select(subject_id, gender) %>%
  distinct() %>%
  full_join(study_subjects, by = "subject_id") %>%
  print() -> study_subjects
## # Source:   lazy query [?? x 3]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
##    subject_id gender ethnicity             
##         <int> <chr>  <chr>                 
##  1      26121 M      WHITE                 
##  2       7489 F      WHITE                 
##  3      81661 M      WHITE                 
##  4      30273 F      WHITE                 
##  5      55935 F      BLACK/AFRICAN AMERICAN
##  6      63914 M      UNABLE TO OBTAIN      
##  7      19393 M      WHITE                 
##  8      15861 M      WHITE                 
##  9      12289 F      WHITE                 
## 10      31927 F      PORTUGUESE            
## # … with more rows
study_subjects %>%
  semi_join(study_admissions, by = "subject_id") %>%
  print() -> study_subjects
## # Source:   lazy query [?? x 3]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
##    subject_id gender ethnicity            
##         <int> <chr>  <chr>                
##  1      11418 M      UNKNOWN/NOT SPECIFIED
##  2      11787 F      WHITE                
##  3      28951 M      WHITE                
##  4      24780 M      HISPANIC OR LATINO   
##  5      30647 M      UNABLE TO OBTAIN     
##  6       5960 F      UNKNOWN/NOT SPECIFIED
##  7      23893 M      WHITE                
##  8      81371 M      WHITE                
##  9       3606 M      UNKNOWN/NOT SPECIFIED
## 10      58438 F      OTHER                
## # … with more rows

Let’s resolves ome diversity and inconsistency in the ethnicity field:

unknown_ethnicity <- c(
  "OTHER",
  "UNABLE TO OBTAIN",
  "UNKNOWN/NOT SPECIFIED",
  "MULTI RACE ETHNICITY",
  "PATIENT DECLINED TO ANSWER",
  "UNKNOWN"
)

study_subjects %>%
  collect() %>%
  mutate(ethnic_group = case_when(
    str_detect(ethnicity, "^ASIAN") ~ "ASIAN",
    str_detect(ethnicity, "^BLACK") ~ "BLACK",
    str_detect(ethnicity, "^HISPANIC") ~ "HISPANIC",
    str_detect(ethnicity, "^WHITE") ~ "WHITE",
    ethnicity %in% unknown_ethnicity ~ NA_character_,
    TRUE ~ NA_character_
  )) %>%
  select(subject_id, gender, ethnic_group) %>%
  print() -> study_subjects
## # A tibble: 1,203 x 3
##    subject_id gender ethnic_group
##         <int> <chr>  <chr>       
##  1         24 M      WHITE       
##  2         42 M      <NA>        
##  3         53 M      <NA>        
##  4         73 F      WHITE       
##  5        111 F      WHITE       
##  6        123 M      HISPANIC    
##  7        154 M      WHITE       
##  8        158 M      WHITE       
##  9        160 F      WHITE       
## 10        194 M      BLACK       
## # … with 1,193 more rows

Some patients are coded as belonging to more than one ethnic group. To resolve these inconsistencies, we define a helper function to pick the modal value from a vector of values in R, which can be used by the summarize() function to choose one ethnic group for each patient.

most <- function(x) {
  if (all(is.na(x))) return(NA_character_)
  y <- table(x, useNA = "no")
  if (length(which(y == max(y))) > 1) return(NA_character_)
  return(names(y)[which.max(y)])
}

study_subjects %>%
  group_by(subject_id) %>%
  summarize(ethnic_group = most(ethnic_group)) %>%
  ungroup() %>%
  mutate(ethnic_group = ifelse(is.na(ethnic_group), "UNKNOWN", ethnic_group)) %>%
  print() -> subject_ethnic_groups
## # A tibble: 1,188 x 2
##    subject_id ethnic_group
##         <int> <chr>       
##  1         24 WHITE       
##  2         42 UNKNOWN     
##  3         53 UNKNOWN     
##  4         73 WHITE       
##  5        111 WHITE       
##  6        123 HISPANIC    
##  7        154 WHITE       
##  8        158 WHITE       
##  9        160 WHITE       
## 10        194 BLACK       
## # … with 1,178 more rows
study_subjects %>%
  select(subject_id, gender) %>%
  left_join(subject_ethnic_groups, by = "subject_id") %>%
  print() -> study_subjects
## # A tibble: 1,203 x 3
##    subject_id gender ethnic_group
##         <int> <chr>  <chr>       
##  1         24 M      WHITE       
##  2         42 M      UNKNOWN     
##  3         53 M      UNKNOWN     
##  4         73 F      WHITE       
##  5        111 F      WHITE       
##  6        123 M      HISPANIC    
##  7        154 M      WHITE       
##  8        158 M      WHITE       
##  9        160 F      WHITE       
## 10        194 M      BLACK       
## # … with 1,193 more rows

Now we add the demographic information gender and ethnicity into our study_admissions table:

study_admissions %>%
  left_join(study_subjects, by = "subject_id", copy = TRUE) %>%
  print() -> study_admissions
## # Source:   lazy query [?? x 9]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
##    subject_id hadm_id   age mortality drg_severity icd9_code principal_dx gender
##         <int>   <int> <dbl> <lgl>            <dbl> <chr>     <lgl>        <chr> 
##  1         24  161859    39 NA                   1 41041     TRUE         M     
##  2         42  119203    61 TRUE                 1 412       FALSE        M     
##  3         53  155385    45 NA                   1 41021     TRUE         M     
##  4         73  194730    57 NA                   1 41001     TRUE         F     
##  5        111  192123    66 TRUE                 4 41081     FALSE        F     
##  6        123  195632    56 NA                   1 41011     TRUE         M     
##  7        154  111735    44 NA                   1 41041     TRUE         M     
##  8        158  169433    67 NA                   1 412       FALSE        M     
##  9        160  161672    49 NA                   1 41041     TRUE         F     
## 10        194  124794    47 NA                   1 41041     TRUE         M     
## # … with more rows, and 1 more variable: ethnic_group <chr>

Close the connection to a database

Close the connection:

dbDisconnect(con)
## [1] TRUE

CONSORT Flow Diagrams

CONSORT Flow Diagrams can be used to plot the flow of data selection of a patient cohort.
For more details, see: The CONSORT Flow Diagram. Following code shows an example.

library(shape)
library(diagram)

# set margins and multiplot
par(mfrow = c(1, 1))
par(mar = c(0, 0, 0, 0))

# initialise a plot device
openplotmat()

# position of boxes
# 1st column indicates x axis position between 0 and 1
# 2nd column indicates y axis position between 0 and 1
# automatically assigns vertical position
num_of_boxes <- 6
auto_coords = coordinates(num_of_boxes)
vert_pos = rev(auto_coords[,1])
box_pos <- matrix(nrow = num_of_boxes, ncol = 2, data = 0)
box_pos[1,] = c(0.20, vert_pos[1]) # 1st box
box_pos[2,] = c(0.70, vert_pos[2]) # 2nd box
box_pos[3,] = c(0.70, vert_pos[3]) # 3rd box
box_pos[4,] = c(0.70, vert_pos[4]) # etc...
box_pos[5,] = c(0.70, vert_pos[5])
box_pos[6,] = c(0.20, vert_pos[6])

# content of boxes
box_content <- matrix(nrow = num_of_boxes, ncol = 1, data = 0)
box_content[1] = "All patients in MIMIC-III \n n = 58,976" # 1st box
box_content[2] = "Exclude patients of age < 18 \n n = 8,180" # 2nd box
box_content[3] = "Exclude patients with no ICU admissions \n n = 1,071" # 3rd box
box_content[4] = "Exclude patients with diabetes \n n = 1,324" # etc...
box_content[5] = "Exclude patients with sepsis \n n = 4,804"
box_content[6] = "Study cohort \n n = 43,597"

# adjust the size of boxes to fit content
box_x <- c(0.20, 0.25, 0.25, 0.25, 0.25, 0.20)
box_y <- c(0.07, 0.07, 0.07, 0.07, 0.07, 0.07)

# Draw the arrows
straightarrow(from = c(box_pos[1,1],box_pos[2,2]), to = box_pos[2,], lwd = 1)  
straightarrow(from = c(box_pos[1,1],box_pos[3,2]), to = box_pos[3,], lwd = 1)  
straightarrow(from = c(box_pos[1,1],box_pos[4,2]), to = box_pos[4,], lwd = 1)  
straightarrow(from = c(box_pos[1,1],box_pos[5,2]), to = box_pos[5,], lwd = 1)  
straightarrow(from = box_pos[1,], to = box_pos[6,], lwd = 1)

# Draw the boxes
for (i in 1:num_of_boxes) {
  textrect(mid = box_pos[i,], radx = box_x[i], rady = box_y[i], 
           lab = box_content[i], 
           shadow.col = "grey")
  }

Install PostgreSQL and RpostgreSQL package on CentOS

This is a note to myself (Dr. Hua Zhou). The postgres in yum is an old version. We want to install the most recent postgres and then build the RpostgreSQL package based on it.

  1. Follow instructions in https://www.digitalocean.com/community/tutorials/how-to-install-and-use-postgresql-on-centos-7 to install PostgreSQL 11 on CentOS 7.

  2. Issue following command to install RPostgreSQL in R:

sudo R -e 'Sys.setenv(PG_INCDIR="/usr/pgsql-11/include/"); Sys.setenv(PG_LIBDIR="/usr/pgsql-11/lib/"); install.packages("RPostgreSQL")'