In this assignment, you are to write a report analyzing the electronic health record (EHR) data MIMIC-III. You report will demostrate your knowledge of working with PostgreSQL database, data visualization, and commonly used analytical methods such as logistic regression and neural network. Your report should include at least following parts:
An informative title. For example, 30-Day Mortality Rate of Myocardia Infarction Patients Admitted to CCU.
Introduction. Describe the MIMIC-III data set and what research hypothesis/goal you are to address using this data.
Data preparation. Create a study cohort from MIMIC-III corresponding to your research hypothesis/goal. See the examplary code below. Use a CONSORT flow diagram to summarize your steps to create the cohort.
Data visualization. Use visualization to summarize the cohort you created.
Analytics. Use at least two analytical approaches to address your research hypothesis/goal. For example, we can use (1) logistic regression and (2) neural network to build a predictive model for the 30-day mortality rate of patients admitted into CCU and compare their predictive performance. Summarize your results in graphs.
Conclusions.
Secondary Analysis of Electronic Health Records: https://link.springer.com/book/10.1007/978-3-319-43742-2
The Book of OHDSI: https://ohdsi.github.io/TheBookOfOhdsi/.
The GitHub repository https://github.com/MIT-LCP/mimic-code contains some code examples for working with the MIMIC-III data. Following sample code derives from https://github.com/MIT-LCP/mimic-code/blob/master/tutorials/dplyr-frontend/intro.md.
Load database libraries and the tidyverse frontend:
library(DBI)
library(RPostgreSQL)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
Credentials for using PostgreSQL database. We are going to use username postgres
with password postgres
to access the mimic
database in the schemee mimiciii
.
# Load configuration settings
dbdriver <- 'PostgreSQL'
#host <- '127.0.0.1'
#port <- '5432'
user <- 'postgres'
password <- 'postgres'
dbname <- 'mimic'
schema <- 'mimiciii'
# Connect to the database using the configuration settings
con <- dbConnect(RPostgreSQL::PostgreSQL(),
dbname = dbname,
#host = host,
#port = port,
user = user,
password = password)
# Set the default schema
dbExecute(con, paste("SET search_path TO ", schema, sep=" "))
## [1] 0
con
## <PostgreSQLConnection>
List tables in the mimic
database:
dbListTables(con)
## [1] "admissions" "callout" "caregivers"
## [4] "chartevents_1" "chartevents" "chartevents_2"
## [7] "chartevents_3" "chartevents_4" "chartevents_5"
## [10] "chartevents_6" "chartevents_7" "chartevents_8"
## [13] "chartevents_9" "chartevents_10" "inputevents_mv"
## [16] "labevents" "chartevents_11" "chartevents_12"
## [19] "chartevents_13" "chartevents_14" "microbiologyevents"
## [22] "chartevents_15" "chartevents_16" "chartevents_17"
## [25] "chartevents_18" "chartevents_19" "chartevents_20"
## [28] "chartevents_21" "chartevents_22" "chartevents_23"
## [31] "noteevents" "outputevents" "chartevents_24"
## [34] "chartevents_25" "chartevents_26" "chartevents_27"
## [37] "chartevents_28" "patients" "chartevents_29"
## [40] "chartevents_30" "chartevents_31" "chartevents_32"
## [43] "chartevents_33" "chartevents_34" "chartevents_35"
## [46] "chartevents_36" "chartevents_37" "chartevents_38"
## [49] "chartevents_39" "chartevents_40" "chartevents_41"
## [52] "chartevents_42" "chartevents_43" "chartevents_44"
## [55] "chartevents_45" "chartevents_46" "chartevents_47"
## [58] "chartevents_48" "chartevents_49" "chartevents_50"
## [61] "chartevents_51" "chartevents_52" "chartevents_53"
## [64] "chartevents_54" "chartevents_55" "chartevents_56"
## [67] "chartevents_57" "chartevents_58" "chartevents_59"
## [70] "chartevents_60" "chartevents_61" "chartevents_62"
## [73] "chartevents_63" "chartevents_64" "chartevents_65"
## [76] "chartevents_66" "chartevents_67" "chartevents_68"
## [79] "chartevents_69" "chartevents_70" "chartevents_71"
## [82] "chartevents_72" "chartevents_73" "chartevents_74"
## [85] "chartevents_75" "chartevents_76" "chartevents_77"
## [88] "chartevents_78" "chartevents_79" "chartevents_80"
## [91] "chartevents_81" "chartevents_82" "chartevents_83"
## [94] "chartevents_84" "chartevents_85" "chartevents_86"
## [97] "chartevents_87" "chartevents_88" "chartevents_89"
## [100] "chartevents_90" "chartevents_91" "chartevents_92"
## [103] "chartevents_93" "chartevents_94" "chartevents_95"
## [106] "chartevents_96" "chartevents_97" "chartevents_98"
## [109] "chartevents_99" "chartevents_100" "chartevents_101"
## [112] "chartevents_102" "chartevents_103" "chartevents_104"
## [115] "chartevents_105" "chartevents_106" "chartevents_107"
## [118] "chartevents_108" "chartevents_109" "chartevents_110"
## [121] "chartevents_111" "chartevents_112" "chartevents_113"
## [124] "chartevents_114" "chartevents_115" "chartevents_116"
## [127] "chartevents_117" "chartevents_118" "chartevents_119"
## [130] "chartevents_120" "chartevents_121" "chartevents_122"
## [133] "chartevents_123" "chartevents_124" "chartevents_125"
## [136] "chartevents_126" "chartevents_127" "chartevents_128"
## [139] "chartevents_129" "chartevents_130" "chartevents_131"
## [142] "chartevents_132" "chartevents_133" "chartevents_134"
## [145] "chartevents_135" "chartevents_136" "chartevents_137"
## [148] "chartevents_138" "chartevents_139" "chartevents_140"
## [151] "chartevents_141" "chartevents_142" "chartevents_143"
## [154] "chartevents_144" "chartevents_145" "chartevents_146"
## [157] "chartevents_147" "chartevents_148" "chartevents_149"
## [160] "icd" "chartevents_150" "chartevents_151"
## [163] "chartevents_152" "chartevents_153" "chartevents_154"
## [166] "chartevents_155" "chartevents_156" "chartevents_157"
## [169] "chartevents_158" "chartevents_159" "chartevents_160"
## [172] "chartevents_161" "chartevents_162" "chartevents_163"
## [175] "chartevents_164" "chartevents_165" "chartevents_166"
## [178] "chartevents_167" "chartevents_168" "chartevents_169"
## [181] "chartevents_170" "chartevents_171" "chartevents_172"
## [184] "chartevents_173" "chartevents_174" "chartevents_175"
## [187] "chartevents_176" "chartevents_177" "chartevents_178"
## [190] "chartevents_179" "chartevents_180" "chartevents_181"
## [193] "chartevents_182" "chartevents_183" "chartevents_184"
## [196] "chartevents_185" "chartevents_186" "chartevents_187"
## [199] "chartevents_188" "chartevents_189" "chartevents_190"
## [202] "chartevents_191" "chartevents_192" "chartevents_193"
## [205] "chartevents_194" "chartevents_195" "chartevents_196"
## [208] "chartevents_197" "chartevents_198" "chartevents_199"
## [211] "chartevents_200" "chartevents_201" "chartevents_202"
## [214] "chartevents_203" "chartevents_204" "chartevents_205"
## [217] "chartevents_206" "chartevents_207" "cptevents"
## [220] "datetimeevents" "diagnoses_icd" "drgcodes"
## [223] "d_cpt" "d_icd_diagnoses" "d_icd_procedures"
## [226] "d_items" "d_labitems" "icustays"
## [229] "inputevents_cv" "prescriptions" "procedureevents_mv"
## [232] "procedures_icd" "services" "transfers"
## [235] "dbplyr_449" "dbplyr_224" "dbplyr_343"
An example SQL query:
sql_query <- "SELECT i.subject_id, i.hadm_id, i.los
FROM icustays i;"
(data <- dbGetQuery(con, sql_query)) %>% as_tibble()
## # A tibble: 61,532 x 3
## subject_id hadm_id los
## <int> <int> <dbl>
## 1 268 110404 3.25
## 2 269 106296 3.28
## 3 270 188028 2.89
## 4 271 173727 2.06
## 5 272 164716 1.62
## 6 273 158689 1.49
## 7 274 130546 8.81
## 8 275 129886 7.13
## 9 276 135156 1.34
## 10 277 171601 0.731
## # … with 61,522 more rows
#This document shows how RMarkdown can be used to create a reproducible analysis using MIMIC-III (version 1.4). Let's calculate the median length of stay in the ICU and then include this value in our document.
(avg_los <- median(data$los, na.rm=TRUE))
## [1] 2.09225
(rounded_avg_los <- round(avg_los, digits = 2))
## [1] 2.09
To connect to the patients
table in the database:
patients <- tbl(con, "patients")
patients %>% print(width = Inf)
## # Source: table<patients> [?? x 8]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
## row_id subject_id gender dob dod
## <int> <int> <chr> <dttm> <dttm>
## 1 234 249 F 2075-03-13 00:00:00 NA
## 2 235 250 F 2164-12-27 00:00:00 2188-11-22 00:00:00
## 3 236 251 M 2090-03-15 00:00:00 NA
## 4 237 252 M 2078-03-06 00:00:00 NA
## 5 238 253 F 2089-11-26 00:00:00 NA
## 6 239 255 M 2109-08-05 00:00:00 NA
## 7 240 256 M 2086-07-31 00:00:00 NA
## 8 241 257 F 2031-04-03 00:00:00 2121-07-08 00:00:00
## 9 242 258 F 2124-09-19 00:00:00 NA
## 10 243 260 F 2105-03-23 00:00:00 NA
## dod_hosp dod_ssn expire_flag
## <dttm> <dttm> <int>
## 1 NA NA 0
## 2 2188-11-22 00:00:00 NA 1
## 3 NA NA 0
## 4 NA NA 0
## 5 NA NA 0
## 6 NA NA 0
## 7 NA NA 0
## 8 2121-07-08 00:00:00 2121-07-08 00:00:00 1
## 9 NA NA 0
## 10 NA NA 0
## # … with more rows
We can see this connection is lazy (does not load the whole table into computer memory)
str(patients)
## List of 2
## $ src:List of 2
## ..$ con :Formal class 'PostgreSQLConnection' [package "RPostgreSQL"] with 1 slot
## .. .. ..@ Id: int [1:2] 26388 0
## ..$ disco: NULL
## ..- attr(*, "class")= chr [1:4] "src_PostgreSQLConnection" "src_dbi" "src_sql" "src"
## $ ops:List of 2
## ..$ x : 'ident' chr "patients"
## ..$ vars: chr [1:8] "row_id" "subject_id" "gender" "dob" ...
## ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
## - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
To load the entire table into memory we may use
collect(patients)
## # A tibble: 46,520 x 8
## row_id subject_id gender dob dod
## * <int> <int> <chr> <dttm> <dttm>
## 1 234 249 F 2075-03-13 00:00:00 NA
## 2 235 250 F 2164-12-27 00:00:00 2188-11-22 00:00:00
## 3 236 251 M 2090-03-15 00:00:00 NA
## 4 237 252 M 2078-03-06 00:00:00 NA
## 5 238 253 F 2089-11-26 00:00:00 NA
## 6 239 255 M 2109-08-05 00:00:00 NA
## 7 240 256 M 2086-07-31 00:00:00 NA
## 8 241 257 F 2031-04-03 00:00:00 2121-07-08 00:00:00
## 9 242 258 F 2124-09-19 00:00:00 NA
## 10 243 260 F 2105-03-23 00:00:00 NA
## # … with 46,510 more rows, and 3 more variables: dod_hosp <dttm>,
## # dod_ssn <dttm>, expire_flag <int>
But keep in mind that the point of using a database software is that the data tables are potentially large and we prefer to use database to do on disk computations as much as possible. So in this assignment we will avoid loading whole tables into memory as much as we can.
In this section, we demo how to create a cohort of patients who were directly admitted into CCU and were diagnosed with heart attack.
First we create a (query) table of patients who were directly admitted into CCU.
tbl(con, "transfers") %>%
select(subject_id, hadm_id, prev_careunit, curr_careunit) %>%
filter(is.na(prev_careunit) & curr_careunit == "CCU") %>%
select(subject_id, hadm_id) %>%
distinct() %>%
print() -> ccu_admissions
## # Source: lazy query [?? x 2]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
## subject_id hadm_id
## <int> <int>
## 1 42525 142743
## 2 23028 190183
## 3 24357 111518
## 4 23872 120477
## 5 9768 128541
## 6 23474 128369
## 7 30292 113876
## 8 19848 167716
## 9 12138 131429
## 10 10616 122738
## # … with more rows
Now we want to restrict to heart attack patients. To find all possible ICD-9 codes related to heart attack, we search for string myocardial infarction
in the long_title
of table d_icd_diagnoses
:
tbl(con, "d_icd_diagnoses") %>%
filter(str_detect(tolower(long_title), "myocardial infarction")) %>%
print() -> mi_codes
## # Source: lazy query [?? x 4]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
## row_id icd9_code short_title long_title
## <int> <chr> <chr> <chr>
## 1 4335 41000 AMI anterolateral… Acute myocardial infarction of anterolat…
## 2 4336 41001 AMI anterolateral… Acute myocardial infarction of anterolat…
## 3 4337 41002 AMI anterolateral… Acute myocardial infarction of anterolat…
## 4 4338 41010 AMI anterior wall… Acute myocardial infarction of other ant…
## 5 4339 41011 AMI anterior wall… Acute myocardial infarction of other ant…
## 6 4340 41012 AMI anterior wall… Acute myocardial infarction of other ant…
## 7 4341 41020 AMI inferolateral… Acute myocardial infarction of inferolat…
## 8 4342 41021 AMI inferolateral… Acute myocardial infarction of inferolat…
## 9 4343 41022 AMI inferolateral… Acute myocardial infarction of inferolat…
## 10 4344 41030 AMI inferopost, u… Acute myocardial infarction of inferopos…
## # … with more rows
diagnoses_icd
table stores the diagnosis of each admission. We use semi_join()
to keep the rows in diagnoses_icd
that match the ICD-9 codes related to heart attack:
tbl(con, "diagnoses_icd") %>%
semi_join(mi_codes, by = "icd9_code") %>%
print() -> mi_admissions
## # Source: lazy query [?? x 5]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
## row_id subject_id hadm_id seq_num icd9_code
## <int> <int> <int> <int> <chr>
## 1 175785 15805 188616 1 41000
## 2 281839 25208 167918 7 41000
## 3 594 73 194730 1 41001
## 4 6395 543 115307 1 41001
## 5 6630 571 193189 1 41001
## 6 11015 947 122379 1 41001
## 7 12916 1114 164691 2 41001
## 8 14945 1317 198886 1 41001
## 9 18340 1626 117062 1 41001
## 10 30343 2700 100335 1 41001
## # … with more rows
MI may not be listed as the principal diagnosis; as explained in the documentation for the patients
table, the seq_num
field is a priority ranking for the diagnoses generated at the end of stay. In order to focus on patients for whom MI was central to their hospitalization, we will include records with MI in any of the first five diagnosis positions, according to the seq_num
field. To avoid duplicate admissions, we use group_by()
and top_n()
to limit the query to the first MI diagnosis for each admission.
mi_admissions %>%
filter(seq_num <= 5) %>%
group_by(subject_id, hadm_id) %>%
# top_n(1, wt = seq_num) %>% # not working. bug? use following as workaround
filter(min_rank(seq_num) <= 1) %>%
ungroup() %>%
select(subject_id, hadm_id, icd9_code, seq_num) %>%
print() -> mi_admissions
## Warning: `lang_name()` is deprecated as of rlang 0.2.0.
## Please use `call_name()` instead.
## This warning is displayed once per session.
## Warning: `lang()` is deprecated as of rlang 0.2.0.
## Please use `call2()` instead.
## This warning is displayed once per session.
## # Source: lazy query [?? x 4]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
## subject_id hadm_id icd9_code seq_num
## <int> <int> <chr> <int>
## 1 24 161859 41041 1
## 2 28 162569 412 4
## 3 42 119203 412 4
## 4 53 155385 41021 1
## 5 73 194730 41001 1
## 6 79 181542 41011 1
## 7 108 123552 412 5
## 8 111 192123 41081 5
## 9 123 195632 41011 1
## 10 149 154869 41011 1
## # … with more rows
Now we inner_join
the table of admissions to CCU and the table of admissions that include MI diagnosis.
ccu_admissions %>%
inner_join(mi_admissions, by = c("subject_id", "hadm_id")) %>%
print() -> study_admissions
## # Source: lazy query [?? x 4]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
## subject_id hadm_id icd9_code seq_num
## <int> <int> <chr> <int>
## 1 24 161859 41041 1
## 2 42 119203 412 4
## 3 53 155385 41021 1
## 4 73 194730 41001 1
## 5 111 192123 41081 5
## 6 123 195632 41011 1
## 7 149 154869 41011 1
## 8 154 111735 41041 1
## 9 158 169433 412 4
## 10 160 161672 41041 1
## # … with more rows
Now we create a logical variable indicating the MI is the principal diagonosis or not (according to seq_num
).
study_admissions %>%
mutate(principal_dx = seq_num == 1) %>%
select(-seq_num) %>%
print() -> study_admissions
## # Source: lazy query [?? x 4]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
## subject_id hadm_id icd9_code principal_dx
## <int> <int> <chr> <lgl>
## 1 24 161859 41041 TRUE
## 2 42 119203 412 FALSE
## 3 53 155385 41021 TRUE
## 4 73 194730 41001 TRUE
## 5 111 192123 41081 FALSE
## 6 123 195632 41011 TRUE
## 7 149 154869 41011 TRUE
## 8 154 111735 41041 TRUE
## 9 158 169433 412 FALSE
## 10 160 161672 41041 TRUE
## # … with more rows
We want to add information about the severity of patients’ ailments. The drgcodes
table contains, for DRG
codes from the All Payers Registry (APR), severity and mortality indicators. We pull the drug severity information and right-join it to our query table.
tbl(con, "drgcodes") %>%
filter(str_detect(drg_type, "APR")) %>%
select(subject_id, hadm_id, drg_severity) %>%
right_join(study_admissions, by = c("subject_id", "hadm_id")) %>%
mutate(drg_severity = ifelse(is.na(drg_severity), 1, drg_severity)) %>%
print() -> study_admissions
## # Source: lazy query [?? x 5]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
## subject_id hadm_id drg_severity icd9_code principal_dx
## <int> <int> <dbl> <chr> <lgl>
## 1 24 161859 1 41041 TRUE
## 2 42 119203 1 412 FALSE
## 3 53 155385 1 41021 TRUE
## 4 73 194730 1 41001 TRUE
## 5 111 192123 4 41081 FALSE
## 6 123 195632 1 41011 TRUE
## 7 149 154869 1 41011 TRUE
## 8 154 111735 1 41041 TRUE
## 9 158 169433 1 412 FALSE
## 10 160 161672 1 41041 TRUE
## # … with more rows
Pull the admission time admittime
, discharge time dischtime
, date of birth dob
, and date of death dod
. We are interested in the mortaility rate 30 days after discharge. So we only keep patients who didn’t die in hospital.
study_admissions %>%
left_join(
select(tbl(con, "admissions"),
subject_id, hadm_id, admittime, dischtime, hospital_expire_flag
), by = c("subject_id", "hadm_id")
) %>%
filter(hospital_expire_flag == 0) %>% # patients who did not die in hospital
select(-hospital_expire_flag) %>%
left_join(
select(tbl(con, "patients"), subject_id, dob, dod),
by = "subject_id"
) %>%
print(width = Inf) -> study_admissions
## # Source: lazy query [?? x 9]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
## subject_id hadm_id drg_severity icd9_code principal_dx admittime
## <int> <int> <dbl> <chr> <lgl> <dttm>
## 1 24 161859 1 41041 TRUE 2139-06-06 16:14:00
## 2 42 119203 1 412 FALSE 2116-04-26 18:58:00
## 3 53 155385 1 41021 TRUE 2170-03-20 20:20:00
## 4 73 194730 1 41001 TRUE 2170-12-22 00:12:00
## 5 111 192123 4 41081 FALSE 2142-04-24 06:55:00
## 6 123 195632 1 41011 TRUE 2161-10-29 02:27:00
## 7 149 154869 1 41011 TRUE 2135-02-18 19:16:00
## 8 154 111735 1 41041 TRUE 2117-12-29 21:36:00
## 9 158 169433 1 412 FALSE 2170-02-03 11:38:00
## 10 160 161672 1 41041 TRUE 2174-11-06 08:38:00
## dischtime dob dod
## <dttm> <dttm> <dttm>
## 1 2139-06-09 12:48:00 2100-05-31 00:00:00 NA
## 2 2116-04-30 18:16:00 2055-02-25 00:00:00 2121-10-17 00:00:00
## 3 2170-03-23 18:00:00 2124-08-31 00:00:00 NA
## 4 2170-12-26 12:00:00 2113-05-22 00:00:00 NA
## 5 2142-05-05 11:45:00 2075-07-16 00:00:00 2144-07-01 00:00:00
## 6 2161-11-01 11:37:00 2105-08-31 00:00:00 NA
## 7 2135-02-26 17:00:00 1835-02-18 00:00:00 2135-03-02 00:00:00
## 8 2118-01-01 14:27:00 2073-07-26 00:00:00 NA
## 9 2170-02-06 16:49:00 2102-02-26 00:00:00 NA
## 10 2174-11-08 16:35:00 2124-11-22 00:00:00 NA
## # … with more rows
To add age
(at admission) variable into the table. The documentation for the patients table explains that patients of 90 years and older had their ages artificially inflated, so we remove these patients from the analysis.
study_admissions %>%
mutate(tt_death = date_part("day", dod) - date_part("day", dischtime)) %>%
mutate(mortality = tt_death <= 30) %>%
mutate(age = date_part("year", admittime) - date_part("year", dob)) %>%
filter(age < 90) %>%
mutate(age = age - ifelse(
date_part("month", admittime) < date_part("month", dob) |
(
date_part("month", admittime) == date_part("month", dob) &
date_part("day", admittime) < date_part("day", dob)
),
1,
0
)) %>%
select(-admittime, -dischtime, -dob, -dod, -tt_death) %>%
select(subject_id, hadm_id, age, mortality, everything()) %>%
print() -> study_admissions
## # Source: lazy query [?? x 7]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
## subject_id hadm_id age mortality drg_severity icd9_code principal_dx
## <int> <int> <dbl> <lgl> <dbl> <chr> <lgl>
## 1 24 161859 39 NA 1 41041 TRUE
## 2 42 119203 61 TRUE 1 412 FALSE
## 3 53 155385 45 NA 1 41021 TRUE
## 4 73 194730 57 NA 1 41001 TRUE
## 5 111 192123 66 TRUE 4 41081 FALSE
## 6 123 195632 56 NA 1 41011 TRUE
## 7 154 111735 44 NA 1 41041 TRUE
## 8 158 169433 67 NA 1 412 FALSE
## 9 160 161672 49 NA 1 41041 TRUE
## 10 194 124794 47 NA 1 41041 TRUE
## # … with more rows
Many mortality indicators are missing, due to neither the hospital database nor the social security database having a record of these patients’ deaths. We could convert these to FALSE
values, but it may be helpful to retain in the analytic table this information on whether deaths were recorded at all, e.g. for validation or sensitivity testing.
Finally, let’s merge some demographic information (ethnicity, gender) into our study study_admissions
.
tbl(con, "admissions") %>%
select(subject_id, ethnicity) %>%
distinct() %>%
print() -> study_subjects
## # Source: lazy query [?? x 2]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
## subject_id ethnicity
## <int> <chr>
## 1 1421 BLACK/AFRICAN AMERICAN
## 2 68255 ASIAN
## 3 4901 WHITE
## 4 81232 WHITE
## 5 28292 WHITE
## 6 21741 WHITE
## 7 12175 BLACK/AFRICAN AMERICAN
## 8 341 BLACK/AFRICAN AMERICAN
## 9 48629 UNABLE TO OBTAIN
## 10 18688 HISPANIC OR LATINO
## # … with more rows
tbl(con, "patients") %>%
select(subject_id, gender) %>%
distinct() %>%
full_join(study_subjects, by = "subject_id") %>%
print() -> study_subjects
## # Source: lazy query [?? x 3]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
## subject_id gender ethnicity
## <int> <chr> <chr>
## 1 26121 M WHITE
## 2 7489 F WHITE
## 3 81661 M WHITE
## 4 30273 F WHITE
## 5 55935 F BLACK/AFRICAN AMERICAN
## 6 63914 M UNABLE TO OBTAIN
## 7 19393 M WHITE
## 8 15861 M WHITE
## 9 12289 F WHITE
## 10 31927 F PORTUGUESE
## # … with more rows
study_subjects %>%
semi_join(study_admissions, by = "subject_id") %>%
print() -> study_subjects
## # Source: lazy query [?? x 3]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
## subject_id gender ethnicity
## <int> <chr> <chr>
## 1 11418 M UNKNOWN/NOT SPECIFIED
## 2 11787 F WHITE
## 3 28951 M WHITE
## 4 24780 M HISPANIC OR LATINO
## 5 30647 M UNABLE TO OBTAIN
## 6 5960 F UNKNOWN/NOT SPECIFIED
## 7 23893 M WHITE
## 8 81371 M WHITE
## 9 3606 M UNKNOWN/NOT SPECIFIED
## 10 58438 F OTHER
## # … with more rows
Let’s resolves ome diversity and inconsistency in the ethnicity
field:
unknown_ethnicity <- c(
"OTHER",
"UNABLE TO OBTAIN",
"UNKNOWN/NOT SPECIFIED",
"MULTI RACE ETHNICITY",
"PATIENT DECLINED TO ANSWER",
"UNKNOWN"
)
study_subjects %>%
collect() %>%
mutate(ethnic_group = case_when(
str_detect(ethnicity, "^ASIAN") ~ "ASIAN",
str_detect(ethnicity, "^BLACK") ~ "BLACK",
str_detect(ethnicity, "^HISPANIC") ~ "HISPANIC",
str_detect(ethnicity, "^WHITE") ~ "WHITE",
ethnicity %in% unknown_ethnicity ~ NA_character_,
TRUE ~ NA_character_
)) %>%
select(subject_id, gender, ethnic_group) %>%
print() -> study_subjects
## # A tibble: 1,203 x 3
## subject_id gender ethnic_group
## <int> <chr> <chr>
## 1 24 M WHITE
## 2 42 M <NA>
## 3 53 M <NA>
## 4 73 F WHITE
## 5 111 F WHITE
## 6 123 M HISPANIC
## 7 154 M WHITE
## 8 158 M WHITE
## 9 160 F WHITE
## 10 194 M BLACK
## # … with 1,193 more rows
Some patients are coded as belonging to more than one ethnic group. To resolve these inconsistencies, we define a helper function to pick the modal value from a vector of values in R, which can be used by the summarize()
function to choose one ethnic group for each patient.
most <- function(x) {
if (all(is.na(x))) return(NA_character_)
y <- table(x, useNA = "no")
if (length(which(y == max(y))) > 1) return(NA_character_)
return(names(y)[which.max(y)])
}
study_subjects %>%
group_by(subject_id) %>%
summarize(ethnic_group = most(ethnic_group)) %>%
ungroup() %>%
mutate(ethnic_group = ifelse(is.na(ethnic_group), "UNKNOWN", ethnic_group)) %>%
print() -> subject_ethnic_groups
## # A tibble: 1,188 x 2
## subject_id ethnic_group
## <int> <chr>
## 1 24 WHITE
## 2 42 UNKNOWN
## 3 53 UNKNOWN
## 4 73 WHITE
## 5 111 WHITE
## 6 123 HISPANIC
## 7 154 WHITE
## 8 158 WHITE
## 9 160 WHITE
## 10 194 BLACK
## # … with 1,178 more rows
study_subjects %>%
select(subject_id, gender) %>%
left_join(subject_ethnic_groups, by = "subject_id") %>%
print() -> study_subjects
## # A tibble: 1,203 x 3
## subject_id gender ethnic_group
## <int> <chr> <chr>
## 1 24 M WHITE
## 2 42 M UNKNOWN
## 3 53 M UNKNOWN
## 4 73 F WHITE
## 5 111 F WHITE
## 6 123 M HISPANIC
## 7 154 M WHITE
## 8 158 M WHITE
## 9 160 F WHITE
## 10 194 M BLACK
## # … with 1,193 more rows
Now we add the demographic information gender
and ethnicity
into our study_admissions
table:
study_admissions %>%
left_join(study_subjects, by = "subject_id", copy = TRUE) %>%
print() -> study_admissions
## # Source: lazy query [?? x 9]
## # Database: postgres 11.0.7 [postgres@/var/run/postgresql:5432/mimic]
## subject_id hadm_id age mortality drg_severity icd9_code principal_dx gender
## <int> <int> <dbl> <lgl> <dbl> <chr> <lgl> <chr>
## 1 24 161859 39 NA 1 41041 TRUE M
## 2 42 119203 61 TRUE 1 412 FALSE M
## 3 53 155385 45 NA 1 41021 TRUE M
## 4 73 194730 57 NA 1 41001 TRUE F
## 5 111 192123 66 TRUE 4 41081 FALSE F
## 6 123 195632 56 NA 1 41011 TRUE M
## 7 154 111735 44 NA 1 41041 TRUE M
## 8 158 169433 67 NA 1 412 FALSE M
## 9 160 161672 49 NA 1 41041 TRUE F
## 10 194 124794 47 NA 1 41041 TRUE M
## # … with more rows, and 1 more variable: ethnic_group <chr>
Close the connection:
dbDisconnect(con)
## [1] TRUE
CONSORT Flow Diagrams can be used to plot the flow of data selection of a patient cohort.
For more details, see: The CONSORT Flow Diagram. Following code shows an example.
library(shape)
library(diagram)
# set margins and multiplot
par(mfrow = c(1, 1))
par(mar = c(0, 0, 0, 0))
# initialise a plot device
openplotmat()
# position of boxes
# 1st column indicates x axis position between 0 and 1
# 2nd column indicates y axis position between 0 and 1
# automatically assigns vertical position
num_of_boxes <- 6
auto_coords = coordinates(num_of_boxes)
vert_pos = rev(auto_coords[,1])
box_pos <- matrix(nrow = num_of_boxes, ncol = 2, data = 0)
box_pos[1,] = c(0.20, vert_pos[1]) # 1st box
box_pos[2,] = c(0.70, vert_pos[2]) # 2nd box
box_pos[3,] = c(0.70, vert_pos[3]) # 3rd box
box_pos[4,] = c(0.70, vert_pos[4]) # etc...
box_pos[5,] = c(0.70, vert_pos[5])
box_pos[6,] = c(0.20, vert_pos[6])
# content of boxes
box_content <- matrix(nrow = num_of_boxes, ncol = 1, data = 0)
box_content[1] = "All patients in MIMIC-III \n n = 58,976" # 1st box
box_content[2] = "Exclude patients of age < 18 \n n = 8,180" # 2nd box
box_content[3] = "Exclude patients with no ICU admissions \n n = 1,071" # 3rd box
box_content[4] = "Exclude patients with diabetes \n n = 1,324" # etc...
box_content[5] = "Exclude patients with sepsis \n n = 4,804"
box_content[6] = "Study cohort \n n = 43,597"
# adjust the size of boxes to fit content
box_x <- c(0.20, 0.25, 0.25, 0.25, 0.25, 0.20)
box_y <- c(0.07, 0.07, 0.07, 0.07, 0.07, 0.07)
# Draw the arrows
straightarrow(from = c(box_pos[1,1],box_pos[2,2]), to = box_pos[2,], lwd = 1)
straightarrow(from = c(box_pos[1,1],box_pos[3,2]), to = box_pos[3,], lwd = 1)
straightarrow(from = c(box_pos[1,1],box_pos[4,2]), to = box_pos[4,], lwd = 1)
straightarrow(from = c(box_pos[1,1],box_pos[5,2]), to = box_pos[5,], lwd = 1)
straightarrow(from = box_pos[1,], to = box_pos[6,], lwd = 1)
# Draw the boxes
for (i in 1:num_of_boxes) {
textrect(mid = box_pos[i,], radx = box_x[i], rady = box_y[i],
lab = box_content[i],
shadow.col = "grey")
}
RpostgreSQL
package on CentOSThis is a note to myself (Dr. Hua Zhou). The postgres in yum is an old version. We want to install the most recent postgres and then build the RpostgreSQL
package based on it.
Follow instructions in https://www.digitalocean.com/community/tutorials/how-to-install-and-use-postgresql-on-centos-7 to install PostgreSQL 11 on CentOS 7.
Issue following command to install RPostgreSQL
in R:
sudo R -e 'Sys.setenv(PG_INCDIR="/usr/pgsql-11/include/"); Sys.setenv(PG_LIBDIR="/usr/pgsql-11/lib/"); install.packages("RPostgreSQL")'