Biostat 203B Homework 1

Q1. Git/GitHub

No handwritten homework reports are accepted for this course. We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework.

Apply for the Student Developer Pack at GitHub using your UCLA email.
Create a private repository biostat-203b-2020-winter and add Hua-Zhou and juhkim111 as your collaborators with write permission.
The top-level directories of the repository should be hw1, hw2, … Maintain two branches, master and develop. The develop branch will be your main playground, the place where you develop solutions (code) to homework problems and write up reports. The master branch will be your presentation area. Submit your homework files (the R Markdown file Rmd, the html file converted from R Markdown, and all code and data sets needed to reproduce results) in the master branch.
After each homework due date, the teaching assistant and instructor will check out your master branch for grading. Tag each of your homework submissions with tag names hw1, hw2, … Tagging time will be used as your submission time. That means if you tag your hw1 submission after the deadline, penalty points will be deducted for late submission.
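
The develop, master, and tag cycle described above can be tried out safely in a throwaway repository. The following is a minimal sketch, not a required procedure: the file name hw1.Rmd, the commit messages, and the placeholder identity are assumptions made for illustration, and the repository lives in a temporary directory rather than on GitHub.

```shell
#!/bin/sh
# Sketch of the develop -> master -> tag cycle in a throwaway repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master     # name the first branch master on any Git version
git config user.name "Student Name"         # placeholder identity, local to this repo only
git config user.email "student@example.com"

mkdir hw1
echo "# Homework 1" > hw1/hw1.Rmd           # hypothetical file name
git add hw1
git commit -q -m "Start hw1 report"

git checkout -q -b develop                  # day-to-day work happens on develop
echo "Draft answer to Q1" >> hw1/hw1.Rmd
git commit -q -am "Draft Q1 answer"

git checkout -q master                      # finished work is presented on master
git merge -q develop
git tag hw1                                 # the tag marks the submission

branches=$(git branch --list | tr -d '* ' | sort | paste -s -d ' ' -)
tags=$(git tag)
echo "branches: $branches"
echo "tags: $tags"
```

In a repository with a GitHub remote, the last step would typically be followed by pushing both the branch and the tag, for example with git push origin master --tags, assuming the remote is named origin.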

Q2. Linux Shell Commands

This exercise (and later ones in this course) uses the MIMIC-III data, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Please follow the instructions at https://mimic.physionet.org/gettingstarted/access/ to complete the CITI Data or Specimens Only Research course. Show a screenshot of your completion report.

The /home/203bdata/mimic-iii/ folder on the teaching server contains data sets from MIMIC-III. See https://mimic.physionet.org/mimictables/admissions/ for details of each table.

ls -l /home/203bdata/mimic-iii

## total 11088432
## -rw-r--r--. 1 root root   12548562 Jan 14 04:12 ADMISSIONS.csv
## -rw-r--r--. 1 root root    6339185 Jan 14 04:13 CALLOUT.csv
## -rw-r--r--. 1 root root     203492 Jan 14 04:13 CAREGIVERS.csv
## -rw-r--r--. 1 root root   85204883 Jan 14 04:14 CHARTEVENTS.csv
## -rw-r--r--. 1 root root   58150883 Jan 14 04:15 CPTEVENTS.csv
## -rw-r--r--. 1 root root  525785298 Jan 14 04:22 DATETIMEEVENTS.csv
## -rw-r--r--. 1 root root      13807 Jan 14 04:22 D_CPT.csv
## -rw-r--r--. 1 root root   19137527 Jan 14 04:22 DIAGNOSES_ICD.csv
## -rw-r--r--. 1 root root    1387562 Jan 14 04:22 D_ICD_DIAGNOSES.csv
## -rw-r--r--. 1 root root     311466 Jan 14 04:22 D_ICD_PROCEDURES.csv
## -rw-r--r--. 1 root root     954420 Jan 14 04:22 D_ITEMS.csv
## -rw-r--r--. 1 root root      43118 Jan 14 04:22 D_LABITEMS.csv
## -rw-r--r--. 1 root root   10487132 Jan 14 04:22 DRGCODES.csv
## -rw-r--r--. 1 root root    6357077 Jan 14 04:22 ICUSTAYS.csv
## -rw-r--r--. 1 root root 2464296511 Jan 14 04:57 INPUTEVENTS_CV.csv
## -rw-r--r--. 1 root root  975255812 Jan 14 05:10 INPUTEVENTS_MV.csv
## -rw-r--r--. 1 root root 1854245647 Jan 14 05:36 LABEVENTS.csv
## -rw-r--r--. 1 root root   72507810 Jan 14 05:36 MICROBIOLOGYEVENTS.csv
## -rw-r--r--. 1 root root 4007717810 Jan 14 16:40 NOTEEVENTS.csv
## -rw-r--r--. 1 root root  396406750 Jan 14 16:50 OUTPUTEVENTS.csv
## -rw-r--r--. 1 root root    2628900 Jan 14 16:50 PATIENTS.csv
## -rw-r--r--. 1 root root      36517 Jan 14 16:53 postgres_add_indexes.sql
## -rw-r--r--. 1 root root       4195 Jan 14 16:53 postgres_checks.sql
## -rw-r--r--. 1 root root      20688 Jan 14 16:53 postgres_create_tables.sql
## -rw-r--r--. 1 root root       6897 Jan 14 16:53 postgres_load_data.sql
## -rw-r--r--. 1 root root  770336136 Jan 14 16:52 PRESCRIPTIONS.csv
## -rw-r--r--. 1 root root   48770424 Jan 14 16:52 PROCEDUREEVENTS_MV.csv
## -rw-r--r--. 1 root root    6798492 Jan 14 16:52 PROCEDURES_ICD.csv
## -rw-r--r--. 1 root root    3481645 Jan 14 16:52 SERVICES.csv
## -rw-r--r--. 1 root root   25057095 Jan 14 16:52 TRANSFERS.csv

Please do not put these data files into Git; they are big. Also, do not copy them into your directory. Just read from the data folder /home/203bdata/mimic-iii directly in the following exercises.

Use Bash commands to answer the following questions.

What’s the output of the following bash script?
```
for datafile in /home/203bdata/mimic-iii/*.csv
  do
    ls $datafile
  done
```
Display the number of lines in each csv file.
Display the first few lines of ADMISSIONS.csv. How many rows are in this data file? How many unique patients (identified by SUBJECT_ID) are in this data file? What are the possible values taken by each of the variables INSURANCE, LANGUAGE, RELIGION, MARITAL_STATUS, and ETHNICITY? How many (unique) patients are Hispanic? (Hint: combine Linux commands head, tail, awk, uniq, wc, sort and so on using pipes.)
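
The pipe-based approach in the hint can be practiced on a small made-up file before touching the real data. The sketch below is only an illustration: toy.csv, its three columns, and its five rows are invented, and splitting on commas with awk -F, assumes that no field contains a quoted comma, which may not hold for the real tables.

```shell
#!/bin/sh
# Toy illustration of head/tail/awk/sort/uniq/wc pipelines on an invented CSV file.
set -e
tmp=$(mktemp -d)
cat > "$tmp/toy.csv" <<'EOF'
SUBJECT_ID,INSURANCE,ETHNICITY
1,Private,WHITE
1,Private,WHITE
2,Medicare,HISPANIC OR LATINO
3,Medicaid,ASIAN
3,Medicare,ASIAN
EOF

# Line count of every csv file in the directory (header line included).
for f in "$tmp"/*.csv; do
  echo "$f: $(wc -l < "$f" | tr -d ' ') lines"
done

# Number of data rows: skip the header with tail -n +2.
n_rows=$(tail -n +2 "$tmp/toy.csv" | wc -l | tr -d ' ')

# Number of distinct values in column 1.
n_subjects=$(tail -n +2 "$tmp/toy.csv" | awk -F, '{ print $1 }' | sort -u | wc -l | tr -d ' ')

# Frequency table of column 2, flattened to "value=count" pairs.
insurance_counts=$(tail -n +2 "$tmp/toy.csv" | awk -F, '{ print $2 }' | sort | uniq -c |
  awk '{ print $2 "=" $1 }' | paste -s -d ' ' -)

# Distinct subjects whose column 3 mentions HISPANIC.
n_hispanic=$(tail -n +2 "$tmp/toy.csv" | awk -F, '$3 ~ /HISPANIC/ { print $1 }' | sort -u | wc -l | tr -d ' ')

echo "rows=$n_rows subjects=$n_subjects hispanic=$n_hispanic"
echo "insurance: $insurance_counts"
```

Note that uniq only collapses adjacent duplicates, which is why each pipeline sorts first.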

Q3. More fun with shell

You and your friend have just finished reading Pride and Prejudice by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Obtain the full text of the novel from http://www.gutenberg.org/cache/epub/42671/pg42671.txt and save it to your local folder.
```
curl http://www.gutenberg.org/cache/epub/42671/pg42671.txt > pride_and_prejudice.txt
```
Do not put this text file pride_and_prejudice.txt in Git. Using a for loop, how would you tabulate the number of times each of the four characters is mentioned?
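
One way to set up such a loop is sketched below on a two-line made-up text. The names Alice, Bob, and Carol and the file sample.txt are invented for illustration. The sketch counts occurrences with grep -o, which prints each match on its own line; grep -c would instead count matching lines and so undercount names that appear twice on a line.

```shell
#!/bin/sh
# Count whole-word occurrences of several names in an invented sample text.
set -e
tmp=$(mktemp -d)
cat > "$tmp/sample.txt" <<'EOF'
Alice met Bob. Bob waved, and Alice smiled.
Carol asked Alice about Bob; Alice shrugged.
EOF

result=$(
  for name in Alice Bob Carol; do
    # -o prints every match on its own line, -w matches whole words only.
    count=$(grep -o -w "$name" "$tmp/sample.txt" | wc -l | tr -d ' ')
    echo "$name $count"
  done
)
echo "$result"
```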

What’s the difference between the following two commands?

echo 'hello, world' > test1.txt

and

echo 'hello, world' >> test2.txt
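
A quick experiment can help when reasoning about the two operators: run each command twice and compare the resulting files. The sketch below does this in a temporary directory; the file names a.txt and b.txt are arbitrary.

```shell
#!/bin/sh
# Run each redirection twice and compare how many lines each file ends up with.
set -e
tmp=$(mktemp -d)

echo 'hello, world' > "$tmp/a.txt"
echo 'hello, world' > "$tmp/a.txt"

echo 'hello, world' >> "$tmp/b.txt"
echo 'hello, world' >> "$tmp/b.txt"

lines_a=$(wc -l < "$tmp/a.txt" | tr -d ' ')
lines_b=$(wc -l < "$tmp/b.txt" | tr -d ' ')
echo "a.txt: $lines_a line(s), b.txt: $lines_b line(s)"
```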

Using your favorite text editor (e.g., vi), type the following and save the file as middle.sh:
```
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
```
Using chmod, make the file executable by the owner, and run
```
./middle.sh pride_and_prejudice.txt 20 5
```
Explain the output. Explain the meaning of "$1", "$2", and "$3" in this shell script. Why do we need the first line of the shell script?
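
To see the head and tail combination on predictable input, the same pipeline can be applied to a file whose lines are simply numbered. The sketch below is an illustration under assumptions: numbers.txt is a made-up file containing the integers 1 to 30, and the pipeline is wrapped in a shell function named middle rather than a script file, since positional parameters behave the same way inside a function.

```shell
#!/bin/sh
# Apply the head | tail pipeline from middle.sh to a file of numbered lines.
set -e
tmp=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 30; i++) print i }' > "$tmp/numbers.txt"

# $1 = file name, $2 = last line to keep, $3 = how many lines to keep.
middle() {
  head -n "$2" "$1" | tail -n "$3"
}

selected=$(middle "$tmp/numbers.txt" 20 5 | paste -s -d ' ' -)
echo "$selected"
```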

Q4. R Batch Run

In class we discussed using R to organize simulation studies.

Expand the runSim.R script to include arguments seed (random seed), n (sample size), dist (distribution) and rep (number of simulation replicates). When dist="gaussian", generate data from the standard normal distribution; when dist="t1", generate data from the t-distribution with 1 degree of freedom (the same as the Cauchy distribution); when dist="t5", generate data from the t-distribution with 5 degrees of freedom. Calling runSim.R will (1) set the random seed according to argument seed, (2) generate data according to argument dist, (3) compute the prime-indexed average estimator and the classical sample average estimator for each simulation replicate, and (4) report the average mean squared error (MSE) \[ \frac{\sum_{r=1}^{\text{rep}} (\widehat \mu_r - \mu_{\text{true}})^2}{\text{rep}} \] for both methods.
Modify the autoSim.R script to run simulations with combinations of sample sizes nVals = seq(100, 500, by=100) and distributions distTypes = c("gaussian", "t1", "t5") and write output to appropriately named files. Use rep = 50, and seed = 203.
Write an R script to collect simulation results from output files and print average MSEs in a table of format

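| $n$ | Method | Gaussian | $t_5$ | $t_1$ |
|-----|----------|----------|-------|-------|
| 100 | PrimeAvg | | | |
| 100 | SampAvg | | | |
| 200 | PrimeAvg | | | |
| 200 | SampAvg | | | |
| 300 | PrimeAvg | | | |
| 300 | SampAvg | | | |
| 400 | PrimeAvg | | | |
| 400 | SampAvg | | | |
| 500 | PrimeAvg | | | |
| 500 | SampAvg | | | |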
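
The grid of sample sizes and distributions yields 15 separate runs, and it can help to list them before writing autoSim.R. The assignment asks for the grid to be driven from R; the shell loop below is only a dry run that prints one command per combination without executing anything. The key=value argument style and the output file names sim_n..._....txt are assumptions, not part of the assignment, and should be adapted to however runSim.R actually reads its arguments.

```shell
#!/bin/sh
# Dry run: print one Rscript command per (n, dist) combination instead of running it.
seed=203
rep=50

commands=$(
  for n in 100 200 300 400 500; do
    for dist in gaussian t1 t5; do
      outfile="sim_n${n}_${dist}.txt"   # hypothetical naming scheme
      echo "Rscript runSim.R seed=$seed n=$n dist=$dist rep=$rep > $outfile"
    done
  done
)

echo "$commands"
n_commands=$(echo "$commands" | wc -l | tr -d ' ')
first_command=$(echo "$commands" | head -n 1)
```

Encoding both n and dist in each file name keeps the collection step simple, since the script that builds the table can recover the parameters from the name alone.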

Due Jan 24 @ 11:59PM
