The Book of OHDSI: https://ohdsi.github.io/TheBookOfOhdsi/
Secondary Analysis of Electronic Health Records https://link.springer.com/book/10.1007/978-3-319-43742-2
Lecture notes by Dr. Jin Zhou @ University of Arizona.
The task can be a classification problem (predicting whether a patient will have heart failure or not) or a regression problem (predicting the costs a patient will incur). The final step of this pipeline is to assess how good our model is through performance evaluation (leave-one-out cross-validation, K-fold cross-validation, randomized cross-validation).
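As a minimal sketch of the evaluation step, the snippet below runs K-fold and leave-one-out cross-validation on simulated data; the feature matrix, labels, and the logistic regression model are placeholders, not the pipeline's actual components.

```python
# Minimal sketch of cross-validated performance evaluation for a binary outcome
# (e.g., heart failure yes/no). X and y are hypothetical placeholders for a
# patient feature matrix and outcome labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# K-fold: split the data into K folds, train on K-1, evaluate on the held-out fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
auc_kfold = cross_val_score(model, X, y, cv=kfold, scoring="roc_auc")
print("5-fold AUC: %.3f +/- %.3f" % (auc_kfold.mean(), auc_kfold.std()))

# Leave-one-out: every patient is held out once (expensive for large data).
loo_acc = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="accuracy")
print("LOO accuracy: %.3f" % loo_acc.mean())
```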
The input to computational phenotyping is raw patient data from many sources, such as demographic information, diagnoses, medications, procedures, lab tests, and clinical notes. The phenotyping algorithm converts this raw patient data into medical concepts or phenotypes. The raw data is mainly collected to support clinical operations such as billing; the derived phenotypes in turn support research such as genomic studies.
Population-level effect estimation refers to the estimation of average causal effects of exposures (e.g. medical interventions such as drug exposures or procedures) on specific health outcomes of interest.
Direct effect estimation: estimating the effect of an exposure on the risk of an outcome, as compared to no exposure.
Comparative effect estimation: estimating the effect of an exposure (the target exposure) on the risk of an outcome, as compared to another exposure (the comparator exposure).
In both cases, the patient-level causal effect contrasts a factual outcome, i.e., what happened to the exposed patient, with a counterfactual outcome, i.e., what would have happened had the exposure not occurred (direct) or had a different exposure occurred (comparative).
Since any one patient reveals only the factual outcome (the fundamental problem of causal inference), the various effect estimation designs employ different analytic devices to shed light on the counterfactual outcomes.
Sample average treatment effect: \[\tau_s = \frac{1}{N}\sum_{i=1}^N \left( Y_i(1) - Y_i(0)\right)\] where \(Y_i(1)\) and \(Y_i(0)\) denote the potential outcomes of subject \(i\) with and without treatment.
If the two assumptions hold, we can estimate the average treatment effect \(\tau(x)\) as \[ \begin{align} \tau(x) &\equiv E[Y_i(1) - Y_i(0)|X_i = x]\\ & = E[Y_i(1)|X_i = x] - E[Y_i(0)|X_i = x]\\ & = E[Y_i(1)|X_i = x, W_i=1] - E[Y_i(0)|X_i = x, W_i=0]\\ & = E[Y_i|X_i = x, W_i=1] - E[Y_i|X_i = x, W_i=0] \end{align} \] where \(W_i\) is the treatment indicator; the third line uses unconfoundedness (treatment assignment is independent of the potential outcomes given \(X_i\)) and the last line uses consistency (\(Y_i = Y_i(W_i)\)).
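The identity above suggests a simple plug-in estimator: within each level of \(X\), take the difference of the observed mean outcomes between treated and untreated subjects, then average over the distribution of \(X\). The sketch below does this on simulated data with a single discrete confounder; the data-generating process and all names are illustrative only.

```python
# Minimal sketch: estimate tau(x) as a difference of conditional means, then
# average over the covariate distribution. Simulated data with one discrete
# covariate X, treatment indicator W, and outcome Y.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
X = rng.integers(0, 3, size=n)              # discrete covariate (3 strata)
W = rng.binomial(1, 0.2 + 0.2 * X)          # treatment depends on X (confounding)
Y = 1.0 * X + 2.0 * W + rng.normal(size=n)  # true treatment effect is 2

# tau(x) = E[Y | X=x, W=1] - E[Y | X=x, W=0], valid under unconfoundedness given X
tau_x = np.array([Y[(X == x) & (W == 1)].mean() - Y[(X == x) & (W == 0)].mean()
                  for x in range(3)])

# Average treatment effect: average tau(x) over the distribution of X
weights = np.bincount(X, minlength=3) / n
ate = np.sum(weights * tau_x)
print("tau(x) per stratum:", np.round(tau_x, 2), "ATE:", round(ate, 2))
```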
The cohort method attempts to emulate a randomized clinical trial. Subjects that are observed to initiate one treatment (the target) are compared to subjects initiating another treatment (the comparator) and are followed for a specific amount of time following treatment initiation, for example the time they stay on the treatment.
| Choice | Description |
|---|---|
| Target cohort | A cohort representing the target treatment |
| Comparator cohort | A cohort representing the comparator treatment |
| Outcome cohort | A cohort representing the outcome of interest |
| Time-at-risk | At what time (often relative to the target and comparator cohort start and end dates) do we consider the risk of the outcome? |
| Model | The model used to estimate the effect while adjusting for differences between the target and comparator |
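As a minimal numerical sketch of the cohort-method contrast described above, the snippet below computes unadjusted incidence rates during the time-at-risk in a hypothetical target and comparator cohort and their ratio; in a real study the model listed in the table (e.g., an outcome model after propensity-score adjustment) would adjust for baseline differences between the cohorts.

```python
# Unadjusted cohort-method contrast: incidence rates during the time-at-risk in
# the target and comparator cohorts, and their ratio. Aggregates are hypothetical.
import numpy as np

events = {"target": 120, "comparator": 95}            # outcomes during time-at-risk
person_years = {"target": 4000.0, "comparator": 4200.0}

rates = {k: events[k] / person_years[k] for k in events}
irr = rates["target"] / rates["comparator"]

# Approximate 95% CI on the log scale (standard error of the log rate ratio)
se_log_irr = np.sqrt(1 / events["target"] + 1 / events["comparator"])
ci = np.exp(np.log(irr) + np.array([-1.96, 1.96]) * se_log_irr)
print("IRR = %.2f, 95%% CI (%.2f, %.2f)" % (irr, *ci))
```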
The self-controlled cohort (SCC) design (Ryan, Schuemie, and Madigan 2013) compares the rate of outcomes during exposure to the rate of outcomes in the time just prior to the exposure.
| Choice | Description |
|---|---|
| Target cohort | A cohort representing the treatment |
| Outcome cohort | A cohort representing the outcome of interest |
| Time-at-risk | At what time (often relative to the target cohort start and end dates) do we consider the risk of the outcome? |
| Control time | The (unexposed) time period, typically just prior to exposure, used as the control time |
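Because the design is self-controlled, each exposed person contributes both exposed time and control time, so the contrast can be sketched as a within-person aggregation; the per-person counts below are hypothetical.

```python
# Self-controlled cohort contrast: the same exposed persons contribute exposed
# time (after treatment start) and control time (just before treatment start).
import numpy as np

# one row per person: [events during exposure, exposed years,
#                      events during control window, control years]
persons = np.array([
    [1, 0.5, 0, 0.5],
    [0, 1.0, 1, 1.0],
    [2, 0.8, 0, 0.8],
    [0, 0.3, 0, 0.3],
])

events_exp, time_exp = persons[:, 0].sum(), persons[:, 1].sum()
events_ctl, time_ctl = persons[:, 2].sum(), persons[:, 3].sum()

# Rate during exposure vs. rate during the pre-exposure control time
irr = (events_exp / time_exp) / (events_ctl / time_ctl)
print("SCC incidence rate ratio: %.2f" % irr)
```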
Case-control studies (Vandenbroucke and Pearce 2012) consider the question “are persons with a specific disease outcome exposed more frequently to a specific agent than those without the disease?” Thus, the central idea is to compare “cases,” i.e., subjects that experience the outcome of interest, with “controls,” i.e., subjects that did not experience the outcome of interest.
The case-crossover design (Maclure 1991) evaluates whether the rate of exposure is different at the time of the outcome than at some predefined number of days prior to the outcome. It is trying to determine whether there is something special about the day the outcome occurred.
The Self-Controlled Case Series (SCCS) design (Farrington 1995; Whitaker et al. 2006) compares the rate of outcomes during exposure to the rate of outcomes during all unexposed time, including before, between, and after exposures. It is a Poisson regression that is conditioned on the person. Thus, it seeks to answer the question: “Given that a patient has the outcome, is the outcome more likely during exposed time compared to non-exposed time?”.
L1-regularization, using cross-validation to select the regularization hyperparameter, is applied to the coefficients of all exposures except the exposure of interest.
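As a rough sketch of a Poisson regression conditioned on the person, the example below approximates the conditioning with person fixed effects (for Poisson models this yields the same exposure estimate) plus an offset for interval length; the intervals and counts are hypothetical, and the regularization of the other exposures described above is omitted.

```python
# Sketch of an SCCS-style analysis: Poisson regression with person fixed effects
# (dummy variables) and an offset for the length of each interval.
import numpy as np
import pandas as pd
import statsmodels.api as sm

# One row per (person, exposure status) interval: outcome count and interval length
data = pd.DataFrame({
    "person":  [1, 1, 2, 2, 3, 3],
    "exposed": [0, 1, 0, 1, 0, 1],
    "events":  [1, 2, 0, 1, 2, 3],
    "years":   [2.0, 0.5, 1.5, 0.5, 3.0, 1.0],
})

X = pd.get_dummies(data["person"], prefix="person").astype(float)
X["exposed"] = data["exposed"].astype(float)

model = sm.GLM(data["events"], X, family=sm.families.Poisson(),
               offset=np.log(data["years"]))
result = model.fit()
print("Incidence rate ratio (exposed vs. unexposed time): %.2f"
      % np.exp(result.params["exposed"]))
```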
Among a population at risk, we aim to predict which patients at a defined moment in time (t = 0) will experience some outcome during a time-at-risk. Prediction is done using only information about the patients in an observation window prior to that moment in time.
Observational healthcare data rarely reflects whether a value is negative or missing. For example, we simply observe that the person with ID 1 had no essential hypertension occurrence prior to the index date. This could be because the condition was not present (negative) at that time, or because it was not recorded (missing). It is important to realize that the machine learning algorithm cannot distinguish between negative and missing; it will simply assess the predictive value of the available data.
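The sketch below illustrates how such prior-history covariates are typically encoded: any condition recorded before the index date becomes 1 and everything else becomes 0, regardless of whether it is truly absent or just unrecorded. The records and person IDs are hypothetical.

```python
# Minimal sketch of constructing prior-history covariates: conditions recorded
# before the index date become 1, everything else 0, whether the condition is
# truly absent (negative) or simply not recorded (missing). Records are made up.
import pandas as pd

records = pd.DataFrame({
    "person_id": [1, 2, 2, 3],
    "condition": ["diabetes", "essential hypertension", "diabetes", "copd"],
})

features = (records.assign(value=1)
            .pivot_table(index="person_id", columns="condition",
                         values="value", fill_value=0)
            .reindex([1, 2, 3, 4], fill_value=0))   # person 4 has no records at all
print(features)
# The model sees only these 0/1 values; a 0 may mean "not present" or "not recorded".
```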
When considering method validity we aim to answer the question:
Is this method valid for answering this question?
“Method” includes not only the study design, but also the data and the implementation of the design. The core activity when establishing method validity is evaluating whether important assumptions in the analysis have been met. For example, we assume that propensity-score matching makes two populations comparable, but we need to evaluate whether this is the case.
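One common diagnostic for this assumption, sketched below, is the standardized difference of the mean of each baseline covariate between the two (matched) cohorts; values below 0.1 are often taken to indicate acceptable balance. The covariates and cohorts are simulated placeholders.

```python
# Sketch of a balance diagnostic after propensity-score matching: the
# standardized difference of means of each baseline covariate between the
# target and comparator cohorts (simulated data; two generic covariates).
import numpy as np

rng = np.random.default_rng(1)
target = rng.normal(loc=[52.0, 0.30], scale=[10.0, 0.46], size=(1000, 2))
comparator = rng.normal(loc=[55.0, 0.35], scale=[11.0, 0.48], size=(1000, 2))

def standardized_difference(a, b):
    pooled_sd = np.sqrt((a.var(axis=0) + b.var(axis=0)) / 2)
    return (a.mean(axis=0) - b.mean(axis=0)) / pooled_sd

# Absolute values below 0.1 are commonly considered balanced
print("standardized differences:",
      np.round(standardized_difference(target, comparator), 3))
```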
For example, in one study (Zaadstra et al. 2008) investigating the relationship between childhood diseases and later multiple sclerosis (MS), the authors include three negative controls that are not believed to cause MS: a broken arm, concussion, and tonsillectomy. Two of these three controls produce statistically significant associations with MS, suggesting that the study may be biased.
We should select negative controls that are comparable to our hypothesis of interest, which means we typically select exposure-outcome pairs that either have the same exposure as the hypothesis of interest (so-called “outcome controls”) or the same outcome (“exposure controls”). Our negative controls should further meet these criteria:
A semi-automated procedure for selecting negative controls has been proposed (Voss et al. 2016). In brief, information from literature, product labels, and spontaneous reporting is automatically extracted and synthesized to produce a candidate list of negative controls. This list must then undergo manual review, not only to verify that the automated extraction was accurate, but also to impose additional criteria such as biological plausibility.
Synthetic positive controls (Schuemie, Hripcsak, et al. 2018) are created by modifying a negative control through injection of additional, simulated occurrences of the outcome during the time at risk of the exposure.
For example, assume that, during exposure to \(ACEi\), \(n\) occurrences of our negative control outcome “ingrowing nail” were observed. If we now add an additional \(n\) simulated occurrences during exposure, we have doubled the risk. Since this was a negative control, the relative risk compared to the counterfactual was one, but after injection, it becomes two.
One important issue is the preservation of confounding.
To preserve confounding, we want the new outcomes to show similar associations with baseline subject-specific covariates as the original outcomes.
To achieve this, for each outcome we train a model to predict the survival rate with respect to the outcome during exposure using covariates captured prior to exposure. These covariates include demographics, as well as all recorded diagnoses, drug exposures, measurements, and medical procedures.
The prediction model is fit using L1-regularized Poisson regression (Suchard et al. 2013), with 10-fold cross-validation to select the regularization hyperparameter.
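The snippet below is a rough sketch of this injection step under simplifying assumptions: a penalized Poisson rate model is fit on simulated baseline covariates and used to sample the extra outcomes so that each subject's expected count is multiplied by the target effect size. scikit-learn's PoissonRegressor stands in for the L1-regularized regression (it uses an L2 penalty), and the cross-validated choice of the penalty strength is omitted.

```python
# Sketch of positive-control synthesis: fit a penalized Poisson model of the
# outcome rate on baseline covariates, then inject simulated extra outcomes so
# that each subject's expected count is multiplied by the target effect size.
# NOTE: PoissonRegressor uses an L2 penalty (a stand-in for the L1-regularized
# regression in the text), the penalty strength is fixed rather than chosen by
# 10-fold cross-validation, and all data below are simulated.
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
n = 5000
X = rng.binomial(1, 0.3, size=(n, 10)).astype(float)   # baseline covariates
years = rng.uniform(0.1, 2.0, size=n)                  # time at risk per subject
true_rate = np.exp(-3.0 + X[:, 0] + 0.5 * X[:, 1])     # outcomes per year
y = rng.poisson(true_rate * years)                     # observed outcome counts

# Model the rate (counts per person-year), weighting by person-time
rate_model = PoissonRegressor(alpha=0.1, max_iter=1000)
rate_model.fit(X, y / years, sample_weight=years)
predicted_rate = rate_model.predict(X)

# Inject (target_rr - 1) times the model-based expected count as extra outcomes,
# so the injected outcomes follow the same associations with baseline covariates
target_rr = 2.0
extra = rng.poisson((target_rr - 1.0) * predicted_rate * years)
print("total outcomes before/after injection: %d / %d" % (y.sum(), (y + extra).sum()))
```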
Figure 18.2 depicts this process. Note that although this procedure simulates several important sources of bias, it does not capture all of them. For example, some effects of measurement error are not present. The synthetic positive controls imply constant positive predictive value and sensitivity, which may not be true in reality.
Based on the estimates of a particular method for the negative and positive controls, we can then understand its operating characteristics by computing a range of metrics, for example the type 1 error and the coverage of the \(95\%\) confidence interval discussed below.
Often the type 1 error is larger than \(5\%\): when the null hypothesis is in fact true, we reject it more than \(5\%\) of the time. The reason is that the p-value reflects only random error, the error due to having a limited sample size; it does not reflect systematic error, for example error due to confounding. To address this, Schuemie et al. (2014) derive an empirical null distribution from the actual effect estimates for the negative controls.
Formally, a Gaussian probability distribution is fit to the estimates, taking into account the sampling error of each estimate. Let \(\hat{\theta}_i\) denote the estimated log effect estimate (relative risk, odds ratio, or incidence rate ratio) from the \(i\)th negative control drug–outcome pair, and let \(\hat\epsilon_i\) denote the corresponding estimated standard error, \(i=1,\ldots,n\). Let \(\theta_i\) denote the true log effect size (assumed \(0\) for negative controls), and let \(\beta_i\) denote the true (but unknown) bias associated with pair \(i\), that is, the difference between the log of the true effect size and the log of the estimate that the study would have returned for control \(i\) had it been infinitely large. As in the standard p-value computation, we assume that \(\hat{\theta}_i\) is normally distributed with mean \(\theta_i + \beta_i\) and variance \(\hat\epsilon_i^2\). Note that in traditional p-value calculation \(\beta_i\) is always assumed to be equal to zero; here we instead assume the \(\beta_i\) arise from a normal distribution with mean \(\mu\) and variance \(\sigma^2\). This represents the null (bias) distribution: \[
\beta_i \sim \mathcal{N}(\mu, \sigma^2), \quad \hat\theta_i \sim \mathcal{N}(\theta_i + \beta_i, \hat\epsilon_i^2)
\] where \(\mathcal{N}(a, b)\) denotes a Gaussian distribution with mean \(a\) and variance \(b\). We estimate \(\mu\) and \(\sigma^2\) by maximizing the following likelihood, in which the unobserved \(\beta_i\) are integrated out: \[
\mathcal{L}(\mu,\sigma \mid \theta, \hat\theta, \hat\epsilon) \propto \prod_{i=1}^n \int p(\hat \theta_i \mid \beta_i, \theta_i, \hat\epsilon_i)\, p(\beta_i \mid \mu, \sigma)\, d\beta_i
\] yielding maximum likelihood estimates \(\hat\mu\) and \(\hat\sigma\). We compute a calibrated p-value that uses this empirical null distribution. Let \(\hat\theta_{n+1}\) denote the log of the effect estimate from a new drug–outcome pair, and let \(\hat\epsilon_{n+1}\) denote the corresponding estimated standard error. From the aforementioned assumptions, and assuming \(\beta_{n+1}\) arises from the same null distribution, we have under the null hypothesis (\(\theta_{n+1} = 0\)): \[
\hat\theta_{n+1} \sim \mathcal{N}(\hat\mu, \hat\sigma^2+\hat\epsilon_{n+1}^2).
\] When \(\hat\theta_{n+1}\) is smaller than \(\hat\mu\), the one-sided calibrated p-value for the new pair is \[
\Phi \left( \frac{\hat\theta_{n+1}-\hat\mu}{\sqrt{\hat\sigma^2+\hat\epsilon_{n+1}^2}}\right)
\] where \(\Phi(\cdot)\) denotes the cumulative distribution function of the standard normal distribution. When \(\hat\theta_{n+1}\) is bigger than \(\hat\mu\), the one-sided calibrated p-value is \[
1 - \Phi \left( \frac{\hat\theta_{n+1}-\hat\mu}{\sqrt{\hat\sigma^2+\hat\epsilon_{n+1}^2}}\right).
\]
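The sketch below reimplements the core of this calibration in Python on simulated negative-control estimates (the OHDSI EmpiricalCalibration package is the reference implementation). Because all distributions are Gaussian, integrating out \(\beta_i\) gives the closed form \(\hat\theta_i \sim \mathcal{N}(\mu, \sigma^2+\hat\epsilon_i^2)\) when \(\theta_i = 0\), which is what the likelihood below maximizes; the new estimate being calibrated is hypothetical.

```python
# Sketch of p-value calibration: fit the empirical null (mu, sigma) to simulated
# negative-control estimates, then calibrate the p-value of a new estimate.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
n = 50
eps = rng.uniform(0.05, 0.3, size=n)            # standard errors of the control estimates
bias = rng.normal(0.15, 0.20, size=n)           # unknown per-control bias beta_i
theta_hat = bias + rng.normal(0.0, eps)         # observed log estimates (true effect = 0)

def neg_log_likelihood(params):
    mu, log_sigma = params
    var = np.exp(log_sigma) ** 2 + eps ** 2     # beta_i integrated out analytically
    return -np.sum(stats.norm.logpdf(theta_hat, loc=mu, scale=np.sqrt(var)))

res = optimize.minimize(neg_log_likelihood, x0=[0.0, np.log(0.1)])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# One-sided calibrated p-value for a new (hypothetical) estimate on the log scale
new_theta, new_se = np.log(1.5), 0.10
z = (new_theta - mu_hat) / np.sqrt(sigma_hat ** 2 + new_se ** 2)
p_calibrated = stats.norm.cdf(z) if new_theta < mu_hat else 1.0 - stats.norm.cdf(z)
p_traditional = 1.0 - stats.norm.cdf(new_theta / new_se)
print("calibrated p = %.3f, traditional p = %.2g" % (p_calibrated, p_traditional))
```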
Similarly, we typically observe that the coverage of the \(95\%\) confidence interval is less than \(95\%\): the true effect size is inside the \(95\%\) confidence interval less than \(95\%\) of the time. For confidence interval calibration, Schuemie, Hripcsak, et al. (2018) extend the framework for p-value calibration by also making use of our positive controls. Typically, but not necessarily, the calibrated confidence interval is wider than the nominal confidence interval, reflecting the problems unaccounted for in the standard procedure (such as unmeasured confounding, selection bias and measurement error) but accounted for in the calibration.
Formally, we assume that \(\beta_i\), the bias associated with pair \(i\), again comes from a Gaussian distribution, but this time using a mean and standard deviation that are linearly related to \(\theta_i\), the true effect size: \[ \beta_i \sim \mathcal{N}(\mu(\theta_i), \sigma^2(\theta_i)) \] where \[ \begin{align} \mu(\theta_i) & = a+ b \times \theta_i\\ \sigma(\theta_i)^2 & = c + d\times |\theta_i| \end{align} \]
We estimate \(a\), \(b\), \(c\), and \(d\) by maximizing the marginalized likelihood in which we integrate out the unobserved \(\beta_i\): \[ \mathcal{l}(a,b,c,d | \theta, \hat{\theta}, \hat{\tau}) \propto \prod_{i=1}^n \int p(\hat{\theta_i}|\beta_i,\theta_i,\hat{\tau}_i)p(\beta_i|a,b,c,d,\theta_i)d \beta_i \] yielding maximum likelihood estimates \((\hat a,\hat b, \hat c, \hat d)\). We compute a calibrated CI that uses the systematic error model. Let \(\hat\theta_{n+1}\) again denote the log of the effect estimate for a new outcome of interest, and let \(\hat \tau_{n+1}\) denote the corresponding estimated standard error. From the assumptions above, and assuming \(\beta_{n+1}\) arises from the same systematic error model, we have: \[ \hat \theta_{n+1} \sim \mathcal{N} (\theta_{n+1} + \hat a +\hat b\times \theta_{n+1},\ \hat c+\hat d \times |\theta_{n+1}| +\hat \tau_{n+1}^2). \] We find the lower bound of the calibrated \(95\%\) CI by solving this equation for \(\theta_{n+1}\): \[ \Phi \left( \frac{\theta_{n+1}+\hat a +\hat b\times \theta_{n+1} - \hat\theta_{n+1}}{\sqrt{(\hat c + \hat d \times |\theta_{n+1}|)+\hat \tau_{n+1}^2}}\right) = 0.025 \]
where \(\Phi(\cdot)\) denotes the cumulative distribution function of the standard normal distribution. We find the upper bound similarly for probability \(0.975\). We define the calibrated point estimate by using probability \(0.5\).
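As a sketch of only this interval-finding step, the snippet below assumes the systematic error model parameters \((\hat a,\hat b,\hat c,\hat d)\) have already been estimated (the values, the new estimate, and its standard error are hypothetical) and solves the equations above numerically.

```python
# Sketch of CI calibration: given assumed systematic error parameters a, b, c, d
# and a new log estimate with its standard error, solve for the calibrated
# bounds (0.025 and 0.975) and the calibrated point estimate (0.5).
import numpy as np
from scipy import optimize, stats

a, b, c, d = 0.10, 0.05, 0.02, 0.01          # hypothetical systematic error model
theta_hat, tau_hat = np.log(2.0), 0.15       # hypothetical new estimate and SE

def tail_prob(theta, q):
    mean = theta + a + b * theta             # E[theta_hat | theta] under the model
    sd = np.sqrt(c + d * abs(theta) + tau_hat ** 2)
    return stats.norm.cdf((mean - theta_hat) / sd) - q

lower = optimize.brentq(lambda t: tail_prob(t, 0.025), -10, 10)
upper = optimize.brentq(lambda t: tail_prob(t, 0.975), -10, 10)
point = optimize.brentq(lambda t: tail_prob(t, 0.5), -10, 10)
print("calibrated RR %.2f (95%% CI %.2f-%.2f)"
      % (np.exp(point), np.exp(lower), np.exp(upper)))
```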
Both p-value calibration and confidence interval calibration are implemented in the EmpiricalCalibration package.
When designing a study there are often design choices that are uncertain. For example, should propensity score matching or stratification be used? If stratification is used, how many strata? What is the appropriate time-at-risk? When faced with such uncertainty, one solution is to evaluate various options and observe the sensitivity of the results to the design choice. If the estimate remains essentially the same under the various options, we can say the study is robust to this uncertainty.