---
title: "Tutorial: Estimate the extent of etiologic heterogeneity"
author: "Emily C. Zabor"
date: "Last updated: `r format(Sys.Date(), '%B %d, %Y')`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{Tutorial: Estimate the extent of etiologic heterogeneity}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


## Introduction

This tutorial will introduce you to the calculations underlying estimation of the extent of etiologic heterogeneity in either a case-control or case-only study. 

This tutorial will also demonstrate how to use the functions included in this package based on a simulated example dataset named `subtype_data`. This simulated dataset contains 1200 case subjects and 800 control subjects. Cases are split into four pre-defined etiologically distinct subtypes. There are two continuous risk factors and one binary risk factor available on all subjects.


## Case-control methods

The primary methodology was introduced in detail in:

> Begg CB, Zabor EC, Bernstein JL, Bernstein L, Press MF, Seshan VE. A conceptual and methodological framework for investigating etiologic heterogeneity. *Stat Med.* 2013;32(29):5039-52.

The method involves identifying a pre-defined subtype solution involving a set of $M$ disease subtypes, and calculating a measure of etiologic heterogeneity. 

In a case-control study, the measure of etiologic heterogeneity is denoted $D$. To calculate $D$, one must first perform polytomous logistic regression of the known risk factors for disease on the subtypes and obtain estimated risk predictions from this model for each of the subtypes for each control subject. Since the measure is population-based it is calculated solely using the study controls. Let $i$ denote these control subjects $i=1, \ldots, N_H$, where $N_H$ denotes the total number of non-diseased control subjects, and let $m$ index the set of disease subtypes, $m=1, \ldots, M$. The risk predictions obtained from the polytomous logistic regression model for the $i$th individual are denoted $r_{mi}$ such that the total risk of disease for that individual is $r_i = \sum_{m=1}^M r_{mi}$. Let the coefficients of variation of the subtype risks in the population be denoted $C_m^2 = v_m / \mu_m^2$ where $v_m = N_H^{-1}\sum_{i=1}^{N_H} r_{mi}^2 - \mu_m^2$ and $\mu_m = N_H^{-1}\sum_{i=1}^{N_H} r_{mi}$. Let the corresponding total coefficient of variation be denoted $C^2 = v / \mu^2$, where $\mu$ and $v$ are the overall disease risk mean and variance, respectively. Then the measure of etiologic heterogeneity is defined as

$$D = \sum_{m=1}^M \pi_m C_m^2 - C^2,$$

where $\pi_m$ represents the prevalence of the $m$th disease subtype. 

To estimate $D$ in the context of a case-control study, we use the `d()` function.`d()` relies on the `mlogit` function from the `mlogit` package to fit the polytomous logistic regression model. We supply `d()` with the name of the subtype variable, the number of subtypes, a list of risk factors, and the dataset name. See the function [documentation](https://www.emilyzabor.com/riskclustr/reference/d.html) for details of argument specification.

```{r}
library(riskclustr)

d(
  label = "subtype",
  M = 4, 
  factors = list("x1", "x2", "x3"),
  data = subtype_data
)
```

And we see that using all three risk factors, $D=0.410$ for the four disease subtypes of interest in the case-control setting.


## Case-only methods

When control subjects are not available, then a case-only approximation of $D$, denoted $D^\ast$, can be applied. The details of this calculation can be found in:

> Begg CB, Seshan VE, Zabor EC, et al. Genomic investigation of etiologic heterogeneity: methodologic challenges. *BMC Med Res Methodol.* 2014;14:138. 

Briefly, whereas the variance and covariance terms in the equation for $D$ are averaged over the controls in the case-control setting, in a case-only setting they are averaged over the cases, which represent a risk-biased sample from the population. The goal of an analysis of this type is not to interpret the magnitude of $D^\ast$, but rather to use $D^\ast$ to rank different subtyping schemes and identify the one that maximizes etiologic heterogeneity.

To demonstrate the usage, we first limit `subtype_data` to the cases only. Controls are denoted by a `0` for the `subtype` variable.

```{r}
subtype_cases <- subtype_data[subtype_data$subtype != 0, ]
```

We specify the arguments to `dstar()` as before for `d()`, but supplying our new case-only dataset to the `data` argument. Now, the subtype variable must be numeric, with cases coded as $1$ through $M$ according to the subtype to which they belong. The highest frequency level of `label` will be used as the reference level, for model stability.

```{r}
dstar(
  label = "subtype",
  M = 4, 
  factors = list("x1", "x2", "x3"),
  data = subtype_cases
)
```

And we see that using all three risk factors, $D^\ast=0.402$ for the four disease subtypes of interest in the case-only setting.