This tutorial will introduce you to the calculations underlying estimation of the extent of etiologic heterogeneity in either a case-control or case-only study.
This tutorial will also demonstrate how to use the functions included
in this package based on a simulated example dataset named
subtype_data
. This simulated dataset contains 1200 case
subjects and 800 control subjects. Cases are split into four pre-defined
etiologically distinct subtypes. There are two continuous risk factors
and one binary risk factor available on all subjects.
The primary methodology was introduced in detail in:
Begg CB, Zabor EC, Bernstein JL, Bernstein L, Press MF, Seshan VE. A conceptual and methodological framework for investigating etiologic heterogeneity. Stat Med. 2013;32(29):5039-52.
The method involves identifying a pre-defined subtype solution involving a set of M disease subtypes, and calculating a measure of etiologic heterogeneity.
In a case-control study, the measure of etiologic heterogeneity is denoted D. To calculate D, one must first perform polytomous logistic regression of the known risk factors for disease on the subtypes and obtain estimated risk predictions from this model for each of the subtypes for each control subject. Since the measure is population-based it is calculated solely using the study controls. Let i denote these control subjects i = 1, …, NH, where NH denotes the total number of non-diseased control subjects, and let m index the set of disease subtypes, m = 1, …, M. The risk predictions obtained from the polytomous logistic regression model for the ith individual are denoted rmi such that the total risk of disease for that individual is $r_i = \sum_{m=1}^M r_{mi}$. Let the coefficients of variation of the subtype risks in the population be denoted Cm2 = vm/μm2 where $v_m = N_H^{-1}\sum_{i=1}^{N_H} r_{mi}^2 - \mu_m^2$ and $\mu_m = N_H^{-1}\sum_{i=1}^{N_H} r_{mi}$. Let the corresponding total coefficient of variation be denoted C2 = v/μ2, where μ and v are the overall disease risk mean and variance, respectively. Then the measure of etiologic heterogeneity is defined as
$$D = \sum_{m=1}^M \pi_m C_m^2 - C^2,$$
where πm represents the prevalence of the mth disease subtype.
To estimate D in the
context of a case-control study, we use the d()
function.d()
relies on the mlogit
function
from the mlogit
package to fit the polytomous logistic
regression model. We supply d()
with the name of the
subtype variable, the number of subtypes, a list of risk factors, and
the dataset name. See the function documentation
for details of argument specification.
library(riskclustr)
d(
label = "subtype",
M = 4,
factors = list("x1", "x2", "x3"),
data = subtype_data
)
#> [1] 0.4100995
And we see that using all three risk factors, D = 0.410 for the four disease subtypes of interest in the case-control setting.
When control subjects are not available, then a case-only approximation of D, denoted D*, can be applied. The details of this calculation can be found in:
Begg CB, Seshan VE, Zabor EC, et al. Genomic investigation of etiologic heterogeneity: methodologic challenges. BMC Med Res Methodol. 2014;14:138.
Briefly, whereas the variance and covariance terms in the equation for D are averaged over the controls in the case-control setting, in a case-only setting they are averaged over the cases, which represent a risk-biased sample from the population. The goal of an analysis of this type is not to interpret the magnitude of D*, but rather to use D* to rank different subtyping schemes and identify the one that maximizes etiologic heterogeneity.
To demonstrate the usage, we first limit subtype_data
to
the cases only. Controls are denoted by a 0
for the
subtype
variable.
We specify the arguments to dstar()
as before for
d()
, but supplying our new case-only dataset to the
data
argument. Now, the subtype variable must be numeric,
with cases coded as 1 through M according to the subtype to which
they belong. The highest frequency level of label
will be
used as the reference level, for model stability.
dstar(
label = "subtype",
M = 4,
factors = list("x1", "x2", "x3"),
data = subtype_cases
)
#> [1] 0.4022017
And we see that using all three risk factors, D* = 0.402 for the four disease subtypes of interest in the case-only setting.