Title: | Functions to Study Etiologic Heterogeneity |
---|---|
Description: | A collection of functions related to the study of etiologic heterogeneity both across disease subtypes and across individual disease markers. The included functions allow one to quantify the extent of etiologic heterogeneity in the context of a case-control study, and provide p-values to test for etiologic heterogeneity across individual risk factors. Begg CB, Zabor EC, Bernstein JL, Bernstein L, Press MF, Seshan VE (2013) <doi:10.1002/sim.5902>. |
Authors: | Emily C. Zabor [aut, cre] |
Maintainer: | Emily C. Zabor <[email protected]> |
License: | GPL-2 |
Version: | 0.4.1 |
Built: | 2024-10-14 04:33:35 UTC |
Source: | https://github.com/zabore/riskclustr |
d estimates the incremental explained risk variation across a set of pre-specified disease subtypes in a case-control study. This function takes the name of the disease subtype variable, the number of disease subtypes, a list of risk factors, and a wide dataset, and transforms the dataset into the format required for model fitting. The polytomous logistic regression model is then fit using mlogit, and D is calculated from the resulting risk predictions.
d(label, M, factors, data)
label |
the name of the subtype variable in the data. This should be a
numeric variable with values 0 through M, where 0 indicates control subjects.
Must be supplied in quotes, e.g. |
M |
the number of subtypes. Must satisfy M >= 2. |
factors |
a list of the names of the binary or continuous risk factors.
For binary risk factors the lowest level will be used as the reference level.
e.g. |
data |
the name of the dataframe that contains the relevant variables. |
Begg, C. B., Zabor, E. C., Bernstein, J. L., Bernstein, L., Press, M. F., & Seshan, V. E. (2013). A conceptual and methodological framework for investigating etiologic heterogeneity. Stat Med, 32(29), 5039-5052. doi: 10.1002/sim.5902
d( label = "subtype", M = 4, factors = list("x1", "x2", "x3"), data = subtype_data )
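The coding that the label argument expects (a numeric variable taking values 0 through M, with 0 for controls) can be checked before calling d. The following is a minimal sketch in base R; `check_label` and the `toy` data frame are hypothetical names introduced for illustration and are not part of the package.

```r
# Hypothetical helper (not part of riskclustr): verify that a label column
# follows the documented coding before it is passed to d().
check_label <- function(data, label, M) {
  stopifnot(is.character(label), length(label) == 1, label %in% names(data))
  x <- data[[label]]
  if (!is.numeric(x)) stop("label must be a numeric variable")
  if (M < 2) stop("M must be at least 2")
  if (!all(x %in% 0:M)) stop("label values must lie in 0 through M")
  invisible(TRUE)
}

# Made-up data: 0 = control, 1 to 3 = disease subtypes
toy <- data.frame(subtype = c(0, 0, 1, 2, 3, 1), x1 = c(0.2, -1.1, 0.5, 1.3, -0.4, 0.9))
check_label(toy, "subtype", M = 3)
```

A check like this fails early with a readable message rather than letting a miscoded label reach the model-fitting step.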
dstar estimates the incremental explained risk variation across a set of pre-specified disease subtypes in a case-only study. The highest frequency level of label is used as the reference level, for stability. This function takes the name of the disease subtype variable, the number of disease subtypes, a list of risk factors, and a wide case-only dataset, and transforms the dataset into the format required for model fitting. The polytomous logistic regression model is then fit using mlogit, and D* is calculated from the resulting risk predictions.
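To see which subtype would serve as the reference level under the highest-frequency rule described above, the subtype counts can be tabulated directly. This is a sketch in base R on made-up labels; the package's internal implementation may differ.

```r
# Made-up case-only subtype labels (controls already excluded)
subtype <- c(1, 2, 2, 3, 2, 1, 3, 2)

# Count cases per subtype and pick the most frequent level
counts <- table(subtype)
ref_level <- as.numeric(names(counts)[which.max(counts)])
ref_level
```

Note that `which.max` returns the first maximum, so ties are resolved in favor of the lowest label in this sketch.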
dstar(label, M, factors, data)
label |
the name of the subtype variable in the data. This should be a
numeric variable with values 0 through M, where 0 indicates control subjects.
Must be supplied in quotes, e.g. |
M |
the number of subtypes. Must satisfy M >= 2. |
factors |
a list of the names of the binary or continuous risk factors.
For binary risk factors the lowest level will be used as the reference level.
e.g. |
data |
the name of the case-only dataframe that contains the relevant variables. |
Begg, C. B., Seshan, V. E., Zabor, E. C., Furberg, H., Arora, A., Shen, R., . . . Hsieh, J. J. (2014). Genomic investigation of etiologic heterogeneity: methodologic challenges. BMC Med Res Methodol, 14, 138.
# Exclude controls from data as this is a case-only calculation
dstar( label = "subtype", M = 4, factors = list("x1", "x2", "x3"), data = subtype_data[subtype_data$subtype > 0, ] )
eh_test_marker takes a list of individual disease markers, a list of risk factors, a variable name denoting case versus control status, and a dataframe. It returns results addressing two questions: whether each risk factor differs across levels of the disease subtypes, and whether each risk factor differs across levels of each individual disease marker of which the disease subtypes are comprised.
Input is a dataframe that contains the individual disease markers, the risk factors of interest, and an indicator of case or control status. The disease markers must be binary and must have levels 0 or 1 for cases. The disease markers should be left missing for control subjects. For categorical disease markers, a reference level should be selected and then indicator variables for each remaining level of the disease marker should be created. Risk factors can be either binary or continuous. For categorical risk factors, a reference level should be selected and then indicator variables for each remaining level of the risk factor should be created.
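Creating indicator variables for a categorical risk factor, as described above, can be done in base R with model.matrix. The sketch below uses a made-up data frame `toy` and a hypothetical three-level variable `smoke`; the column names `smoke_former` and `smoke_current` are illustrative choices.

```r
# Made-up data with a three-level categorical risk factor;
# "never" is listed first so it acts as the reference level
toy <- data.frame(
  case  = c(1, 1, 0, 1, 0, 1),
  smoke = factor(
    c("never", "former", "current", "never", "current", "former"),
    levels = c("never", "former", "current")
  )
)

# model.matrix() builds one 0/1 column per non-reference level;
# the first column is the intercept, which is dropped here
ind <- model.matrix(~ smoke, data = toy)[, -1, drop = FALSE]
colnames(ind) <- c("smoke_former", "smoke_current")
toy <- cbind(toy, ind)
toy
```

The two new columns could then be listed among the risk factors, for example factors = list("smoke_former", "smoke_current").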
eh_test_marker(markers, factors, case, data, digits = 2)
markers |
a list of the names of the binary disease markers.
Each must have levels 0 or 1 for case subjects. This value will be missing
for all control subjects. e.g. |
factors |
a list of the names of the binary or continuous risk factors.
For binary risk factors the lowest level will be used as the reference level.
e.g. |
case |
denotes the variable that contains each subject's status as a
case or control. This value should be 1 for cases and 0 for controls.
Argument must be supplied in quotes, e.g. |
data |
the name of the dataframe that contains the relevant variables. |
digits |
the number of digits to round the odds ratios and associated confidence intervals, and the estimates and associated standard errors. Defaults to 2. |
Returns a list.
beta
is a matrix containing the raw estimates from the
polytomous logistic regression model fit with mlogit
with a row for each risk factor and a column for each disease subtype.
beta_se
is a matrix containing the raw standard errors from the
polytomous logistic regression model fit with mlogit
with a row for each risk factor and a column for each disease subtype.
eh_pval
is a vector of unformatted p-values for testing whether each
risk factor differs across the levels of the disease subtype.
gamma
is a matrix containing the estimated disease marker parameters, obtained as linear combinations of the beta estimates, with a row for each risk factor and a column for each disease marker.
gamma_se
is a matrix containing the estimated disease marker
standard errors, obtained based on a transformation of the beta
standard errors, with a row for each risk factor and a column for each
disease marker.
gamma_p
is a matrix of p-values for testing whether each risk factor
differs across levels of each disease marker, with a row for each risk
factor and a column for each disease marker.
or_ci_p
is a dataframe with the odds ratio (95% CI) for each risk
factor/subtype combination, as well as a column of formatted etiologic
heterogeneity p-values.
beta_se_p
is a dataframe with the estimates (SE) for
each risk factor/subtype combination, as well as a column of formatted
etiologic heterogeneity p-values.
gamma_se_p
is a dataframe with disease marker estimates (SE) and
their associated p-values.
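The relationship between the raw estimates in beta and beta_se and a formatted odds ratio with confidence interval can be sketched as follows. This assumes a standard Wald interval, exp(beta ± z × SE), which is an assumption about the formatting rather than a statement of the package's internals; the matrices are made-up values.

```r
# Made-up estimates: rows are risk factors, columns are subtypes
beta <- matrix(c(0.50, 0.10, -0.20, 0.30), nrow = 2,
               dimnames = list(c("x1", "x2"), c("1", "2")))
beta_se <- matrix(c(0.10, 0.05, 0.08, 0.12), nrow = 2,
                  dimnames = dimnames(beta))
digits <- 2

# Assumed Wald 95% interval on the log-odds scale, then exponentiated
z  <- qnorm(0.975)
or <- exp(beta)
lo <- exp(beta - z * beta_se)
hi <- exp(beta + z * beta_se)

# Round to `digits` decimals and paste into "OR (lower-upper)" strings
fmt <- function(v) formatC(v, format = "f", digits = digits)
or_ci <- matrix(
  paste0(fmt(or), " (", fmt(lo), "-", fmt(hi), ")"),
  nrow = nrow(beta), dimnames = dimnames(beta)
)
or_ci
```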
Emily C Zabor
# Run for two binary tumor markers, which will combine to form four subtypes
eh_test_marker( markers = list("marker1", "marker2"), factors = list("x1", "x2", "x3"), case = "case", data = subtype_data, digits = 2 )
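How two binary markers cross-classify into four subtypes can be illustrated in base R. The numbering of the subtypes below is an assumption made for this sketch, since the order used internally is not stated here; the `toy` data frame is made up.

```r
# Made-up data: markers are 0/1 for cases and missing for controls
toy <- data.frame(
  case    = c(1, 1, 1, 1, 0, 0),
  marker1 = c(0, 1, 0, 1, NA, NA),
  marker2 = c(0, 0, 1, 1, NA, NA)
)

# Assumed numbering: 1 = (0,0), 2 = (1,0), 3 = (0,1), 4 = (1,1);
# controls receive label 0
toy$subtype <- ifelse(toy$case == 1, 1 + toy$marker1 + 2 * toy$marker2, 0)
toy
```

With K binary markers the same idea yields up to 2^K subtypes, which is why two markers give four.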
eh_test_subtype takes the name of the variable containing the pre-specified subtype labels, the number of subtypes, a list of risk factors, and the name of the dataframe, and returns results related to the question of whether each risk factor differs across levels of the disease subtypes. Input is a dataframe that contains the risk factors of interest and a variable containing numeric class labels that is 0 for control subjects. Risk factors can be either binary or continuous. For categorical risk factors, a reference level should be selected and then indicator variables for each remaining level of the risk factor should be created. Categorical risk factors entered as is will be treated as ordinal. The multinomial logistic regression model is fit using mlogit.
eh_test_subtype(label, M, factors, data, digits = 2)
label |
the name of the subtype variable in the data. This should be a
numeric variable with values 0 through M, where 0 indicates control subjects.
Must be supplied in quotes, e.g. |
M |
the number of subtypes. Must satisfy M >= 2. |
factors |
a list of the names of the binary or continuous risk factors.
For binary or categorical risk factors the lowest level will be used as the
reference level.
e.g. |
data |
the name of the dataframe that contains the relevant variables. |
digits |
the number of digits to round the odds ratios and associated confidence intervals, and the estimates and associated standard errors. Defaults to 2. |
Returns a list.
beta
is a matrix containing the raw estimates from the
polytomous logistic regression model fit with mlogit
with a row for each risk factor and a column for each disease subtype.
beta_se
is a matrix containing the raw standard errors from the
polytomous logistic regression model fit with mlogit
with a row for each risk factor and a column for each disease subtype.
eh_pval
is a vector of unformatted p-values for testing whether each
risk factor differs across the levels of the disease subtype.
or_ci_p
is a dataframe with the odds ratio (95% CI) for each risk
factor/subtype combination, as well as a column of formatted etiologic
heterogeneity p-values.
beta_se_p
is a dataframe with the estimates (SE) for
each risk factor/subtype combination, as well as a column of formatted
etiologic heterogeneity p-values.
var_covar
contains the variance-covariance matrix associated with the model estimates contained in beta.
Emily C Zabor
eh_test_subtype( label = "subtype", M = 4, factors = list("x1", "x2", "x3"), data = subtype_data, digits = 2 )
optimal_kmeans_d applies k-means clustering using the kmeans function with many random starts. The D value is then calculated for the cluster solution at each random start using the d function, and the cluster solution that maximizes D is returned, along with the corresponding value of D. In this way the optimally etiologically heterogeneous subtype solution can be identified from possibly high-dimensional disease marker data.
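The search strategy described above (one k-means fit per random start, score each solution, keep the best) can be sketched in base R. The scoring function `score_fn` below is a stand-in and is NOT the D statistic; it is a hypothetical placeholder so that the sketch runs without fitting a polytomous model. All data are simulated.

```r
set.seed(1)

# Simulated data: 60 cases, 4 continuous markers, one continuous risk factor
markers <- matrix(rnorm(60 * 4), ncol = 4)
x1 <- rnorm(60)

# Stand-in criterion (NOT D): spread of the per-cluster means of x1.
# In the package, this role is played by the d function.
score_fn <- function(cluster) var(tapply(x1, cluster, mean))

n_starts <- 20
scores <- numeric(n_starts)
best <- NULL
for (i in seq_len(n_starts)) {
  # One random start per iteration so every solution gets scored
  fit <- kmeans(markers, centers = 3, nstart = 1, iter.max = 50)
  scores[i] <- score_fn(fit$cluster)
  if (is.null(best) || scores[i] > best$score) {
    best <- list(score = scores[i], cluster = fit$cluster)
  }
}

# Cluster sizes of the highest-scoring solution
table(best$cluster)
```

Scoring every start separately differs from calling kmeans once with a large nstart, which would keep only the solution with the smallest within-cluster sum of squares.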
optimal_kmeans_d(markers, M, factors, case, data, nstart = 100, seed = NULL)
markers |
a vector of the names of the disease markers. These markers
should be of a type that is suitable for use with
|
M |
is the number of clusters to identify using
|
factors |
a list of the names of the binary or continuous risk factors.
For binary risk factors the lowest level will be used as the reference level.
e.g. |
case |
denotes the variable that contains each subject's status as a
case or control. This value should be 1 for cases and 0 for controls.
Argument must be supplied in quotes, e.g. |
data |
the name of the dataframe that contains the relevant variables. |
nstart |
the number of random starts to use with
|
seed |
an integer argument passed to |
Returns a list.
optimal_d
The D value for the optimal D solution.
optimal_d_data
The original data frame supplied through the data argument, with a column called optimal_d_label added for the optimal D subtype label. This has the subtype assignment for cases, and is 0 for all controls.
Begg, C. B., Zabor, E. C., Bernstein, J. L., Bernstein, L., Press, M. F., & Seshan, V. E. (2013). A conceptual and methodological framework for investigating etiologic heterogeneity. Stat Med, 32(29), 5039-5052.
# Cluster 30 disease markers to identify the optimally
# etiologically heterogeneous 3-subtype solution
res <- optimal_kmeans_d( markers = paste0("y", 1:30), M = 3, factors = list("x1", "x2", "x3"), case = "case", data = subtype_data, nstart = 100, seed = 81110224 )
# Look at the value of D for the optimal D solution
res[["optimal_d"]]
# Look at a table of the optimal D solution
table(res[["optimal_d_data"]]$optimal_d_label)
posthoc_factor_test takes an eh_test_subtype fit and returns an overall p-value for a specified factor variable.
posthoc_factor_test(fit, factor, nlevels)
fit |
the resulting |
factor |
is the name of the factor variable of interest, supplied
in quotes, e.g. |
nlevels |
is the number of levels the factor variable in |
Returns a list.
pval
is a formatted p-value.
pval_raw
is the raw, unformatted p-value.
Emily C Zabor
A dataset containing 2000 patients: 1200 cases and 800 controls. There are four subtypes, and both numeric and character subtype labels. The subtypes are formed by cross-classification of two binary disease markers, disease marker 1 and disease marker 2. There are three risk factors, two continuous and one binary. One of the continuous risk factors and the binary risk factor are related to the disease subtypes. There are also 30 continuous tumor markers, 20 of which are related to the subtypes and 10 of which represent noise, which could be used in a clustering analysis.
subtype_data
A data frame with 2000 rows, one row per patient:
Indicator of case control status, 1 for cases and 0 for controls
Numeric subtype label, 0 for control subjects
Character subtype label
Disease marker 1
Disease marker 2
Continuous risk factor 1
Continuous risk factor 2
Binary risk factor
Continuous tumor marker 1
Continuous tumor marker 2
Continuous tumor marker 3
Continuous tumor marker 4
Continuous tumor marker 5
Continuous tumor marker 6
Continuous tumor marker 7
Continuous tumor marker 8
Continuous tumor marker 9
Continuous tumor marker 10
Continuous tumor marker 11
Continuous tumor marker 12
Continuous tumor marker 13
Continuous tumor marker 14
Continuous tumor marker 15
Continuous tumor marker 16
Continuous tumor marker 17
Continuous tumor marker 18
Continuous tumor marker 19
Continuous tumor marker 20
Continuous tumor marker 21
Continuous tumor marker 22
Continuous tumor marker 23
Continuous tumor marker 24
Continuous tumor marker 25
Continuous tumor marker 26
Continuous tumor marker 27
Continuous tumor marker 28
Continuous tumor marker 29
Continuous tumor marker 30