Title: | Causal Batch Effects |
---|---|
Description: | Software that provides functions for detecting and removing group-level effects from high-dimensional scientific data which, when combined with additional assumptions, allow for causal conclusions, as described in our manuscripts Bridgeford et al. (2024) <doi:10.1101/2021.09.03.458920> and Bridgeford et al. (2023) <arXiv:2307.13868>. Also provides utilities for generating simulations and for balancing covariates across multiple groups/batches of data via matching and propensity trimming for more than two groups. |
Authors: | Eric W. Bridgeford [aut, cre], Michael Powell [ctb], Brian Caffo [ctb], Joshua T. Vogelstein [ctb] |
Maintainer: | Eric W. Bridgeford <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.3.0 |
Built: | 2024-11-07 09:22:58 UTC |
Source: | https://github.com/neurodata/causal_batch |
A function for performing k-way matching using the MatchIt package. Looks for samples which have corresponding matches across all other treatment levels.
cb.align.kway_match( Ts, Xs, match.form, reference = NULL, match.args = list(method = "nearest", exact = NULL, replace = FALSE, caliper = 0.1), retain.ratio = 0.05 )
Ts |
|
Xs |
|
match.form |
A formula of columns from |
reference |
the name of the reference/control batch, against which to match. Defaults to |
match.args |
A named list of arguments for the |
retain.ratio |
If the number of samples retained is less than |
a list, containing the following:
Retained.Ids |
an [m] vector consisting of the sample ids of the n original samples that were retained after matching. |
Reference |
the reference batch. |
For more details see the help vignette:
vignette("causal_balancing", package = "causalBatch")
Eric W. Bridgeford
Eric W. Bridgeford, et al. "A Causal Perspective for Batch Effects: When is no answer better than a wrong answer?" bioRxiv (2024).
Daniel E. Ho, et al. "MatchIt: Nonparametric Preprocessing for Parametric Causal Inference" Journal of Statistical Software (2011).
library(causalBatch)
sim <- cb.sims.sim_linear(a=-1, n=100, err=1/8, unbalancedness=1.5)
cb.align.kway_match(sim$Ts, data.frame(Covar=sim$Xs), "Covar")
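The default match.args request nearest-neighbor matching without replacement and with a caliper. As a rough illustration of that idea only, the following base R sketch greedily pairs samples from a reference batch with samples from one other batch on a single covariate. The variable names, the simulated data, and the use of a caliper on the raw covariate scale are assumptions made for this sketch; it is neither the MatchIt algorithm nor the implementation used by cb.align.kway_match.

```r
# Minimal sketch (assumption): greedy 1:1 nearest-neighbor matching with a
# caliper and without replacement, on one covariate, for two batches.
set.seed(1)
x_ref <- runif(30, -1, 0.5)   # hypothetical covariate, reference batch
x_oth <- runif(30, -0.5, 1)   # hypothetical covariate, other batch
caliper <- 0.1                # maximum allowed covariate distance

available <- rep(TRUE, length(x_oth))       # other-batch samples not yet used
matches <- rep(NA_integer_, length(x_ref))  # matched index for each reference sample

for (i in seq_along(x_ref)) {
  d <- abs(x_oth - x_ref[i])
  d[!available] <- Inf          # without replacement: skip used samples
  j <- which.min(d)
  if (d[j] <= caliper) {        # accept only matches inside the caliper
    matches[i] <- j
    available[j] <- FALSE
  }
}

retained_ref <- which(!is.na(matches))  # reference samples with a match
retained_oth <- matches[retained_ref]   # their matched partners
```

Reference samples without a partner inside the caliper are dropped, which is the sense in which matching retains only a subset of the original samples.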
A function for implementing the vector matching procedure, a pre-processing step for causal conditional distance correlation. Uses propensity scores to strategically include/exclude samples from subsequent inference, based on whether there are samples with similar propensity scores across all treatment levels (conceptually, a k-way "propensity trimming"). It is imperative that this function is used in conjunction with domain expertise, to ensure that the covariates are not colliders and that the system satisfies the strong ignorability condition, in order to derive causal conclusions.
cb.align.vm_trim( Ts, Xs, prop.form = NULL, retain.ratio = 0.05, ddx = FALSE, reference = NULL )
Ts |
|
Xs |
|
prop.form |
a formula specifying a propensity scoring model. Defaults to |
retain.ratio |
If the number of samples retained is less than |
ddx |
whether to show additional diagnosis messages. Defaults to |
reference |
the name of a reference label, against which to align other labels. Defaults to |
an [m] vector containing the indices of samples retained after vector matching.
For more details see the help vignette:
vignette("causal_balancing", package = "causalBatch")
Eric W. Bridgeford
Michael J. Lopez, et al. "Estimation of Causal Effects with Multiple Treatments" Statistical Science (2017).
library(causalBatch)
sim <- cb.sims.sim_linear(a=-1, n=100, err=1/8, unbalancedness=3)
cb.align.vm_trim(sim$Ts, sim$Xs)
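To make the notion of propensity trimming concrete, the sketch below works through a simplified two-group case in base R: it fits a logistic propensity model and retains only samples whose estimated propensity lies in the region covered by both groups. The simulated data, the variable names, and the "common support" rule used here are assumptions for illustration; cb.align.vm_trim handles more than two treatment levels and may differ in its details.

```r
# Minimal sketch (assumption): two-group propensity trimming to the region
# of common support, using a logistic propensity model.
set.seed(1)
n <- 200
Ts <- rbinom(n, 1, 0.5)                                   # hypothetical group labels
Xs <- ifelse(Ts == 1, rnorm(n, 0.5), rnorm(n, -0.5))      # covariate shifted by group

# estimated probability of belonging to group 1 given the covariate
ps <- fitted(glm(Ts ~ Xs, family = binomial()))

# common support: between the largest group-wise minimum and the
# smallest group-wise maximum of the propensity scores
lo_ps <- max(tapply(ps, Ts, min))
hi_ps <- min(tapply(ps, Ts, max))

retained <- which(ps >= lo_ps & ps <= hi_ps)  # indices kept for inference
length(retained) / n                          # fraction of samples retained
```

Samples outside the common support have no comparable counterpart in the other group, so they are excluded before any downstream inference.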
A function for implementing the AIPW conditional ComBat (AIPW cComBat) algorithm. This algorithm allows users to remove batch effects (in each dimension), while adjusting for known confounding variables. It is imperative that this function is used in conjunction with domain expertise (e.g., to ensure that the covariates are not colliders, and that the system could be argued to satisfy the ignorability condition) to derive causal conclusions. See citation for more details as to the conditions under which conclusions derived are causal.
cb.correct.aipw_cComBat( Ys, Ts, Xs, aipw.form, covar.out.form = NULL, reference = NULL, retain.ratio = 0.05 )
Ys |
an |
Ts |
|
Xs |
|
aipw.form |
A covariate model, given as a formula. Applies for the estimation of propensities for the AIPW step. |
covar.out.form |
A covariate model, given as a formula. Applies for the outcome regression step of the |
retain.ratio |
If the number of samples retained is less than |
apply.oos |
A boolean that indicates whether or not to apply the learned batch effect correction to non-matched samples that are still within a region of covariate support. Defaults to |
Note: This function is experimental, and has not been tested on real data. It has only been tested with simulated data with binary (0 or 1) exposures.
a list, containing the following:
Ys.corrected |
an [m, d] matrix, for the m retained samples in d dimensions, after correction. |
Ts |
an [m] vector of the labels of the m retained samples, with K < n levels. |
Xs |
the r covariates/confounding variables for each of the m retained samples. |
Model |
the fit batch effect correction model. |
Corrected.Ids |
the ids to which batch effect correction was applied. |
For more details see the help vignette:
vignette("causal_ccombat", package = "causalBatch")
Eric W. Bridgeford
Eric W. Bridgeford, et al. "A Causal Perspective for Batch Effects: When is no answer better than a wrong answer?" bioRxiv (2024).
Daniel E. Ho, et al. "MatchIt: Nonparametric Preprocessing for Parametric Causal Inference" Journal of Statistical Software (2011).
W Evan Johnson, et al. "Adjusting batch effects in microarray expression data using empirical Bayes methods" Biostatistics (2007).
library(causalBatch)
sim <- cb.sims.sim_linear(a=-1, n=100, err=1/8, unbalancedness=3)
cb.correct.aipw_cComBat(sim$Ys, sim$Ts, data.frame(Covar=sim$Xs), "Covar")
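For readers unfamiliar with augmented inverse probability weighting, the following base R sketch shows the generic AIPW estimator of a mean difference for a binary exposure and a single outcome dimension. It combines a propensity model with per-group outcome regressions. The data-generating process and all variable names are assumptions for this sketch; it illustrates the AIPW principle only and is not the cComBat correction performed by cb.correct.aipw_cComBat.

```r
# Minimal sketch (assumption): generic AIPW estimate of a mean difference
# between two groups, adjusting for one confounder.
set.seed(1)
n <- 500
X <- rnorm(n)                               # hypothetical confounder
Tr <- rbinom(n, 1, plogis(X))               # exposure depends on X
Y <- 1 * Tr + 2 * X + rnorm(n, sd = 0.5)    # true group effect is 1
dat <- data.frame(Y = Y, Tr = Tr, X = X)

# propensity model: P(Tr = 1 | X)
e_hat <- fitted(glm(Tr ~ X, family = binomial(), data = dat))

# outcome regressions fit separately within each group
m1 <- predict(lm(Y ~ X, data = dat[dat$Tr == 1, ]), newdata = dat)
m0 <- predict(lm(Y ~ X, data = dat[dat$Tr == 0, ]), newdata = dat)

# AIPW: outcome-model prediction plus inverse-probability-weighted residual
mu1 <- mean(m1 + Tr * (Y - m1) / e_hat)
mu0 <- mean(m0 + (1 - Tr) * (Y - m0) / (1 - e_hat))
aipw_effect <- mu1 - mu0
aipw_effect
```

The estimator remains consistent if either the propensity model or the outcome model is correctly specified, which is why both formulas (aipw.form and covar.out.form) appear in the function signature above.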
This function applies an Augmented Inverse Probability Weighting (AIPW) ComBat model for batch effect correction to new data.
cb.correct.apply_aipw_cComBat(Ys, Ts, Xs, Model)
Ys |
an |
Ts |
|
Xs |
|
Model |
a list containing the following parameters:
This model is output after fitting with |
Note: This function is experimental, and has not been tested on real data. It has only been tested with simulated data with binary (0 or 1) exposures.
an [n, d] matrix, the batch-effect corrected data.
library(causalBatch)
sim <- cb.sims.sim_linear(a=-1, n=200, err=1/8, unbalancedness=3)
# fit batch effect correction for first 100 samples
cb.fit <- cb.correct.matching_cComBat(sim$Ys[1:100,,drop=FALSE], sim$Ts[1:100], data.frame(Covar=sim$Xs[1:100,,drop=FALSE]), "Covar")
# apply to all samples
cor.dat <- cb.correct.apply_cComBat(sim$Ys, sim$Ts, data.frame(Covar=sim$Xs), cb.fit$Model)
ComBat allows users to adjust for batch effects in datasets where the batch covariate is known, using methodology described in Johnson et al. 2007. It uses either parametric or non-parametric empirical Bayes frameworks for adjusting data for batch effects. Users are returned an expression matrix that has been corrected for batch effects. The input data are assumed to be cleaned and normalized before batch effect removal.
cb.correct.apply_cComBat(Ys, Ts, Xs, Model)
Ys |
an |
Ts |
|
Xs |
|
Model |
a list containing the following parameters:
This model is output after fitting with |
Note: this code is adapted directly from the ComBat algorithm featured in the 'sva' package.
an [n, d] matrix, the batch-effect corrected data.
library(causalBatch)
sim <- cb.sims.sim_linear(a=-1, n=200, err=1/8, unbalancedness=3)
# fit batch effect correction for first 100 samples
cb.fit <- cb.correct.matching_cComBat(sim$Ys[1:100,,drop=FALSE], sim$Ts[1:100], data.frame(Covar=sim$Xs[1:100,,drop=FALSE]), "Covar")
# apply to all samples
cor.dat <- cb.correct.apply_cComBat(sim$Ys, sim$Ts, data.frame(Covar=sim$Xs), cb.fit$Model)
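ComBat builds on a per-feature location and scale adjustment of each batch. The base R sketch below shows only that core idea: each batch is standardized feature by feature and then mapped back to the pooled mean and standard deviation. It deliberately omits the empirical Bayes shrinkage and the covariate adjustment that ComBat and cComBat perform, and the simulated data and variable names are assumptions for illustration.

```r
# Minimal sketch (assumption): plain location/scale batch adjustment,
# without empirical Bayes shrinkage or covariate adjustment.
set.seed(1)
n <- 100
d <- 3
batch <- rep(c(0, 1), each = n / 2)            # hypothetical batch labels
Ys <- matrix(rnorm(n * d), n, d)
Ys[batch == 1, ] <- 2 * Ys[batch == 1, ] + 3   # inject a shift and a rescaling

grand_mean <- colMeans(Ys)      # pooled per-feature mean
pooled_sd <- apply(Ys, 2, sd)   # pooled per-feature standard deviation

Ys_adj <- Ys
for (b in unique(batch)) {
  idx <- batch == b
  # standardize each feature within the batch
  z <- scale(Ys[idx, , drop = FALSE])
  # map back to the pooled location and scale
  Ys_adj[idx, ] <- sweep(sweep(z, 2, pooled_sd, `*`), 2, grand_mean, `+`)
}

# per-batch feature means now agree
rbind(colMeans(Ys_adj[batch == 0, ]), colMeans(Ys_adj[batch == 1, ]))
```

After the adjustment both batches share the same per-feature mean and standard deviation; if covariates differ between batches, this naive adjustment can also remove covariate-driven signal, which is the situation the conditional and causal variants in this package are meant to address.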
A function for implementing the matching conditional ComBat (matching cComBat) algorithm. This algorithm allows users to remove batch effects (in each dimension), while adjusting for known confounding variables. It is imperative that this function is used in conjunction with domain expertise (e.g., to ensure that the covariates are not colliders, and that the system could be argued to satisfy the ignorability condition) to derive causal conclusions. See citation for more details as to the conditions under which conclusions derived are causal.
cb.correct.matching_cComBat( Ys, Ts, Xs, match.form, covar.out.form = NULL, prop.form = NULL, reference = NULL, match.args = list(method = "nearest", exact = NULL, replace = FALSE, caliper = 0.1), retain.ratio = 0.05, apply.oos = FALSE )
Ys |
an |
Ts |
|
Xs |
|
match.form |
A formula of columns from |
covar.out.form |
A covariate model, given as a formula. Applies for the outcome regression step of the |
prop.form |
A propensity model, given as a formula. Applies for the estimation of propensities for the propensity trimming step. Defaults to |
reference |
the name of the reference/control batch, against which to match. Defaults to |
match.args |
A named list of arguments for the |
retain.ratio |
If the number of samples retained is less than |
apply.oos |
A boolean that indicates whether or not to apply the learned batch effect correction to non-matched samples that are still within a region of covariate support. Defaults to |
a list, containing the following:
Ys.corrected |
an [m, d] matrix, for the m retained samples in d dimensions, after correction. |
Ts |
an [m] vector of the labels of the m retained samples, with K < n levels. |
Xs |
the r covariates/confounding variables for each of the m retained samples. |
Model |
the fit batch effect correction model. See ComBat for details. |
InSample.Ids |
the ids which were used to fit the batch effect correction model. |
Corrected.Ids |
the ids to which batch effect correction was applied. Differs from InSample.Ids if apply.oos is TRUE. |
For more details see the help vignette:
vignette("causal_ccombat", package = "causalBatch")
Eric W. Bridgeford
Eric W. Bridgeford, et al. "A Causal Perspective for Batch Effects: When is no answer better than a wrong answer?" bioRxiv (2024).
Daniel E. Ho, et al. "MatchIt: Nonparametric Preprocessing for Parametric Causal Inference" Journal of Statistical Software (2011).
W Evan Johnson, et al. "Adjusting batch effects in microarray expression data using empirical Bayes methods" Biostatistics (2007).
library(causalBatch)
sim <- cb.sims.sim_linear(a=-1, n=100, err=1/8, unbalancedness=3)
cb.correct.matching_cComBat(sim$Ys, sim$Ts, data.frame(Covar=sim$Xs), "Covar")
A function for implementing the causal conditional distance correlation (causal cDCorr) algorithm. This algorithm allows users to identify whether a treatment causes changes in an outcome, given assorted covariates/confounding variables. It is imperative that this function is used in conjunction with domain expertise (e.g., to ensure that the covariates are not colliders, and that the system satisfies the strong ignorability condition) to derive causal conclusions. See citation for more details as to the conditions under which conclusions derived are causal.
cb.detect.caus_cdcorr( Ys, Ts, Xs, prop.form = NULL, R = 1000, dist.method = "euclidean", distance = FALSE, seed = 1, num.threads = 1, retain.ratio = 0.05, ddx = FALSE )
Ys |
Either:
|
Ts |
|
Xs |
|
prop.form |
a formula specifying a propensity scoring model. Defaults to |
R |
the number of repetitions for permutation testing. Defaults to |
dist.method |
the method used for computing distance matrices. Defaults to |
distance |
a boolean for whether (or not) |
seed |
a random seed to set. Defaults to |
num.threads |
The number of threads for parallel processing (if desired). Defaults to |
retain.ratio |
If the number of samples retained is less than |
ddx |
whether to show additional diagnosis messages. Defaults to |
a list, containing the following:
Test |
The outcome of the statistical test, from cdcov.test. |
Retained.Ids |
The sample indices retained after vector matching, which correspond to the samples for which statistical inference is performed. |
For more details see the help vignette:
vignette("causal_cdcorr", package = "causalBatch")
Eric W. Bridgeford
Eric W. Bridgeford, et al. "A Causal Perspective for Batch Effects: When is no answer better than a wrong answer?" bioRxiv (2024).
Eric W. Bridgeford, et al. "Learning sources of variability from high-dimensional observational studies" arXiv (2023).
Xueqin Wang, et al. "Conditional Distance Correlation" Journal of the American Statistical Association (2015).
library(causalBatch)
sim <- cb.sims.sim_linear(a=-1, n=100, err=1/8, unbalancedness=3)
cb.detect.caus_cdcorr(sim$Ys, sim$Ts, sim$Xs)
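The argument R controls the number of repetitions used for permutation testing. As a sketch of how a permutation p-value is obtained in general, the base R example below shuffles group labels R times and compares the observed statistic with the resulting null distribution. The test statistic here (an absolute difference in means), the unconditional shuffling of labels, and the simulated data are assumptions chosen for brevity; cb.detect.caus_cdcorr uses conditional distance correlation, which this sketch does not implement.

```r
# Minimal sketch (assumption): generic permutation test with R repetitions,
# using a simple difference-in-means statistic.
set.seed(1)
n <- 100
Ts <- rbinom(n, 1, 0.5)     # hypothetical group labels
Ys <- rnorm(n) + 1 * Ts     # outcome with a group effect of 1

# test statistic: absolute difference in group means
stat <- function(y, t) abs(mean(y[t == 1]) - mean(y[t == 0]))

R <- 1000
observed <- stat(Ys, Ts)
# null distribution: recompute the statistic under shuffled labels
null_stats <- replicate(R, stat(Ys, sample(Ts)))

# add-one correction keeps the p-value strictly positive
p_value <- (1 + sum(null_stats >= observed)) / (1 + R)
p_value
```

Larger values of R give a finer-grained p-value (the smallest attainable value is 1 / (1 + R)) at the cost of proportionally more computation.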
Impulse Simulation
cb.sims.sim_impulse( n = 100, pi = 0.5, eff_sz = 1, alpha = 2, unbalancedness = 1, err = 1/2, null = FALSE, a = -0.5, b = 1/2, c = 4, nbreaks = 200 )
n |
the number of samples. Defaults to |
pi |
the balance between the classes, where samples will be from group 1
with probability |
eff_sz |
the treatment effect between the different groups. Defaults to |
alpha |
the alpha for the covariate sampling procedure. Defaults to |
unbalancedness |
the level of covariate dissimilarity between the covariates
for each of the groups. Defaults to |
err |
the level of noise for the simulation. Defaults to |
null |
whether to generate a null simulation. Defaults to |
a |
the first parameter for the covariate/outcome relationship. Defaults to |
b |
the second parameter for the covariate/outcome relationship. Defaults to |
c |
the third parameter for the covariate/outcome relationship. Defaults to |
nbreaks |
the number of breakpoints for computing the expected outcome at a given covariate level
for each batch. Defaults to |
a list, containing the following:
Ys |
an |
Ts |
an |
Xs |
an |
Eps |
an |
x.bounds |
the theoretical bounds for the covariate values. |
Ytrue |
an |
Ttrue |
an |
Xtrue |
an |
Effect |
The batch effect magnitude. |
Overlap |
the theoretical degree of overlap between the covariate distributions for each of the two groups/batches. |
oracle_fn |
A function for fitting outcomes given covariates. |
An impulse-shaped relationship between the covariate and the outcome. The first dimension of the outcome is:
where is the probability density function for the normal distribution with
mean
and standard deviation
.
where the batch/group labels are:
The beta coefficient for the covariate sampling is:
The covariate values for the first batch are:
and the covariate values for the second batch are:
Note that , or that the covariates are symmetric
about the origin in distribution.
Finally, the error terms are:
For more details see the help vignette:
vignette("causal_simulations", package = "causalBatch")
Eric W. Bridgeford
Eric W. Bridgeford, et al. "A Causal Perspective for Batch Effects: When is no answer better than a wrong answer?" bioRxiv (2024).
library(causalBatch)
sim <- cb.sims.sim_impulse()
Impulse Simulation with Asymmetric Covariates
cb.sims.sim_impulse_asycov( n = 100, pi = 0.5, eff_sz = 1, alpha = 2, unbalancedness = 1, null = FALSE, a = -0.5, b = 1/2, c = 4, err = 1/2, nbreaks = 200 )
n |
the number of samples. Defaults to |
pi |
the balance between the classes, where samples will be from group 1
with probability |
eff_sz |
the treatment effect between the different groups. Defaults to |
alpha |
the alpha for the covariate sampling procedure. Defaults to |
unbalancedness |
the level of covariate dissimilarity between the covariates
for each of the groups. Defaults to |
null |
whether to generate a null simulation. Defaults to |
a |
the first parameter for the covariate/outcome relationship. Defaults to |
b |
the second parameter for the covariate/outcome relationship. Defaults to |
c |
the third parameter for the covariate/outcome relationship. Defaults to |
err |
the level of noise for the simulation. Defaults to |
nbreaks |
the number of breakpoints for computing the expected outcome at a given covariate level
for each batch. Defaults to |
a list, containing the following:
Ys |
an |
Ts |
an |
Xs |
an |
Eps |
an |
x.bounds |
the theoretical bounds for the covariate values. |
Ytrue |
an |
Ttrue |
an |
Xtrue |
an |
Effect |
The batch effect magnitude. |
Overlap |
the theoretical degree of overlap between the covariate distributions for each of the two groups/batches. |
oracle_fn |
A function for fitting outcomes given covariates. |
An impulse-shaped relationship between the covariate and the outcome. The first dimension of the outcome is:
where is the probability density function for the normal distribution with
mean
and standard deviation
.
where the batch/group labels are:
The beta coefficient for the covariate sampling is:
The covariate values for the first batch are asymmetric, in that for the first batch:
and the covariate values for the second batch are:
Finally, the error terms are:
For more details see the help vignette:
vignette("causal_simulations", package = "causalBatch")
Eric W. Bridgeford
Eric W. Bridgeford, et al. "A Causal Perspective for Batch Effects: When is no answer better than a wrong answer?" bioRxiv (2024).
library(causalBatch)
sim <- cb.sims.sim_impulse_asycov()
Linear Simulation
cb.sims.sim_linear( n = 100, pi = 0.5, eff_sz = 1, alpha = 2, unbalancedness = 1, err = 1/2, null = FALSE, a = -2, b = -1, nbreaks = 200 )
n |
the number of samples. Defaults to |
pi |
the balance between the classes, where samples will be from group 1
with probability |
eff_sz |
the treatment effect between the different groups. Defaults to |
alpha |
the alpha for the covariate sampling procedure. Defaults to |
unbalancedness |
the level of covariate dissimilarity between the covariates
for each of the groups. Defaults to |
err |
the level of noise for the simulation. Defaults to |
null |
whether to generate a null simulation. Defaults to |
a |
the first parameter for the covariate/outcome relationship. Defaults to |
b |
the second parameter for the covariate/outcome relationship. Defaults to |
nbreaks |
the number of breakpoints for computing the expected outcome at a given covariate level
for each batch. Defaults to |
a list, containing the following:
Ys |
an |
Ts |
an |
Xs |
an |
Eps |
an |
x.bounds |
the theoretical bounds for the covariate values. |
Ytrue |
an |
Ttrue |
an |
Xtrue |
an |
Effect |
The batch effect magnitude. |
Overlap |
the theoretical degree of overlap between the covariate distributions for each of the two groups/batches. |
oracle_fn |
A function for fitting outcomes given covariates. |
A linear relationship between the covariate and the outcome. The first dimension of the outcome is:
where the batch/group labels are:
The beta coefficient for the covariate sampling is:
The covariate values for the first batch are:
and the covariate values for the second batch are:
Finally, the error terms are:
For more details see the help vignette:
vignette("causal_simulations", package = "causalBatch")
Eric W. Bridgeford
Eric W. Bridgeford, et al. "A Causal Perspective for Batch Effects: When is no answer better than a wrong answer?" bioRxiv (2024).
library(causalBatch)
sim <- cb.sims.sim_linear()
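The simulations report an Overlap value describing how much the covariate distributions of the two groups overlap. One common way to quantify such overlap is the overlapping coefficient, the integral of the pointwise minimum of the two densities. The base R sketch below computes it numerically on a grid. Whether the package uses exactly this definition is an assumption, and the two Beta densities and their shape parameters are hypothetical stand-ins rather than the distributions used by cb.sims.sim_linear.

```r
# Minimal sketch (assumption): overlapping coefficient of two densities,
# computed with the trapezoid rule on a regular grid.
overlap_coef <- function(f0, f1, grid) {
  g <- pmin(f0, f1)                      # pointwise minimum of the densities
  step <- grid[2] - grid[1]
  sum((g[-1] + g[-length(g)]) / 2) * step
}

shape_a <- 2                  # hypothetical shape parameter
shape_b <- 2 * 3              # hypothetical, larger value = less overlap
grid <- seq(0, 1, length.out = 201)

f0 <- dbeta(grid, shape_a, shape_b)   # stand-in density for group 0
f1 <- dbeta(grid, shape_b, shape_a)   # mirrored stand-in density for group 1

overlap <- overlap_coef(f0, f1, grid)
overlap
```

A value near 1 indicates nearly identical covariate distributions, while a value near 0 indicates that the groups occupy almost disjoint covariate ranges, which is the regime where matching and trimming discard the most samples.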
Sigmoidal Simulation
cb.sims.sim_sigmoid( n = 100, pi = 0.5, eff_sz = 1, alpha = 2, unbalancedness = 1, null = FALSE, a = -4, b = 8, err = 1/2, nbreaks = 200 )
n |
the number of samples. Defaults to |
pi |
the balance between the classes, where samples will be from group 1
with probability |
eff_sz |
the treatment effect between the different groups. Defaults to |
alpha |
the alpha for the covariate sampling procedure. Defaults to |
unbalancedness |
the level of covariate dissimilarity between the covariates
for each of the groups. Defaults to |
null |
whether to generate a null simulation. Defaults to |
a |
the first parameter for the covariate/outcome relationship. Defaults to |
b |
the second parameter for the covariate/outcome relationship. Defaults to |
err |
the level of noise for the simulation. Defaults to |
nbreaks |
the number of breakpoints for computing the expected outcome at a given covariate level
for each batch. Defaults to |
a list, containing the following:
Ys |
an |
Ts |
an |
Xs |
an |
Eps |
an |
x.bounds |
the theoretical bounds for the covariate values. |
Ytrue |
an |
Ttrue |
an |
Xtrue |
an |
Effect |
The batch effect magnitude. |
Overlap |
the theoretical degree of overlap between the covariate distributions for each of the two groups/batches. |
oracle_fn |
A function for fitting outcomes given covariates. |
A sigmoidal relationship between the covariate and the outcome. The first dimension of the outcome is:
where the batch/group labels are:
The beta coefficient for the covariate sampling is:
The covariate values for the first batch are:
and the covariate values for the second batch are:
Finally, the error terms are:
For more details see the help vignette:
vignette("causal_simulations", package = "causalBatch")
Eric W. Bridgeford
Eric W. Bridgeford, et al. "A Causal Perspective for Batch Effects: When is no answer better than a wrong answer?" bioRxiv (2024).
library(causalBatch)
sim <- cb.sims.sim_sigmoid()