Chapter 13 Principles of Differential Analysis Methods
(Nearing et al. 2022) showed that different DA methods can perform quite differently on microbiota (16S) data. Here, we integrate the methods description from that work with our own understanding to give an overall picture for users running the DA functions of XMAS2.
Microbiota data have the following characteristics:
- Compositional: Mathematically, a dataset is defined as compositional if it consists of D parts of nonnegative numbers whose sum is 1 (or satisfies any other constant-sum constraint). It can be formally stated as:
\[S^{D} = \left \{ X = [x_{1}, x_{2}, ..., x_{D}] | x_{i} > 0, i = 1, 2, ..., D; \sum^{D}_{i=1} x_{i} = K \right \}\]
- Over-dispersed: The variance is generally greater than the mean. Possible causes of overdispersion include correlated taxa counts, clustering of subjects, and within-group heterogeneity. Negative binomial models address overdispersion by explicitly modeling the correlated and sparse events via a latent variable.
- Sparse: Microbiome data are typically overdispersed and sparse, with many zeros. When the number of zeros exceeds what standard distributions (e.g., normal, Poisson, binomial, negative binomial, beta, or gamma) can readily fit, the data set is considered 'zero-inflated' (see the sketch below).
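These three properties can be checked directly on a count table. Below is a minimal R sketch on simulated counts (not data or code from XMAS2) that verifies the constant-sum constraint after normalization, computes variance-to-mean ratios, measures the zero fraction, and applies a CLR transform with a pseudocount.
# Toy taxa-by-samples count matrix (simulated for illustration only)
set.seed(123)
otu_tab <- matrix(rnbinom(200, mu = 10, size = 0.5), nrow = 20,
                  dimnames = list(paste0("Taxon", 1:20), paste0("S", 1:10)))

# Compositional: per-sample relative abundances sum to a constant (here K = 1)
rel_ab <- sweep(otu_tab, 2, colSums(otu_tab), "/")
colSums(rel_ab)

# Over-dispersed: the variance of each taxon is much larger than its mean
summary(apply(otu_tab, 1, var) / rowMeans(otu_tab))

# Sparse: a large fraction of the entries are zeros
mean(otu_tab == 0)

# CLR transform (with a pseudocount of 1), one common way to handle compositionality
clr_tab <- apply(otu_tab + 1, 2, function(x) log(x) - mean(log(x)))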
13.1 ALDEx2
ALDEx2 is from (Fernandes et al. 2014)
13.1.2 Input Data Type
All input data should be raw counts:
16S rRNA sequencing (amplicon sequencing);
Metagenomic sequencing (absolute abundance);
RNA-seq.
13.1.3 Input Key Arguments
reads: Raw counts matrix (Rows->Features; columns->Samples).
conditions: A vector of group labels for testing.
mc.samples: The number of Monte Carlo samples, default is 128.
denom: The method for geometric mean calculation, default is "all".
test: The test to perform, default is “t.test”.
13.1.4 Procedures
Drawing Monte Carlo samples from the Dirichlet distribution for each sample, using a uniform prior (addresses sparsity);
Performing a centered log-ratio (CLR) transformation of each realization (addresses compositionality and overdispersion); these two steps are sketched below;
Performing the specified test (Wilcoxon rank-sum or Welch's t-test, depending on the test argument) on the transformed realizations;
Finally, the function returns the expected Benjamini-Hochberg (BH) FDR-corrected p-value for each feature, averaged across the Monte Carlo samples.
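To make steps 1-2 concrete, here is a toy R sketch of a single Dirichlet realization and its CLR transform for one sample. This is our own illustration, not code from ALDEx2, and the taxon counts and prior value below are made up for demonstration.
# One sample's raw counts (made-up values)
set.seed(1)
counts <- c(Taxon1 = 30, Taxon2 = 0, Taxon3 = 120, Taxon4 = 5)

# Step 1: one Monte Carlo draw from Dirichlet(counts + prior), sampled via gamma draws;
# the small uniform prior keeps zero counts from producing zero proportions
prior <- 0.5
gamma_draw <- rgamma(length(counts), shape = counts + prior, rate = 1)
proportions <- gamma_draw / sum(gamma_draw)

# Step 2: CLR transformation of this realization
clr_values <- log(proportions) - mean(log(proportions))

# ALDEx2::aldex() repeats this mc.samples times per sample, runs the chosen test
# on every realization, and reports the expected (averaged) statistics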
13.1.5 Key output
TaxaID: taxa ID.
EffectSize: Effect Size generated by ALDEx2, which equals the Log2FoldChange.
AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.
Median CLR (All/Groups): median value of CLR abundance normalized by ALDEx2.
Occurrence (All/Groups): prevalence of taxa.
13.1.7 XMAS2 Syntax
run_aldex(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
method = "t.test",
p_adjust = "BH",
pvalue_cutoff = 0.05)
# ALDEx2::aldex(reads = otu_tab,
# conditions = sam_tab$Compvar,
# mc.samples = 128,
# test = test,
# effect = TRUE,
# include.sample.summary = FALSE,
# denom = "all",
# verbose = TRUE)
13.2 ANCOM-II
The ANCOM-II implementation used here is version 2.1, derived from (Mandal et al. 2015).
13.2.1 Input Data Type
All input data should be raw counts:
16S rRNA sequencing (amplicon sequencing);
Metagenomic sequencing (absolute abundance);
RNA-seq.
13.2.2 Input Key Arguments
feature_table: Raw counts matrix (Rows->Features; columns->Samples).
meta_data: data.frame with group information.
struc_zero: data.frame with structural zero taxa.
group_var: group variable for comparison.
alpha: the significance level cutoff for testing.
13.2.3 Procedures
The function feature_table_pre_process first identifies outlier zeros and structural zeros:
Outlier zeros, identified by finding outliers in the distribution of taxon counts within each sample grouping, are ignored during differential abundance analysis and replaced with NA;
Structural zeros, taxa that are absent in one grouping but present in the other, are excluded from the statistical testing and automatically called differentially abundant.
Computing additive log-ratios for transformation, with a pseudocount of 1;
Performing Wilcoxon rank-sum tests for significance, with p-values FDR-corrected using the BH method;
Applying a detection threshold on the adjusted p-values: for each taxon, the W statistic counts how many of its log-ratio tests reach nominal significance after adjustment, and taxa whose W ratio exceeds the cutoff are called DA (sketched below).
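The W statistic can be written out explicitly. The sketch below is a simplified illustration of the idea only (the helper ancom_w_sketch is hypothetical and not part of ANCOM-II or XMAS2); it assumes otu_tab is a taxa-by-samples count matrix and groups is a vector of two group labels.
# Simplified W-statistic sketch: test every additive log-ratio of taxon i
# against the other taxa and count how many are significant after BH correction
ancom_w_sketch <- function(otu_tab, groups, alpha = 0.05, w_cutoff = 0.7) {
  log_tab <- log(otu_tab + 1)                      # pseudocount of 1
  n_taxa <- nrow(log_tab)
  W <- numeric(n_taxa)
  for (i in seq_len(n_taxa)) {
    p <- sapply(seq_len(n_taxa)[-i], function(j) {
      ratio <- log_tab[i, ] - log_tab[j, ]         # additive log-ratio of taxon i vs j
      stats::wilcox.test(ratio ~ groups)$p.value
    })
    W[i] <- sum(stats::p.adjust(p, method = "BH") < alpha)
  }
  data.frame(TaxaID = rownames(otu_tab),
             W = W,
             W_ratio = W / (n_taxa - 1),
             detected = W / (n_taxa - 1) > w_cutoff)
}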
13.2.4 Key output
TaxaID: taxa ID.
EffectSize: the W statistic, i.e., the number of tests with adjusted p-value (q-value) below alpha for that taxon.
W_ratio: the ratio of significant tests compared to the total number of taxa.
detected: whether the taxon passed the W_ratio cutoff and is called significantly differentially abundant.
13.2.5 Original Syntax
preprocess_ANCOM <- function(
feature_table,
meta_data,
sample_var = NULL,
group_var = NULL,
out_cut = 0.05,
zero_cut = 0.90,
lib_cut = 1000,
neg_lb = FALSE)
main_ANCOM <- function(
feature_table,
meta_data,
struc_zero = NULL,
group_var = NULL,
p_adj_method = "BH",
alpha = 0.05,
adj_formula = NULL,
rand_formula = NULL,
w_cutoff = 0.7,
...)
13.2.6 XMAS2 Syntax
run_ancom(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
p_adjust = "BH",
pvalue_cutoff = 0.05,
w_cutoff = 0.7)
# preprocess_ANCOM(
# feature_table = otu_tab,
# meta_data = sam_tab %>% tibble::rownames_to_column("TempRowNames"),
# sample_var = "TempRowNames",
# group_var = "Compvar",
# out_cut = 0.05,
# zero_cut = 0.90,
# lib_cut = 1000,
# neg_lb = FALSE)
#
# main_ANCOM(
# feature_table = prepro_res$feature_table,
# meta_data = prepro_res$meta_data,
# struc_zero = prepro_res$structure_zeros,
# group_var = "Compvar",
# p_adj_method = p_adjust,
# alpha = pvalue_cutoff,
# rand_formula = NULL,
# w_cutoff = 0.7)
13.3 corncob
corncob is from (Martin, Witten, and Willis 2020)
13.3.2 Input Data Type
All input data should be raw counts:
16S rRNA sequencing (amplicon sequencing);
Metagenomic sequencing (absolute abundance);
RNA-seq.
13.3.3 Input Key Arguments
data: a data frame containing the OTU table (counts matrix), or a phyloseq object containing the variables in the model.
formula: an object of class formula without the response.
test: test method, options include: “Wald” and “LRT” for two groups comparison (default: “Wald”).
fdr_cutoff: cutoff of FDR.
13.3.4 Procedures
Fitting taxon count abundances to a beta-binomial model, using logit link functions for both the mean and the overdispersion (a small simulation of this model is sketched below);
Performing Wald tests for significance and obtaining BH FDR-corrected p-values.
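As an illustration of what a beta-binomial model with logit links means, here is a short simulation sketch. It is our own illustration under a standard mean/overdispersion parameterization of the beta distribution, not corncob code, and all numbers are made up.
# Simulate one taxon's counts in two groups of 20 samples under a beta-binomial model
set.seed(42)
depth <- 10000                                         # sequencing depth per sample
mu  <- plogis(-4 + 1.0 * rep(c(0, 1), each = 20))      # logit-linear group effect on the mean
phi <- plogis(-3)                                      # overdispersion, also on the logit scale

# Beta parameters with mean mu and overdispersion phi
a <- mu * (1 - phi) / phi
b <- (1 - mu) * (1 - phi) / phi
p_sample <- rbeta(length(mu), a, b)                    # sample-specific relative abundance
counts   <- rbinom(length(mu), size = depth, prob = p_sample)

# The variance of these counts exceeds the plain binomial variance depth * mu * (1 - mu);
# this extra-binomial variation is what corncob's beta-binomial regression captures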
13.3.5 Key output
TaxaID: taxa ID.
EffectSize: effect size indicating the magnitude and direction of the group difference.
AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.
Occurrence (All/Groups): prevalence of taxa.
13.3.6 Original Syntax
differentialTest(
formula,
phi.formula,
formula_null,
phi.formula_null,
data,
link = "logit",
phi.link = "logit",
test,
boot = FALSE,
B = 1000,
sample_data = NULL,
taxa_are_rows = TRUE,
filter_discriminant = TRUE,
fdr_cutoff = 0.05,
fdr = "fdr",
full_output = FALSE,
inits = NULL,
inits_null = NULL,
try_only = NULL,
...
)
13.3.7 XMAS2 Syntax
run_corncob(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
method = c("Wald", "LRT"),
p_adjust = "BH",
pvalue_cutoff = 0.05)
# corncob::differentialTest(
# data = ps_normed_new,
# formula = my_formula,
# phi.formula = my_formula,
# phi.formula_null = my_formula,
# formula_null = ~ 1,
# test = method,
# boot = F,
# fdr_cutoff = pvalue_cutoff,
# fdr = p_adjust)
13.4 DESeq2
DESeq2 is from (Love, Huber, and Anders 2014)
13.4.2 Input Data Type
All input data should be un-normalized raw counts:
16S rRNA sequencing (amplicon sequencing);
Metagenomic sequencing (absolute abundance);
RNA-seq.
13.4.3 Input Key Arguments
countData: Raw counts matrix without normalization (Rows->Features; columns->Samples).
colData: a DataFrame or data.frame with at least a single column.
design: a formula or matrix. The formula expresses how the counts for each gene depend on the variables in colData.
13.4.4 Procedures
estimation of size factors (median-of-ratios; sketched below), which are used to normalize library sizes in a model-based fashion;
estimation of dispersions from the negative binomial likelihood for each feature, and subsequent shrinkage of each dispersion estimate towards the parametric (default) trendline by empirical Bayes;
fitting each feature to the specified class groupings with negative binomial generalized linear models and performing hypothesis testing, for which we chose the default Wald test;
Finally, using the results function to obtain the resulting BH FDR-corrected p-values.
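Step 1 corresponds to DESeq2's median-of-ratios estimator (computed internally by estimateSizeFactors). The sketch below writes it out by hand for illustration; the helper median_of_ratios is hypothetical and assumes counts is a features-by-samples matrix.
# Median-of-ratios size factors, written out manually for illustration
median_of_ratios <- function(counts) {
  log_geo_means <- rowMeans(log(counts))       # log geometric mean per feature
  apply(counts, 2, function(s) {
    ratios <- log(s) - log_geo_means           # log ratio of sample s to the reference
    exp(median(ratios[is.finite(ratios)]))     # features containing zeros drop out here
  })
}
# DESeq2 divides the counts by these per-sample factors inside the NB GLM rather
# than transforming the data up front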
13.4.5 Key output
TaxaID: taxa ID.
logFC: Log2FoldChange by DESeq2::DESeq.
AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.
Occurrence (All/Groups): prevalence of taxa.
13.4.7 XMAS2 Syntax
run_deseq2(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
p_adjust = "BH",
pvalue_cutoff = 0.05,
...)
# DESeq2::DESeqDataSetFromMatrix(
# countData = otu_tab,
# colData = sam_tab,
# design =~ Compvar)
# DESeq2::DESeq(ddsm)
# DESeq2::results(dds, contrast = c("Compvar", group_names))
13.5 edgeR
edgeR is from (Robinson and Oshlack 2010)
13.5.2 Input Data Type
All input data should be raw counts:
16S rRNA sequencing (amplicon sequencing);
Metagenomic sequencing (absolute abundance);
RNA-seq.
13.5.3 Input Key Arguments
counts: Raw counts matrix (Rows->Features; columns->Samples) for DGEList.
design: the design matrix for testing.
contrast: a contrast specifying the comparison groups.
13.5.4 Procedures
Adding a pseudocount of 1 to the non-rarefied feature table and using the function calcNormFactors from the edgeR package to compute relative log expression (RLE) normalization factors;
Estimating negative binomial dispersion parameters with estimateCommonDisp followed by estimateTagwiseDisp, which shrinks feature-wise dispersion estimates through an empirical Bayes approach;
Using exactTest for negative binomial data to identify features that differ between the specified groups;
Correcting the resulting p-values for multiple testing with the BH method via topTags (this workflow is sketched below).
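A minimal sketch of this classic exactTest route is shown below, assuming otu_tab is a taxa-by-samples raw count matrix and groups a two-level factor (both hypothetical here). Note that the XMAS2 code shown in 13.5.7 uses the GLM route (glmFit/glmLRT) instead.
library(edgeR)

# Classic exactTest workflow matching the procedure above
dge <- DGEList(counts = otu_tab + 1, group = groups)   # pseudocount of 1
dge <- calcNormFactors(dge, method = "RLE")            # relative log expression factors
dge <- estimateCommonDisp(dge)
dge <- estimateTagwiseDisp(dge)                        # empirical Bayes shrinkage
et  <- exactTest(dge)
edger_res <- as.data.frame(topTags(et, n = nrow(otu_tab), adjust.method = "BH"))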
13.5.5 Key output
TaxaID: taxa ID.
logFC: Log2FoldChange by edgeR::topTags.
AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.
Occurrence (All/Groups): prevalence of taxa.
13.5.7 XMAS2 Syntax
run_edger(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
p_adjust = "BH",
pvalue_cutoff = 0.05,
...)
# # Filter data
# keep <- rowSums(edgeR::cpm(otu_tab) > 100) >= 2
# otu_tab <- otu_tab[keep, ]
#
# # DGEList object: otu_tab -> colnames:Samples; rownames:Taxa
# dge_obj <- edgeR::DGEList(counts = otu_tab)
# dge_obj_factors <- edgeR::calcNormFactors(dge_obj)
# limma::plotMDS(dge_obj_factors, method = "bcv", col = as.numeric(sam_tab$Compvar))
# graphics::legend("bottomleft", group_names, col = 1:2, pch = 20)
#
# # GLM estimates of dispersion
# dge <- edgeR::estimateGLMCommonDisp(dge_obj_factors, design)
# dge <- edgeR::estimateGLMTrendedDisp(dge, design, method = "power")
#
# # You can change method to "auto", "bin.spline", "power", "spline", "bin.loess".
# # The default is "auto" which chooses "bin.spline" when > 200 tags and "power" otherwise.
# dge <- edgeR::estimateGLMTagwiseDisp(dge, design)
#
# # GLM testing for differential expression:
# fit <- edgeR::glmFit(dge, design)
# fit2 <- edgeR::glmLRT(fit, contrast = contrast) # Normal:-1; Tumor:1
# edgeR_res <- edgeR::topTags(fit2, n = nrow(otu_tab)) %>% data.frame()
13.6 LEfSe
LEfSe is from (Segata et al. 2011)
13.6.2 Input Data Type
Input data can be raw counts or relative abundance:
16S rRNA sequencing (amplicon sequencing);
Metagenomic sequencing (absolute abundance);
RNA-seq.
13.6.3 Input Key Arguments
assays: data.frame or matrix of features (Rows->Features; columns->Samples).
colData: data.frame or matrix with group information.
kruskal.threshold: cutoff of Kruskal-Wallis test p-value.
wilcox.threshold: cutoff of Wilcoxon test p-value.
lda.threshold: cutoff of LDA scores.
13.6.4 Procedures
Normalizing the matrix data using CPM: per-sample normalization of the sum of the values to 1e+06;
Performing a Kruskal-Wallis (which in our two-group case reduces to the Wilcoxon rank-sum) hypothesis test to identify potential differentially abundant features;
Using the differentially abundant features to perform linear discriminant analysis (LDA) of class labels on abundances to estimate the effect sizes for significant features;
Only features with scaled LDA scores above the threshold (default: 2.0) are called DA (steps 1-2 are sketched below).
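The first two steps can be reproduced by hand. The sketch below is our own illustration (not lefser or LEfSe code), assuming otu_tab is a taxa-by-samples count matrix and groups a vector of class labels.
# Step 1: CPM normalization, so every sample sums to 1e+06
cpm_tab <- sweep(otu_tab, 2, colSums(otu_tab), "/") * 1e6

# Step 2: per-feature Kruskal-Wallis test to pre-select candidate features
kw_p <- apply(cpm_tab, 1, function(x) kruskal.test(x, g = factor(groups))$p.value)
candidates <- rownames(cpm_tab)[kw_p < 0.05]

# LEfSe/lefser then runs LDA on the candidates and keeps features whose scaled
# LDA score passes the threshold (default 2.0)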
13.6.7 XMAS2 Syntax
run_lefse(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "CPM",
wl.p = 0.05,
Lda = 2)
run_lefse2(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
subgroup = NULL,
transform = c("identity", "log10", "log10p"),
norm = "CPM",
kw_cutoff = 0.05,
lda_cutoff = 2,
bootstrap_n = 30,
bootstrap_fraction = 2 / 3,
wilcoxon_cutoff = 0.05,
multigrp_strat = FALSE,
strict = c("0", "1", "2"),
sample_min = 10,
only_same_subgrp = FALSE,
curv = FALSE)
# se_object <- SummarizedExperiment::SummarizedExperiment(
# assays = list(counts = otu_tab),
# colData = sam_tab,
# metadata = "Profile")
#
# lefse_res <- lefser(se_object,
# kruskal.threshold = 0.05,
# wilcox.threshold = wl.p,
# lda.threshold = Lda,
# groupCol = "Compvar",
# blockCol = NULL,
# assay = 1L,
# trim.names = TRUE)
13.7 limma-voom
limma-voom is from (Law et al. 2014)
13.7.2 Input Data Type
All input data should be raw counts:
16S rRNA sequencing (amplicon sequencing);
Metagenomic sequencing (absolute abundance);
RNA-seq.
13.7.3 Input Key Arguments
counts: a numeric matrix containing raw counts (Rows->Features; columns->Samples).
design: design matrix with rows corresponding to samples and columns to coefficients to be estimated.
lib.size: numeric vector containing total library sizes for each sample.
span: width of the smoothing window used for the lowess mean-variance trend.
13.7.4 Procedures
Normalizing table by the trimmed mean of M-values (TMM) or TMM with singleton pairing (TMMwsp) option via
calcNormFactors function
(highly sparse data);After normalization, using the limma R package function
voom
to convert normalized counts to log2-counts-per-million and assign precision weights to each observation based on the mean-variance trend;Using the functions
lmFit
,eBayes
, andtopTable
in the limma R package to fit weighted linear regression models, perform tests based on an empirical Bayes moderated t-statistic76 and obtain BH FDR-corrected p-values.
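The TMM step can be wired into the voom call shown in 13.7.7 as follows; this is a sketch, and otu_tab and design are assumed to match the objects used in that code.
# TMM (or TMMwsp for very sparse tables) normalization factors feeding voom
dge <- edgeR::DGEList(counts = otu_tab)
dge <- edgeR::calcNormFactors(dge, method = "TMM")
lib_size <- dge$samples$lib.size * dge$samples$norm.factors   # effective library sizes
voom_out <- limma::voom(otu_tab, design = design, lib.size = lib_size)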
13.7.5 Key output
TaxaID: taxa ID.
EffectSize and logFC: both equal the Log2FoldChange.
AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.
Occurrence (All/Groups): prevalence of taxa.
13.7.7 XMAS2 Syntax
run_limma_voom(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
voom_span = 0.5,
p_adjust = "BH",
pvalue_cutoff = 0.05,
...)
# # voom fitting
# voom_out <- limma::voom(
# otu_tab,
# design = design,
# lib.size = lib_size,
# span = voom_span)
#
# # linear model
# fit_out <- limma::lmFit(voom_out, design = design)
# test_out <- limma::eBayes(fit_out)
# test_df <- limma::topTable(
# test_out,
# number = nrow(otu_tab),
# adjust.method = p_adjust)
13.8 MaAsLin2
MaAsLin2 is from (Mallick et al. 2021)
13.8.1 Vignette
Multivariable association discovery in population-scale meta-omics studies: MaAsLin2
13.8.2 Input Data Type
Input data can be raw counts or relative abundance:
16S rRNA sequencing (amplicon sequencing);
Metagenomic sequencing (absolute abundance);
RNA-seq.
13.8.3 Input Key Arguments
input_data: The tab-delimited input file of features (Rows->Features; columns->Samples).
input_metadata: The tab-delimited input file of metadata.
normalization: The normalization method to apply.
transform: The transform to apply.
analysis_method: The analysis method for linear regression to apply.
random_effects: The random effects for the model.
fixed_effects: The fixed effects for the model, such as the groups to compare.
13.8.4 Procedures
Applying total sum scaling (TSS) normalization and an arcsine square-root (AST) transformation (sketched below);
The function fits a linear model (without random effects, unless specified) of each feature's transformed abundance on the specified sample grouping, tests significance using a Wald test, and outputs BH FDR-corrected p-values.
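The TSS normalization and AST transformation mentioned above amount to the following two lines; this sketch is our own illustration (not Maaslin2 internals) and assumes otu_tab is a taxa-by-samples count matrix.
tss_tab <- sweep(otu_tab, 2, colSums(otu_tab), "/")   # total sum scaling -> proportions
ast_tab <- asin(sqrt(tss_tab))                        # arcsine square-root transform
# Maaslin2 then fits one linear model per feature on the transformed abundances
# against the fixed (and optional random) effects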
13.8.5 Key output
TaxaID: taxa ID.
AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.
Occurrence (All/Groups): prevalence of taxa.
13.8.6 Original Syntax
Maaslin2(
input_data,
input_metadata,
output,
min_abundance = 0.0,
min_prevalence = 0.1,
min_variance = 0.0,
normalization = "TSS",
transform = "LOG",
analysis_method = "LM",
max_significance = 0.25,
random_effects = NULL,
fixed_effects = NULL,
correction = "BH",
standardize = TRUE,
cores = 1,
plot_heatmap = TRUE,
plot_scatter = TRUE,
heatmap_first_n = 50,
reference = NULL
)
13.8.7 XMAS2 Syntax
run_maaslin2(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "LOG",
norm = "TSS",
outdir = "./demo_output",
method = "LM",
p_adjust = "BH",
pvalue_cutoff = 0.05)
# Maaslin2(input_data = otu_tab %>% t() %>% data.frame(), # row->sampleID; col->Taxa
# input_metadata = sam_tab,
# output = outdir,
# normalization = norm,
# transform = transform,
# analysis_method = method,
# max_significance = pvalue_cutoff,
# fixed_effects = "Compvar",
# correction = p_adjust,
# plot_heatmap = FALSE,
# plot_scatter = FALSE,
# cores = 2)
13.9 metagenomeSeq
metagenomeSeq is from (Paulson et al. 2013)
13.9.2 Input Data Type
All input data should be raw counts:
16S rRNA sequencing (amplicon sequencing);
Metagenomic sequencing (absolute abundance);
RNA-seq.
13.9.3 Input Key Arguments
counts: Raw counts matrix (Rows->Features; columns->Samples) passed to newMRexperiment.
phenoData: AnnotatedDataFrame with sample (group) information.
p: the cumulative sum scaling quantile (estimated by cumNormStat) used by cumNorm.
mod: model matrix of the group variable passed to fitFeatureModel or fitZig.
13.9.4 Procedures
Converting the counts and sample information into newMRexperiment object by the function
newMRexperiment
from the metagenomeSeq R package;Using
cumNormStat
andcumNorm
to apply cumulative sum-scaling normalization (CSS), which attempts to normalize sequence counts based on the lower-quartile abundance of features;Using
fitFeatureModel
to fit normalized feature counts with zero-inflated log-normal models (with pseudo-counts of 1 added prior to log2 transformation) and perform empirical Bayes moderated t-tests, andMRfulltable
to obtain BH FDR-corrected p-values.
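A minimal sketch of this workflow is shown below. It assumes otu_tab is a taxa-by-samples raw count matrix and sam_tab a data.frame with a grouping column Compvar (matching the commented XMAS2 call in 13.9.7); it is an illustration rather than the exact XMAS2 implementation.
library(metagenomeSeq)

# Build the MRexperiment object from counts and sample metadata
mgs <- newMRexperiment(counts = otu_tab,
                       phenoData = Biobase::AnnotatedDataFrame(sam_tab))

# Cumulative sum scaling (CSS) normalization
mgs <- cumNorm(mgs, p = cumNormStat(mgs))

# Zero-inflated log-normal model for the group variable, then the result table
mod <- model.matrix(~ Compvar, data = pData(mgs))
fit <- fitFeatureModel(mgs, mod)
res <- MRfulltable(fit, number = nrow(otu_tab))   # includes BH-adjusted p-values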
13.9.7 XMAS2 Syntax
run_metagenomeseq(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "CSS",
method = c(
"ZILN", "ZIG"
),
p_adjust = "BH",
pvalue_cutoff = 0.05,
...)
# # run DA
# if (method == "ZILN") {
# fit <- metagenomeSeq::fitFeatureModel(mgs_summarized, mod)
# } else {
# fit <- metagenomeSeq::fitZig(mgs_summarized, mod)
# }
#
# # metagenomeSeq vignette: We recommend the user remove features based on
# # the number of estimated effective samples, please see
# # calculateEffectiveSamples. We recommend removing features with less
# # than the average number of effective samples in all features. In
# # essence, setting eff = .5 when using MRcoefs, MRfulltable, or MRtable.
# res <- metagenomeSeq::MRcoefs(
# fit,
# number = ntaxa(ps_normed_new),
# adjustMethod = p_adjust,
# group = 3,
# eff = 0.5)
# res <- dplyr::rename(
# res,
# Pvalue = .data$pvalues,
# AdjustedPvalue = .data$adjPvalues)
13.10 t-test
13.10.1 Input Data Type
Input data can be raw counts or relative abundance:
16S rRNA sequencing (amplicon sequencing);
Metagenomic sequencing (absolute abundance);
RNA-seq.
13.10.2 Input Key Arguments
x: Raw counts matrix (Rows->Features; columns->Samples).
y: A vector of group labels for testing.
13.10.3 Procedures
Optionally applying total sum scaling (TSS) normalization to the feature table;
Performing an unpaired Welch’s t-test for each feature to compare the specified groups;
Correcting the resulting p-values with the BH method.
13.10.4 Key output
TaxaID: taxa ID.
AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.
Occurrence (All/Groups): prevalence of taxa.
13.10.6 XMAS2 Syntax
run_ttest(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
p_adjust = "BH",
pvalue_cutoff = 0.05,
paired = FALSE,
paired_column = NULL)
# # run t-test
# if (paired) {
# if (!is.null(paired_column)) {
# res <- paired_t(x = otu_tab,
# y = sam_tab,
# group = "Compvar",
# gnames = group_names,
# paired = paired,
# PID = paired_column,
# p_ad = p_adjust)
# } else {
# stop("Please provide the paired column for paired test")
# }
#
# } else {
# res <- apply(otu_tab, 1, function(x, y) {
# dat <- data.frame(value = as.numeric(x), group = y)
# rest <- t.test(data = dat, value ~ group)
# res <- c(rest$statistic, rest$p.value)
# return(res)
# }, sam_tab$Compvar) %>%
# t() %>% data.frame() %>%
# stats::setNames(c("Statistic", "Pvalue")) %>%
# tibble::rownames_to_column("TaxaID") %>%
# dplyr::mutate(AdjustedPvalue = p.adjust(as.numeric(Pvalue), method = p_adjust))
# }
13.11 Wilcoxon test
13.11.1 Input Data Type
Input data can be raw counts or relative abundance:
16S rRNA sequencing (amplicon sequencing);
Metagenomic sequencing (absolute abundance);
RNA-seq.
13.11.2 Input Key Arguments
x: Raw counts matrix (Rows->Features; columns->Samples).
y: A vector of group labels for testing.
13.11.3 Procedures
Optionally applying a CLR transformation (after adding a pseudocount of 1) to the abundances;
Performing Wilcoxon rank-sum tests for each feature to compare the specified sample groupings;
Correcting the resulting p-values with the BH method.
13.11.4 Key output
TaxaID: taxa ID.
AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.
Occurrence (All/Groups): prevalence of taxa.
13.11.6 XMAS2 Syntax
run_wilcox(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
p_adjust = "BH",
pvalue_cutoff = 0.05,
paired = FALSE,
paired_column = NULL)
# # run wilcox rank sum test
# if (paired) {
# if (!is.null(paired_column)) {
# res <- paired_wilcox(x = otu_tab,
# y = sam_tab,
# group = "Compvar",
# gnames = group_names,
# paired = paired,
# PID = paired_column,
# p_ad = p_adjust)
# } else {
# stop("Please provide the paired column for paired test")
# }
#
# } else {
# res <- apply(otu_tab, 1, function(x, y) {
# dat <- data.frame(value = as.numeric(x), group = y)
# rest <- wilcox.test(data = dat, value ~ group)
# res <- c(rest$statistic, rest$p.value)
# return(res)
# }, sam_tab$Compvar) %>%
# t() %>% data.frame() %>%
# stats::setNames(c("Statistic", "Pvalue")) %>%
# tibble::rownames_to_column("TaxaID") %>%
# dplyr::mutate(AdjustedPvalue = p.adjust(as.numeric(Pvalue), method = p_adjust))
# }
13.12 RAIDA
RAIDA is from (Sohn, Du, and An 2015)
13.12.1 Vignette
A robust approach for identifying differentially abundant features in metagenomic samples: RAIDA
13.12.2 Input Data Type
All input data should be raw counts:
16S rRNA sequencing (amplicon sequencing);
Metagenomic sequencing (absolute abundance);
RNA-seq.
13.12.3 Input Key Arguments
c.data: a data frame containing counts for both conditions.
n.lib: a vector containing the numbers of samples for both conditions.
13.12.4 Procedures
RAIDA (Ratio Approach for Identifying Differential Abundance) is a robust approach for identifying differentially abundant features in metagenomic samples across different conditions. It utilizes ratios between features in a modified zero-inflated log-normal model.
Utilizing the ratios between the counts of features in each sample, eliminating possible problems associated with counts on different scales within and between conditions;
Fitting ratios containing zeros with a modified zero-inflated log-normal (ZIL) model via an EM algorithm (addresses sparsity);
Constructing a moderated t-statistic for the log ratio of each feature \(y_{ij}\) using the estimated mean \(\mu\) and variance \(r^{2}\), and obtaining \(P\) values for the null hypotheses for all features;
Adjusting \(P\) values using a multiple testing correction method.
13.12.5 Key output
TaxaID: taxa ID.
AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.
Occurrence (All/Groups): prevalence of taxa.
13.13 Summary
- The most commonly assumed data distribution is the negative binomial (DESeq2, edgeR).
- No data transformation is performed for the negative binomial methods or for metagenomeSeq, because non-normality is already accounted for in their statistical models.
- The remaining parametric methods (ALDEx2 t-test, t-test, limma-voom) use statistical tests that assume a Gaussian distribution of the data; transformations, here a log transform of some kind, are therefore needed before analysis.
- Three methods (ALDEx2 Wilcoxon, ANCOM-II, LEfSe) are considered non-parametric (no underlying data distribution is assumed) because their statistical tests convert the data to ranks.
- The negative binomial methods (DESeq2, edgeR) calculate scaling factors for each sample to account for uneven sequencing depth.
- Cumulative sum scaling (CSS) is used for metagenomeSeq and MaAsLin2 (note that the XMAS2 run_maaslin2 default is TSS). Total sum scaling (TSS; also referred to as relative abundance) is performed for LEfSe and for methods without a built-in normalization function (e.g., the t-test), as this strategy is commonly used in the literature. Log-ratio based transformations are used for the ALDEx2 methods and ANCOM and, in addition to TSS, are applied to methods without a built-in normalization function.
13.14 Advantages and disadvantages
- limma-voom, edgeR, Wilcoxon (CLR), and LEfSe output a high number of significant ASVs on average, while ALDEx2 and ANCOM-II tend to identify only a relatively small number of ASVs as significant;
- limma-voom, edgeR, Wilcoxon (CLR), and LEfSe have previously been found to exhibit a high FDR; in contrast, ANCOM-II appropriately controls the FDR;
- We clearly recommend that users avoid edgeR (a tool primarily intended for RNA-seq data) as well as LEfSe (without p-value correction) for DA testing with 16S rRNA gene data;
- Users should also be aware that limma-voom and the Wilcoxon (CLR) approach may perform poorly on unfiltered data that is highly sparse;
- More generally, we recommend that users employ several methods and focus on significant features identified by most tools (a sketch of this consensus strategy follows).
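As a practical sketch of this consensus strategy (the object ps_demo, the group column "Group", and the group labels are hypothetical), one can run several XMAS2 DA functions on the same phyloseq object and keep the taxa flagged by most of them, assuming each result is a data.frame carrying the TaxaID and AdjustedPvalue columns described in this chapter.
# Run several DA methods on the same phyloseq object (hypothetical example data)
da_results <- list(
  wilcox  = run_wilcox(ps = ps_demo, group = "Group", group_names = c("Case", "Control")),
  aldex   = run_aldex(ps = ps_demo, group = "Group", group_names = c("Case", "Control")),
  maaslin = run_maaslin2(ps = ps_demo, group = "Group", group_names = c("Case", "Control"))
)

# Keep taxa significant in at least two of the three methods
sig_taxa <- lapply(da_results, function(res) res$TaxaID[res$AdjustedPvalue < 0.05])
hits <- table(unlist(sig_taxa))
consensus_taxa <- names(hits)[hits >= 2]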