Chapter 13 Principles of Differential Analysis Methods

(Nearing et al. 2022) showed that different DA methods can perform quite differently on microbiota (16S) data. Here, we combine the methods descriptions from that study with our own understanding to give an overall picture for users running the DA functions of XMAS2.

Microbiota data have the following characteristics:

  1. Compositional: Mathematically, data are defined as compositional if they consist of D nonnegative parts whose sum is 1 or some other constant K. Formally:

\[S^{D} = \left \{ X = [x_{1}, x_{2}, ..., x_{D}] | x_{i} > 0, i = 1, 2, ..., D; \sum^{D}_{i=1} x_{i} = K \right \}\]

  2. Over-dispersed: the variance is generally greater than the mean. Possible causes of overdispersion include correlated taxa counts, clustering of subjects, and within-group heterogeneity. The negative binomial model addresses overdispersion by explicitly modeling the correlated and sparse events via a latent variable.

  3. Sparsity: microbiome data are typically overdispersed and sparse, with many zeros. When there are more zeros than standard distributions (e.g., normal, Poisson, binomial, NB, beta, or gamma) can readily fit, the data set is considered "zero-inflated" (see the toy example below).
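A small simulated example in R makes these three properties concrete (the matrix below is toy data for illustration only):

set.seed(123)
otu_tab <- matrix(rnbinom(20 * 6, mu = 20, size = 0.3), nrow = 20,
                  dimnames = list(paste0("taxon", 1:20), paste0("sample", 1:6)))
rel_tab <- sweep(otu_tab, 2, colSums(otu_tab), "/")        # compositional: each column sums to 1
colSums(rel_tab)
c(mean = mean(otu_tab), var = var(as.numeric(otu_tab)))    # over-dispersed: variance >> mean
mean(otu_tab == 0)                                         # sparse: fraction of zero counts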

13.1 ALDEx2

ALDEx2 is from (Fernandes et al. 2014).

13.1.2 Input Data Type

All input data should be raw counts:

  1. 16s rRNA sequencing (amplicon sequencing);

  2. Metagenomic sequencing (absolute abundance);

  3. RNA-seq.

13.1.3 Input Key Arguments

  1. reads: Raw counts matrix (Rows->Features; columns->Samples).

  2. conditions: A vector of group labels for testing.

  3. mc.samples: The number of Monte Carlo samples, default is 128.

  4. denom: The method used for the geometric mean calculation, default is “all”.

  5. test: The test to perform, default is “t”.

13.1.4 Procedures

  1. Drawing Monte Carlo samples from a Dirichlet distribution for each sample, using a uniform prior (addresses sparsity);

  2. Applying the CLR (centered log-ratio) transformation to each realization (addresses compositionality and overdispersion);

  3. Performing Wilcoxon tests on the transformed realizations.

Finally, the function returns the expected Benjamini-Hochberg (BH) FDR-corrected p-value for each feature, averaged across the Monte Carlo samples (sketched below).
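A minimal base-R sketch of these steps for a single Monte Carlo instance is shown below (ALDEx2 itself repeats this mc.samples times and reports the expected values); the counts matrix otu_tab (features in rows, samples in columns) and the grouping factor groups are assumed here, not part of the package:

# one Dirichlet realization per sample (uniform prior: counts + 0.5), then CLR per sample
rdirichlet1 <- function(alpha) { x <- rgamma(length(alpha), shape = alpha); x / sum(x) }
clr <- function(p) log(p) - mean(log(p))
clr_tab <- apply(otu_tab, 2, function(cnts) clr(rdirichlet1(cnts + 0.5)))

# Wilcoxon test per feature on the CLR values, then BH correction
pvals <- apply(clr_tab, 1, function(x) wilcox.test(x ~ groups)$p.value)
p.adjust(pvals, method = "BH")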

13.1.5 Key output

  1. TaxaID: taxa ID.

  2. EffectSize: effect size generated by ALDEx2, which can be regarded as a Log2FoldChange.

  3. AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.

  4. Median CLR (All/Groups): median CLR-transformed abundance computed by ALDEx2.

  5. Occurrence (All/Groups): prevalence of taxa.

13.1.6 Original Syntax

aldex(
  reads,
  conditions,
  mc.samples = 128,
  test = "t",
  effect = TRUE,
  include.sample.summary = FALSE,
  verbose = FALSE,
  denom = "all",
  iterate = FALSE,
  ...
)

13.1.7 XMAS2 Syntax

run_aldex(
    ps = NULL,
    taxa_level = NULL,
    data_otu = NULL,
    data_sam = NULL,
    group,
    group_names = NULL,
    transform = "identity",
    norm = "none",
    method = "t.test",
    p_adjust = "BH",
    pvalue_cutoff = 0.05)


# ALDEx2::aldex(reads = otu_tab,
#               conditions = sam_tab$Compvar,
#               mc.samples = 128,
#               test = test,
#               effect = TRUE,
#               include.sample.summary = FALSE,
#               denom = "all",
#               verbose = TRUE)

13.1.8 Notes

  1. Effect Size could be regarded as Log2FoldChange;

13.2 ANCOM-II

ANCOM-II is implemented from the ANCOM-II code (version 2.1), which is derived from (Mandal et al. 2015).

13.2.1 Input Data Type

All input data should be raw counts:

  1. 16s rRNA sequencing (amplicon sequencing);

  2. Metagenomic sequencing (absolute abundance);

  3. RNA-seq.

13.2.2 Input Key Arguments

  1. feature_table: Raw counts matrix (Rows->Features; columns->Samples).

  2. meta_data: data.frame with group information.

  3. struc_zero: data.frame indicating the structural-zero taxa.

  4. group_var: group variable for comparison.

  5. alpha: the significance level cutoff for testing.

13.2.3 Procedures

  1. feature_table_pre_process first identifies outlier zeros and structural zeros;
  • Outlier zeros, identified as outliers in the distribution of taxon counts within each sample grouping, are ignored during differential abundance analysis and replaced with NA.

  • Structural zeros, taxa that are absent in one grouping but present in the other, are excluded from the statistical analysis and automatically called differentially abundant.

  2. Applying an additive log-ratio (ALR) transformation with a pseudocount of 1;

  3. Performing Wilcoxon rank-sum tests for significance, with p-values FDR-corrected using the BH method;

  4. Applying a detection threshold per taxon: a taxon is called DA if the proportion of its adjusted p-values reaching nominal significance (the W ratio) exceeds the cutoff (a minimal sketch of this W statistic follows).
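A simplified base-R sketch of the W statistic is given below (illustration only: the outlier and structural-zero handling of ANCOM-II is omitted); otu_tab (features x samples) and the factor groups are assumed:

log_tab <- log(otu_tab + 1)                  # pseudocount of 1 before taking logs
n_taxa  <- nrow(log_tab)
W <- sapply(seq_len(n_taxa), function(i) {
  # additive log ratios of taxon i against every other taxon, each tested separately
  p <- sapply(seq_len(n_taxa)[-i], function(j) {
    d <- log_tab[i, ] - log_tab[j, ]
    wilcox.test(d ~ groups)$p.value
  })
  sum(p.adjust(p, method = "BH") < 0.05)     # W: number of significant per-taxon tests
})
detected <- W / (n_taxa - 1) > 0.7           # DA call based on the W ratio cutoff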

13.2.4 Key output

  1. TaxaID: taxa ID.

  2. EffectSize: the W statistic, i.e., the number of tests per taxon with q-values < alpha.

  3. W_ratio: the ratio of W to the total number of taxa.

  4. detected: whether the taxon passed the W_ratio cutoff (i.e., is called significant).

13.2.5 Original Syntax

preprocess_ANCOM <- function(
          feature_table,
          meta_data,
          sample_var = NULL,
          group_var = NULL,
          out_cut = 0.05,
          zero_cut = 0.90,
          lib_cut = 1000,
          neg_lb = FALSE)

main_ANCOM <- function(
                  feature_table,
                  meta_data,
                  struc_zero = NULL,
                  group_var = NULL,
                  p_adj_method = "BH",
                  alpha = 0.05,
                  adj_formula = NULL,
                  rand_formula = NULL,
                  w_cutoff = 0.7,
                  ...)

13.2.6 XMAS2 Syntax

run_ancom(
    ps = NULL,
    taxa_level = NULL,
    data_otu = NULL,
    data_sam = NULL,
    group,
    group_names = NULL,
    transform = "identity",
    norm = "none",
    p_adjust = "BH",
    pvalue_cutoff = 0.05,
    w_cutoff = 0.7)

# preprocess_ANCOM(
#       feature_table = otu_tab,
#       meta_data = sam_tab %>% tibble::rownames_to_column("TempRowNames"),
#       sample_var = "TempRowNames",
#       group_var = "Compvar",
#       out_cut = 0.05,
#       zero_cut = 0.90,
#       lib_cut = 1000,
#       neg_lb = FALSE)
# 
# main_ANCOM(
#       feature_table = prepro_res$feature_table,
#       meta_data = prepro_res$meta_data,
#       struc_zero = prepro_res$structure_zeros,
#       group_var = "Compvar",
#       p_adj_method = p_adjust,
#       alpha = pvalue_cutoff,
#       rand_formula = NULL,
#       w_cutoff = 0.7)

13.2.7 Notes

  1. The detected column with value TRUE marks the taxa called significant by `run_ancom`;

13.3 corncob

corncob is from (Martin, Witten, and Willis 2020).

13.3.2 Input Data Type

All input data should be raw counts:

  1. 16s rRNA sequencing (amplicon sequencing);

  2. Metagenomic sequencing (absolute abundance);

  3. RNA-seq.

13.3.3 Input Key Arguments

  1. data: a data frame containing the OTU table (counts matrix), or a phyloseq object containing the variables in the model.

  2. formula: an object of class formula without the response.

  3. test: test method, options include: “Wald” and “LRT” for two groups comparison (default: “Wald”).

  4. fdr_cutoff: cutoff of FDR.

13.3.4 Procedures

  1. Fitting taxon count abundances with a beta-binomial model, using logit link functions for both the mean and the overdispersion.

  2. Performing Wald tests for significance and BH correction to obtain FDR-adjusted p-values (a minimal sketch follows).
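The beta-binomial model with logit links can be illustrated by a small, self-contained maximum-likelihood fit (a sketch of the idea only, not the corncob implementation); the counts W of one taxon, the sample depths M, and a 0/1 group indicator g are assumed:

# negative log-likelihood of a beta-binomial with logit links for mean (mu) and overdispersion (phi)
negll <- function(par, W, M, g) {
  mu  <- plogis(par[1] + par[2] * g)    # mean depends on the group
  phi <- plogis(par[3])                 # overdispersion on the logit scale
  a <- mu * (1 - phi) / phi
  b <- (1 - mu) * (1 - phi) / phi
  -sum(lchoose(M, W) + lbeta(W + a, M - W + b) - lbeta(a, b))
}
fit <- optim(c(0, 0, -1), negll, W = W, M = M, g = g, hessian = TRUE)
se  <- sqrt(diag(solve(fit$hessian)))
2 * pnorm(-abs(fit$par[2] / se[2]))     # Wald p-value for the group effect on the mean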

13.3.5 Key output

  1. TaxaID: taxa ID.

  2. EffectSize: effect size accompanying the p-value, indicating the magnitude of the group difference.

  3. AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.

  4. Occurrence (All/Groups): prevalence of taxa.

13.3.6 Original Syntax

differentialTest(
  formula,
  phi.formula,
  formula_null,
  phi.formula_null,
  data,
  link = "logit",
  phi.link = "logit",
  test,
  boot = FALSE,
  B = 1000,
  sample_data = NULL,
  taxa_are_rows = TRUE,
  filter_discriminant = TRUE,
  fdr_cutoff = 0.05,
  fdr = "fdr",
  full_output = FALSE,
  inits = NULL,
  inits_null = NULL,
  try_only = NULL,
  ...
)

13.3.7 XMAS2 Syntax

run_corncob(
    ps = NULL,
    taxa_level = NULL,
    data_otu = NULL,
    data_sam = NULL,
    group,
    group_names = NULL,
    transform = "identity",
    norm = "none",
    method = c("Wald", "LRT"),
    p_adjust = "BH",
    pvalue_cutoff = 0.05)


# corncob::differentialTest(
#                     data = ps_normed_new,
#                     formula = my_formula,
#                     phi.formula = my_formula,
#                     phi.formula_null = my_formula,
#                     formula_null = ~ 1,
#                     test = method,
#                     boot = F,
#                     fdr_cutoff = pvalue_cutoff,
#                     fdr = p_adjust)

13.3.8 Notes

  1. Columns with Log2FoldChange are based on raw counts;

13.4 DESeq2

DESeq2 is from (Love, Huber, and Anders 2014).

13.4.1 Vignette

DESeq2

13.4.2 Input Data Type

All input data should be un-normalized raw counts:

  1. 16s rRNA sequencing (amplicon sequencing);

  2. Metagenomic sequencing (absolute abundance);

  3. RNA-seq.

13.4.3 Input Key Arguments

  1. countData: Raw counts matrix without normalization (Rows->Features; columns->Samples).

  2. colData: a DataFrame or data.frame with at least one column.

  3. design: a formula or matrix; the formula expresses how the counts for each gene depend on the variables in colData.

13.4.4 Procedures

  1. estimation of size factors (the median-of-ratios approach sketched after this list), which are used to normalize library sizes in a model-based fashion;

  2. estimation of dispersions from the negative binomial likelihood for each feature, and subsequent shrinkage of each dispersion estimate towards the parametric (default) trendline by empirical Bayes;

  3. fitting each feature to the specified class groupings with negative binomial generalized linear models and performing hypothesis testing, for which we chose the default Wald test;

  4. Finally, using the results function to obtain BH FDR-corrected p-values.
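Step 1, the median-of-ratios size factors, can be sketched in base R as follows (DESeq2 does this internally via estimateSizeFactors); a raw counts matrix otu_tab (features x samples) is assumed:

log_geo_means <- rowMeans(log(otu_tab))            # log geometric mean of each feature
size_factors <- apply(otu_tab, 2, function(cnts) {
  use <- is.finite(log_geo_means)                  # features without zeros act as the reference
  exp(median(log(cnts[use]) - log_geo_means[use]))
})
normed <- sweep(otu_tab, 2, size_factors, "/")     # counts scaled to a common library size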

13.4.5 Key output

  1. TaxaID: taxa ID.

  2. logFC: Log2FoldChange by DESeq2::DESeq.

  3. AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.

  4. Occurrence (All/Groups): prevalence of taxa.

13.4.6 Original Syntax

DESeqDataSetFromMatrix(
  countData,
  colData,
  design,
  tidy = FALSE,
  ignoreRank = FALSE,
  ...
)

13.4.7 XMAS2 Syntax

run_deseq2(
    ps = NULL,
    taxa_level = NULL,
    data_otu = NULL,
    data_sam = NULL,
    group,
    group_names = NULL,
    transform = "identity",
    norm = "none",
    p_adjust = "BH",
    pvalue_cutoff = 0.05,
    ...)

# DESeq2::DESeqDataSetFromMatrix(
#                     countData = otu_tab,
#                     colData =  sam_tab,
#                     design =~ Compvar)
# DESeq2::DESeq(ddsm)
# DESeq2::results(dds, contrast = c("Compvar", group_names))

13.5 edgeR

edgeR is from (Robinson and Oshlack 2010).

13.5.1 Vignette

edgeR

13.5.2 Input Data Type

All input data should be raw counts:

  1. 16s rRNA sequencing (amplicon sequencing);

  2. Metagenomic sequencing (absolute abundance);

  3. RNA-seq.

13.5.3 Input Key Arguments

  1. counts: Raw counts matrix (Rows->Features; columns->Samples) for the DGEList.

  2. design: A vector of group labels for testing.

  3. contrast: A string for comparison groups.

13.5.4 Procedures

  1. Adding a pseudocount of 1 to the non-rarefied feature table and using the calcNormFactors function from the edgeR package to compute relative log expression (RLE) normalization factors;

  2. Estimating negative binomial dispersion parameters with estimateCommonDisp followed by estimateTagwiseDisp, which shrinks feature-wise dispersion estimates through an empirical Bayes approach;

  3. Running exactTest for negative binomial data to identify features that differ between the specified groups;

  4. Correcting the resulting p-values for multiple testing with the BH method via topTags (the full pipeline is sketched below).
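The classic pipeline described above can be sketched as follows, assuming a raw counts matrix otu_tab (features in rows) and a two-level factor groups:

dge <- edgeR::DGEList(counts = otu_tab + 1, group = groups)   # pseudocount of 1
dge <- edgeR::calcNormFactors(dge, method = "RLE")            # relative log expression factors
dge <- edgeR::estimateCommonDisp(dge)                         # common NB dispersion
dge <- edgeR::estimateTagwiseDisp(dge)                        # empirical Bayes shrinkage per feature
et  <- edgeR::exactTest(dge)                                  # exact NB test between the two groups
res <- edgeR::topTags(et, n = nrow(otu_tab), adjust.method = "BH")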

13.5.5 Key output

  1. TaxaID: taxa ID.

  2. logFC: Log2FoldChange by edgeR::topTags.

  3. AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.

  4. Occurrence (All/Groups): prevalence of taxa.

13.5.6 Original Syntax

# DGEList object & normalized factor

# GLM estimates of dispersion

# GLM testing for differential expression

13.5.7 XMAS2 Syntax

run_edger(
    ps = NULL,
    taxa_level = NULL,
    data_otu = NULL,
    data_sam = NULL,
    group,
    group_names = NULL,
    transform = "identity",
    norm = "none",
    p_adjust = "BH",
    pvalue_cutoff = 0.05,
    ...)


# # Filter data
# keep <- rowSums(edgeR::cpm(otu_tab) > 100) >= 2
# otu_tab <- otu_tab[keep, ]
# 
# # DGEList object: otu_tab -> colnames:Samples; rownames:Taxa
# dge_obj <- edgeR::DGEList(counts = otu_tab)
# dge_obj_factors <- edgeR::calcNormFactors(dge_obj)
# limma::plotMDS(dge_obj_factors, method = "bcv", col = as.numeric(sam_tab$Compvar))
# graphics::legend("bottomleft", group_names, col = 1:2, pch = 20)
# 
# # GLM estimates of dispersion
# dge <- edgeR::estimateGLMCommonDisp(dge_obj_factors, design)
# dge <- edgeR::estimateGLMTrendedDisp(dge, design, method = "power")
# 
# # You can change method to "auto", "bin.spline", "power", "spline", "bin.loess".
# # The default is "auto" which chooses "bin.spline" when > 200 tags and "power" otherwise.
# dge <- edgeR::estimateGLMTagwiseDisp(dge, design)
# 
# # GLM testing for differential expression:
# fit <- edgeR::glmFit(dge, design)
# fit2 <- edgeR::glmLRT(fit, contrast = contrast) # Normal:-1; Tumor:1
# edgeR_res <- edgeR::topTags(fit2, n = nrow(otu_tab)) %>% data.frame()

13.5.8 Notes

  1. Effect Size could be regarded as Log2FoldChange;

13.6 LEfSe

LEfSe is from (Segata et al. 2011).

13.6.2 Input Data Type

The input data can be either raw counts or relative abundances:

  1. 16s rRNA sequencing (amplicon sequencing);

  2. Metagenomic sequencing (absolute abundance);

  3. RNA-seq.

13.6.3 Input Key Arguments

  1. assays: data.frame or matrix of features (Rows->Features; columns->Samples).

  2. colData: data.frame or matrix with group information.

  3. kruskal.threshold: cutoff for the Kruskal-Wallis test p-value.

  4. wilcox.threshold: cutoff for the Wilcoxon test p-value.

  5. lda.threshold: cutoff for the LDA score.

13.6.4 Procedures

  1. Normalizing the matrix using CPM: per-sample scaling so that the values in each sample sum to 1e+06;

  2. Performing a Kruskal-Wallis test (which in the two-group case reduces to the Wilcoxon rank-sum test) to identify potential differentially abundant features;

  3. Using the differentially abundant features to perform a linear discriminant analysis (LDA) of class labels on abundances, which estimates the effect size of each significant feature;

  4. Calling as DA only the features whose scaled LDA scores exceed the threshold (default: 2.0); a simplified sketch follows.
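A highly simplified sketch of steps 1-3 is shown below (the real LEfSe additionally bootstraps the LDA, handles subclasses, and rescales the effect size into the reported log10 LDA score); otu_tab (features x samples) and the factor groups are assumed:

cpm_tab <- sweep(otu_tab, 2, colSums(otu_tab), "/") * 1e6      # per-sample scaling to 1e+06
kw_p <- apply(cpm_tab, 1, function(x) kruskal.test(x, groups)$p.value)
sig  <- kw_p < 0.05                                            # candidate DA features
lda_fit <- MASS::lda(t(cpm_tab[sig, , drop = FALSE]), grouping = groups)
lda_fit$scaling                                                # raw discriminant coefficients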

13.6.5 Key output

  1. TaxaID: taxa ID.

  2. LDA score: LDA score to determine the significant taxa.

13.6.6 Original Syntax

lefser(expr,
       kruskal.threshold = 0.05,
       wilcox.threshold = 0.05,
       lda.threshold = 2.0,
       groupCol = "GROUP",
       blockCol = NULL,
       assay = 1L,
       trim.names = FALSE)

13.6.7 XMAS2 Syntax

run_lefse(
    ps = NULL,
    taxa_level = NULL,
    data_otu = NULL,
    data_sam = NULL,
    group,
    group_names = NULL,
    transform = "identity",
    norm = "CPM",
    wl.p = 0.05,
    Lda = 2)

run_lefse2(
    ps = NULL,
    taxa_level = NULL,
    data_otu = NULL,
    data_sam = NULL,
    group,
    group_names = NULL,
    subgroup = NULL,
    transform = c("identity", "log10", "log10p"),
    norm = "CPM",
    kw_cutoff = 0.05,
    lda_cutoff = 2,
    bootstrap_n = 30,
    bootstrap_fraction = 2 / 3,
    wilcoxon_cutoff = 0.05,
    multigrp_strat = FALSE,
    strict = c("0", "1", "2"),
    sample_min = 10,
    only_same_subgrp = FALSE,
    curv = FALSE)

# se_object <- SummarizedExperiment::SummarizedExperiment(
#                         assays = list(counts = otu_tab),
#                         colData = sam_tab,
#                         metadata = "Profile")
# 
# lefse_res <- lefser(se_object,
#                     kruskal.threshold = 0.05,
#                     wilcox.threshold  = wl.p,
#                     lda.threshold     = Lda,
#                     groupCol = "Compvar",
#                     blockCol = NULL,
#                     assay    = 1L,
#                     trim.names = TRUE)

13.6.8 Notes

  1. Effect Size could be regarded as Log2FoldChange;

13.7 limma voom

limma-voom is from (Law et al. 2014).

13.7.1 Vignette

limma

13.7.2 Input Data Type

All input data should be raw counts:

  1. 16s rRNA sequencing (amplicon sequencing);

  2. Metagenomic sequencing (absolute abundance);

  3. RNA-seq.

13.7.3 Input Key Arguments

  1. counts: a numeric matrix containing raw counts (Rows->Features; columns->Samples).

  2. design: design matrix with rows corresponding to samples and columns to coefficients to be estimated.

  3. lib.size: numeric vector containing total library sizes for each sample.

  4. span: width of the smoothing window used for the lowess mean-variance trend.

13.7.4 Procedures

  1. Normalizing the table by the trimmed mean of M-values (TMM), or TMM with singleton pairing (TMMwsp) for highly sparse data, via the calcNormFactors function;

  2. After normalization, using the limma R package function voom to convert normalized counts to log2-counts-per-million and assign precision weights to each observation based on the mean-variance trend;

  3. Using the functions lmFit, eBayes, and topTable in the limma R package to fit weighted linear regression models, perform tests based on an empirical Bayes moderated t-statistic, and obtain BH FDR-corrected p-values (the normalization and voom steps are sketched below).
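The TMM normalization feeding into voom (steps 1-2) can be sketched as follows; otu_tab and design are assumed, and the lmFit/eBayes/topTable calls are as in the Original Syntax below:

dge <- edgeR::DGEList(counts = otu_tab)
dge <- edgeR::calcNormFactors(dge, method = "TMM")              # or "TMMwsp" for highly sparse data
lib_size <- dge$samples$lib.size * dge$samples$norm.factors     # effective library sizes
voom_out <- limma::voom(otu_tab, design = design, lib.size = lib_size)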

13.7.5 Key output

  1. TaxaID: taxa ID.

  2. EffectSize and logFC: both equal the Log2FoldChange.

  3. AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.

  4. Occurrence (All/Groups): prevalence of taxa.

13.7.6 Original Syntax

fit_out <- limma::lmFit(voom_out, design = design)
test_out <- limma::eBayes(fit_out)
test_df <- limma::topTable(
    test_out,
    number = nrow(otu_tab),
    adjust.method = p_adjust)

13.7.7 XMAS2 Syntax

run_limma_voom(
    ps = NULL,
    taxa_level = NULL,
    data_otu = NULL,
    data_sam = NULL,
    group,
    group_names = NULL,
    transform = "identity",
    norm = "none",
    voom_span = 0.5,
    p_adjust = "BH",
    pvalue_cutoff = 0.05,
    ...)


# # voom fitting
# voom_out <- limma::voom(
#     otu_tab,
#     design = design,
#     lib.size = lib_size,
#     span = voom_span)
# 
# # linear model
# fit_out <- limma::lmFit(voom_out, design = design)
# test_out <- limma::eBayes(fit_out)
# test_df <- limma::topTable(
#     test_out,
#     number = nrow(otu_tab),
#     adjust.method = p_adjust)

13.7.8 Notes

  1. Effect Size could be regarded as Log2FoldChange;

13.8 MaAsLin2

MaAsLin2 is from (Mallick et al. 2021).

13.8.2 Input Data Type

The input data can be either raw counts or relative abundances:

  1. 16s rRNA sequencing (amplicon sequencing);

  2. Metagenomic sequencing (absolute abundance);

  3. RNA-seq.

13.8.3 Input Key Arguments

  1. input_data: The tab-delimited input file of features (Rows->Features; columns->Samples).

  2. input_metadata: The tab-delimited input file of metadata.

  3. normalization: The normalization method to apply.

  4. transform: The transform to apply.

  5. analysis_method: The analysis method for linear regression to apply.

  6. random_effects: The random effects for the model.

  7. fixed_effects: The fixed effects for the model, such as the groups to compare.

13.8.4 Procedures

  1. Applying the arcsine square-root transformation (AST) and total sum scaling (TSS) normalization;

  2. Fitting a linear model (when no random effects are chosen) of each feature's transformed abundance on the specified sample grouping, testing significance with a Wald test, and outputting BH FDR-corrected p-values (a minimal sketch follows).
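A minimal base-R sketch of this TSS + AST + per-feature linear model workflow is given below (the actual Maaslin2 call appears in the XMAS2 Syntax section); otu_tab (features x samples) and a sample data.frame sam_tab with a Compvar grouping column are assumed:

tss <- sweep(otu_tab, 2, colSums(otu_tab), "/")        # total sum scaling (relative abundance)
ast <- asin(sqrt(tss))                                 # arcsine square-root transformation
res <- t(apply(ast, 1, function(y) {
  coefs <- summary(lm(y ~ sam_tab$Compvar))$coefficients
  coefs[2, c("Estimate", "Pr(>|t|)")]                  # group coefficient and its p-value
}))
adj_p <- p.adjust(res[, "Pr(>|t|)"], method = "BH")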

13.8.5 Key output

  1. TaxaID: taxa ID.

  2. AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.

  3. Occurrence (All/Groups): prevalence of taxa.

13.8.6 Original Syntax

Maaslin2(
    input_data,
    input_metadata,
    output,
    min_abundance = 0.0,
    min_prevalence = 0.1,
    min_variance = 0.0,
    normalization = "TSS",
    transform = "LOG",
    analysis_method = "LM",
    max_significance = 0.25,
    random_effects = NULL,
    fixed_effects = NULL,
    correction = "BH",
    standardize = TRUE,
    cores = 1,
    plot_heatmap = TRUE,
    plot_scatter = TRUE,
    heatmap_first_n = 50,
    reference = NULL
)

13.8.7 XMAS2 Syntax

run_maaslin2(
    ps = NULL,
    taxa_level = NULL,
    data_otu = NULL,
    data_sam = NULL,
    group,
    group_names = NULL,
    transform = "LOG",
    norm = "TSS",
    outdir = "./demo_output",
    method = "LM",
    p_adjust = "BH",
    pvalue_cutoff = 0.05)

# Maaslin2(input_data = otu_tab %>% t() %>% data.frame(), # row->sampleID; col->Taxa
#          input_metadata = sam_tab,
#          output = outdir,
#          normalization = norm,
#          transform = transform,
#          analysis_method = method,
#          max_significance = pvalue_cutoff,
#          fixed_effects = "Compvar",
#          correction = p_adjust,
#          plot_heatmap = FALSE,
#          plot_scatter = FALSE,
#          cores = 2)

13.9 metagenomeSeq

metagenomeSeq is from (Paulson et al. 2013).

13.9.1 Vignette

metagenomeSeq

13.9.2 Input Data Type

All input data should be raw counts:

  1. 16s rRNA sequencing (amplicon sequencing);

  2. Metagenomic sequencing (absolute abundance);

  3. RNA-seq.

13.9.3 Input Key Arguments

  1. counts: Raw counts matrix (Rows->Features; columns->Samples) used to build the MRexperiment object.

  2. phenoData: an AnnotatedDataFrame holding the sample (group) information.

  3. obj: the MRexperiment object containing the counts and sample data.

  4. mod: the model (design) matrix for the fitting function, including the group variable.

13.9.4 Procedures

  1. Converting the counts and sample information into newMRexperiment object by the function newMRexperiment from the metagenomeSeq R package;

  2. Using cumNormStat and cumNorm to apply cumulative sum-scaling normalization (CSS), which attempts to normalize sequence counts based on the lower-quartile abundance of features;

  3. Using fitFeatureModel to fit the normalized feature counts with zero-inflated log-normal models (with a pseudocount of 1 added prior to log2 transformation), perform empirical Bayes moderated t-tests, and obtain BH FDR-corrected p-values via MRfulltable (the workflow is sketched below).
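A sketch of this workflow, assuming a raw counts matrix otu_tab (features x samples) and a sample data.frame sam_tab containing a Compvar grouping column:

mgs <- metagenomeSeq::newMRexperiment(
    counts = otu_tab,
    phenoData = Biobase::AnnotatedDataFrame(sam_tab))
p_css <- metagenomeSeq::cumNormStat(mgs)               # data-derived percentile for CSS
mgs   <- metagenomeSeq::cumNorm(mgs, p = p_css)        # cumulative sum scaling normalization
mod   <- model.matrix(~ Compvar, data = Biobase::pData(mgs))
fit   <- metagenomeSeq::fitFeatureModel(mgs, mod)      # zero-inflated log-normal model per feature
res   <- metagenomeSeq::MRfulltable(fit, number = nrow(otu_tab))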

13.9.5 Key output

  1. TaxaID: taxa ID.

  2. AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.

13.9.6 Original Syntax

metagenomeSeq::fitFeatureModel(mgs_summarized, mod)

metagenomeSeq::fitZig(mgs_summarized, mod)

13.9.7 XMAS2 Syntax

run_metagenomeseq(
    ps = NULL,
    taxa_level = NULL,
    data_otu = NULL,
    data_sam = NULL,
    group,
    group_names = NULL,
    transform = "identity",
    norm = "CSS",
    method = c("ZILN", "ZIG"),
    p_adjust = "BH",
    pvalue_cutoff = 0.05,
    ...)


# # run DA
# if (method == "ZILN") {
#   fit <- metagenomeSeq::fitFeatureModel(mgs_summarized, mod)
# } else {
#   fit <- metagenomeSeq::fitZig(mgs_summarized, mod)
# }
# 
# # metagenomeSeq vignette: We recommend the user remove features based on
# # the number of estimated effective samples, please see
# # calculateEffectiveSamples. We recommend removing features with less
# # than the average number of effective samples in all features. In
# # essence, setting eff = .5 when using MRcoefs, MRfulltable, or MRtable.
# res <- metagenomeSeq::MRcoefs(
#     fit,
#     number = ntaxa(ps_normed_new),
#     adjustMethod = p_adjust,
#     group = 3,
#     eff = 0.5)
#   res <- dplyr::rename(
#     res,
#     Pvalue = .data$pvalues,
#     AdjustedPvalue = .data$adjPvalues)

13.10 t-test

13.10.1 Input Data Type

The input data can be either raw counts or relative abundances:

  1. 16s rRNA sequencing (amplicon sequencing);

  2. Metagenomic sequencing (absolute abundance);

  3. RNA-seq.

13.10.2 Input Key Arguments

  1. x: Raw counts matrix (Rows->Features; columns->Samples).

  2. y: A vector of group labels for testing.

13.10.3 Procedures

  1. Optionally applying total sum scaling (TSS) normalization to the feature table;

  2. Performing an unpaired Welch’s t-test for each feature to compare the specified groups;

  3. Correcting the resulting p-values with the BH method (sketched below).
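A minimal sketch of these steps, assuming a counts matrix otu_tab (features x samples) and a two-level factor groups:

rel_tab <- sweep(otu_tab, 2, colSums(otu_tab), "/")                    # TSS normalization
pvals   <- apply(rel_tab, 1, function(x) t.test(x ~ groups)$p.value)   # Welch's t-test per feature
adj_p   <- p.adjust(pvals, method = "BH")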

13.10.4 Key output

  1. TaxaID: taxa ID.

  2. AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.

  3. Occurrence (All/Groups): prevalence of taxa.

13.10.5 Original Syntax

t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)

13.10.6 XMAS2 Syntax

run_ttest(
    ps = NULL,
    taxa_level = NULL,
    data_otu = NULL,
    data_sam = NULL,
    group,
    group_names = NULL,
    transform = "identity",
    norm = "none",
    p_adjust = "BH",
    pvalue_cutoff = 0.05,
    paired = FALSE,
    paired_column = NULL)


# # run t test
#   if (paired) {
#     if (!is.null(paired_column)) {
#       res <- paired_t(x = otu_tab,
#                       y = sam_tab,
#                       group = "Compvar",
#                       gnames = group_names,
#                       paired = paired,
#                       PID = paired_column,
#                       p_ad = p_adjust)
#     } else {
#       stop("Please provide the paired column for paired test")
#     }
# 
#   } else {
#     res <- apply(otu_tab, 1, function(x, y) {
#       dat <- data.frame(value = as.numeric(x), group = y)
#       rest <- t.test(data = dat, value ~ group)
#       res <- c(rest$statistic, rest$p.value)
#       return(res)
#     }, sam_tab$Compvar) %>%
#       t() %>% data.frame() %>%
#       stats::setNames(c("Statistic", "Pvalue")) %>%
#       tibble::rownames_to_column("TaxaID") %>%
#       dplyr::mutate(AdjustedPvalue = p.adjust(as.numeric(Pvalue), method = p_adjust))
#   }

13.11 Wilcoxon test

13.11.1 Input Data Type

The input data can be either raw counts or relative abundances:

  1. 16s rRNA sequencing (amplicon sequencing);

  2. Metagenomic sequencing (absolute abundance);

  3. RNA-seq.

13.11.2 Input Key Arguments

  1. x: Raw counts matrix (Rows->Features; columns->Samples).

  2. y: A vector of group labels for testing.

13.11.3 Procedures

  1. Optionally applying a CLR transformation (after adding a pseudocount of 1) to the abundances (see the sketch after this list);

  2. Performing Wilcoxon rank-sum tests for each feature to compare the specified sample groupings;

  3. Correcting the resulting p-values with the BH method.
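A minimal sketch of the CLR option followed by per-feature Wilcoxon rank-sum tests, assuming a counts matrix otu_tab (features x samples) and a two-level factor groups:

clr_tab <- apply(otu_tab + 1, 2, function(x) log(x) - mean(log(x)))     # CLR per sample
pvals   <- apply(clr_tab, 1, function(x) wilcox.test(x ~ groups)$p.value)
adj_p   <- p.adjust(pvals, method = "BH")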

13.11.4 Key output

  1. TaxaID: taxa ID.

  2. AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.

  3. Occurrence (All/Groups): prevalence of taxa.

13.11.5 Original Syntax

wilcox.test(x, y = NULL,
            alternative = c("two.sided", "less", "greater"),
            mu = 0, paired = FALSE, exact = NULL, correct = TRUE,
            conf.int = FALSE, conf.level = 0.95,
            tol.root = 1e-4, digits.rank = Inf, ...)

13.11.6 XMAS2 Syntax

run_wilcox(
    ps = NULL,
    taxa_level = NULL,
    data_otu = NULL,
    data_sam = NULL,
    group,
    group_names = NULL,
    transform = "identity",
    norm = "none",
    p_adjust = "BH",
    pvalue_cutoff = 0.05,
    paired = FALSE,
    paired_column = NULL)

# # run wilcox rank sum test
#   if (paired) {
#     if (!is.null(paired_column)) {
#       res <- paired_wilcox(x = otu_tab,
#                       y = sam_tab,
#                       group = "Compvar",
#                       gnames = group_names,
#                       paired = paired,
#                       PID = paired_column,
#                       p_ad = p_adjust)
#     } else {
#       stop("Please provide the paired column for paired test")
#     }
# 
#   } else {
#   res <- apply(otu_tab, 1, function(x, y) {
#       dat <- data.frame(value = as.numeric(x), group = y)
#       rest <- wilcox.test(data = dat, value ~ group)
#       res <- c(rest$statistic, rest$p.value)
#       return(res)
#     }, sam_tab$Compvar) %>%
#     t() %>% data.frame() %>%
#     stats::setNames(c("Statistic", "Pvalue")) %>%
#     tibble::rownames_to_column("TaxaID") %>%
#     dplyr::mutate(AdjustedPvalue = p.adjust(as.numeric(Pvalue), method = p_adjust))
#   }

13.12 RAIDA

RAIDA is from (Sohn, Du, and An 2015).

13.12.2 Input Data Type

All input data should be raw counts:

  1. 16s rRNA sequencing (amplicon sequencing);

  2. Metagenomic sequencing (absolute abundance);

  3. RNA-seq.

13.12.3 Input Key Arguments

  1. c.data: a data frame containing counts for both conditions.

  2. n.lib: a vector containing the numbers of samples for both conditions.

13.12.4 Procedures

RAIDA (Ratio Approach for Identifying Differential Abundance) is a robust approach for identifying differentially abundant features in metagenomic samples across different conditions. It uses the ratios between features within a modified zero-inflated log-normal model.

  1. Utilizing the ratios between the counts of features in each sample, which eliminates possible problems associated with counts on different scales within and between conditions;

  2. Fitting the ratios, including zeros, with a modified zero-inflated log-normal (ZIL) model via an EM algorithm (addresses sparsity);

  3. Constructing a moderated t-statistic for the log ratio of each feature \(y_{ij}\) from the estimated mean and variance, and obtaining \(P\) values for the null hypotheses for all features;

  4. Adjusting the \(P\) values with a multiple testing correction method (BH by default).

13.12.5 Key output

  1. TaxaID: taxa ID.

  2. AdjustedPvalue: Adjusted Pvalue by stats::p.adjust.

  3. Occurrence (All/Groups): prevalence of taxa.

13.12.6 Original Syntax

raida(c.data, 
      n.lib, 
      show.ref.features = FALSE, 
      show.all.features = FALSE, 
      mtcm = "BH", 
      zsc = 0)

13.12.7 XMAS2 Syntax

run_raida(
    ps = NULL,
    taxa_level = NULL,
    data_otu = NULL,
    data_sam = NULL,
    group,
    group_names = NULL,
    transform = "identity",
    norm = "none",
    p_adjust = "BH",
    pvalue_cutoff = 0.05)

# RAIDA::raida(c.data = otu_tab_reorder, n.lib = n.samples)

13.13 Summary

  1. The most commonly assumed data distribution was the negative binomial distribution (DESeq2, edgeR, omnibus, mbzinb).

No data transformations were performed for the negative binomial methods or for metagenomeSeq to bring the data to normality, since non-normality is already accounted for in their statistical models.

  2. The remaining parametric methods (ALDEx2 t-test, t-test, limma-voom) all used statistical tests that assume a Gaussian distribution of the data; transformations (here, a log transform of some kind) were therefore needed before analysis.

  3. Three methods (ALDEx2 Wilcoxon, ANCOM-II, LEfSe) were considered non-parametric (assuming no underlying distribution of the data), as they used statistical tests that transform data to ranks.

  4. Two of the four negative binomial methods (DESeq2, edgeR) calculated scaling factors for each sample to account for uneven sequencing depth.

  5. Cumulative sum scaling (CSS) was used for both metagenomeSeq and MaAsLin2. Total sum scaling (TSS; also referred to as relative abundance) was performed for LEfSe and for methods that did not have a built-in normalization function (t-test), as this strategy is commonly used in the literature. Log-ratio based transformations were used for the ALDEx2 methods and ANCOM and, in addition to TSS, were applied to methods without a built-in normalization function.

13.14 Advantages and disadvantages

  1. limma-voom, edgeR, Wilcoxon (CLR), and LEfSe tend to output a high number of significant ASVs on average, while ALDEx2 and ANCOM-II tend to identify only a relatively small number of ASVs as significant;

  2. limma-voom, edgeR, Wilcoxon (CLR), and LEfSe have previously been found to exhibit a high FDR; in contrast, ANCOM appropriately controls the FDR;

  3. We can clearly recommend that users avoid using edgeR (a tool primarily intended for RNA-seq data) as well as LEfSe (without p-value correction) for conducting DA testing with 16S rRNA gene data.

  4. Users should also be aware that limma voom and the Wilcoxon (CLR) approaches may perform poorly on unfiltered data that is highly sparse.

  5. More generally, we recommend that users employ several methods and focus on significant features identified by most tools.

References

Fernandes, Andrew D, Jennifer Ns Reid, Jean M Macklaim, Thomas A McMurrough, David R Edgell, and Gregory B Gloor. 2014. “Unifying the Analysis of High-Throughput Sequencing Datasets: Characterizing RNA-Seq, 16S rRNA Gene Sequencing and Selective Growth Experiments by Compositional Data Analysis.” Microbiome 2 (1): 1–13.
Law, Charity W, Yunshun Chen, Wei Shi, and Gordon K Smyth. 2014. “Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-Seq Read Counts.” Genome Biology 15 (2): 1–17.
Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15 (12): 1–21.
Mallick, Himel, Ali Rahnavard, Lauren J McIver, Siyuan Ma, Yancong Zhang, Long H Nguyen, Timothy L Tickle, et al. 2021. “Multivariable Association Discovery in Population-Scale Meta-Omics Studies.” PLoS Computational Biology 17 (11): e1009442.
Mandal, Siddhartha, Will Van Treuren, Richard A White, Merete Eggesbø, Rob Knight, and Shyamal D Peddada. 2015. “Analysis of Composition of Microbiomes: A Novel Method for Studying Microbial Composition.” Microbial Ecology in Health and Disease 26 (1): 27663.
Martin, Bryan D, Daniela Witten, and Amy D Willis. 2020. “Modeling Microbial Abundances and Dysbiosis with Beta-Binomial Regression.” The Annals of Applied Statistics 14 (1): 94.
Nearing, Jacob T, Gavin M Douglas, Molly G Hayes, Jocelyn MacDonald, Dhwani K Desai, Nicole Allward, Casey Jones, et al. 2022. “Microbiome Differential Abundance Methods Produce Different Results Across 38 Datasets.” Nature Communications 13 (1): 1–16.
Paulson, Joseph N, O Colin Stine, Héctor Corrada Bravo, and Mihai Pop. 2013. “Differential Abundance Analysis for Microbial Marker-Gene Surveys.” Nature Methods 10 (12): 1200–1202.
Robinson, Mark D, and Alicia Oshlack. 2010. “A Scaling Normalization Method for Differential Expression Analysis of RNA-Seq Data.” Genome Biology 11 (3): 1–9.
Segata, Nicola, Jacques Izard, Levi Waldron, Dirk Gevers, Larisa Miropolsky, Wendy S Garrett, and Curtis Huttenhower. 2011. “Metagenomic Biomarker Discovery and Explanation.” Genome Biology 12 (6): 1–18.
Sohn, Michael B, Ruofei Du, and Lingling An. 2015. “A Robust Approach for Identifying Differentially Abundant Features in Metagenomic Samples.” Bioinformatics 31 (14): 2269–75.