# Chapter 13 Principals of Differential Analysis methods

(Nearing et al. 2022) has showed that the multiple DA methods have different performances on microbiota (16s). Here, we integrated the its methods’ part and our own understandings to give a draft whole picture for users who using DA of XMAS2.

Microbiota data have the following characteristics:

**Compositional**: Mathematically, a data is defined as compositional, if it contains D multiple parts of nonnegative numbers whose sum is 1 or any constant-sum constraint. It can be formally stated as:

\[S^{D} = \left \{ X = [x_{1}, x_{2}, ..., x_{D}] | x_{i} > 0, i = 1, 2, ..., D; \sum^{D}_{i=1} x_{i} = K \right \}\]

**Over-Dispersed**: The variance are generally greater than the mean. There are several possible causes of overdispersion: correlated taxa counts, clustering of subjects, and within group heterogeneity. Negative Binomial Model addresses overdispersion by by explicitly modeling the correlated and sparse events via a latent variable.**Sparsity**: Microbiome data typically is overdispersed and sparse with many zeros. When the number of zeros is excess than the standard distributions (e.g., normal, Poisson, binomial, NB, beta or gamma) can be readily fit, the data set is considered as ’zero inflate.

## 13.1 ALDEx2

ALDEx2 is from (Fernandes et al. 2014)

### 13.1.2 Input Data Type

All the input data type should be **Raw Counts**:

16s rRNA sequencing (amplicon sequencing);

Metagenomic sequencing (absolute abundance);

RNA-seq.

### 13.1.3 Input Key Arguments

**reads**: Raw counts matrix (Rows->Features; columns->Samples).**conditions**: A vector of group labels for testing.**mc.samples**: The number of Monte Carlo samples, default is 128.**denom**: The methods for Geometric Mean calculation, default is “all”.**test**: The test to perform, default is “t.test”.

### 13.1.4 Procedures

Monte Carlo samples of Dirichlet distributions for each sample, using a uniform prior; (Sparsity)

Performing CLR (log-ratio) transformation of each realization; (Compositional & Overdispersion)

Performing Wilcoxon tests on the transformed realizations.

Finally, the function returned the expected Benjamini-Hochberg (BH) FDR-corrected p-value for each feature based on the results the different across Monte Carlo samples.

### 13.1.5 Key output

**TaxaID**: taxa ID.**EffectSize**: Effect Size generated by`Aldex2`

, which equals to Log2FoldChange.**AdjustedPvalue**: Adjusted Pvalue by`stats::p.adjust`

.**Median CLR (All/Groups)**: median value of CLR abundance normalized by`Aldex2`

.**Occurrence (All/Groups)**: prevalence of taxa.

### 13.1.7 XMAS2 Syntax

```
run_aldex(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
method = "t.test",
p_adjust = "BH",
pvalue_cutoff = 0.05)
# ALDEx2::aldex(reads = otu_tab,
# conditions = sam_tab$Compvar,
# mc.samples = 128,
# test = test,
# effect = TRUE,
# include.sample.summary = FALSE,
# denom = "all",
# verbose = TRUE)
```

## 13.2 ANCOM-II

ANCOM-II is from ANCOM-II (version 2.1) and it derived from (Mandal et al. 2015).

### 13.2.1 Input Data Type

All the input data type should be **Raw Counts**:

16s rRNA sequencing (amplicon sequencing);

Metagenomic sequencing (absolute abundance);

RNA-seq.

### 13.2.2 Input Key Arguments

**feature_table**: Raw counts matrix (Rows->Features; columns->Samples).**meta_data**: data.frame with group information.**struc_zero**: data.frame with structure zero taxa.**group_var**: group variable for comparison.**alpha**: the significant level cutoff for testing.

### 13.2.3 Procedures

`feature_table_pre_process`

firstly identify outlier zeros and structural zeros;

Outlier zeros, identified by finding outliers in the distribution of taxon counts within each sample grouping, were ignored during differential abundance analysis, and replaced with NA.

Structural zeros, taxa that were absent in one grouping but present in the other, were ignored during data analysis and automatically called as differential abundant.

additive log-ratios for transformation with a pseudo count of 1;

Wilcoxon rank-sum tests for significance, and p-values were FDR-corrected using the BH method;

Detection threshold for Ajusted-pvalue per taxa, if the number (ratio) of Ajusted-pvalue reaches nominal significance would be called as DA taxa.

### 13.2.4 Key output

**TaxaID**: taxa ID.**EffectSize**: Effect Size to evaluate the quality of*(W)q-values < alpha*.**W_ratio**: the ratio of significant taxa compared to totoal number of taxa.**detected**: the taxa is significant passed the W_ratio cutoff.

### 13.2.5 Original Syntax

```
preprocess_ANCOM <- function(
feature_table,
meta_data,
sample_var = NULL,
group_var = NULL,
out_cut = 0.05,
zero_cut = 0.90,
lib_cut = 1000,
neg_lb = FALSE)
main_ANCOM <- function(
feature_table,
meta_data,
struc_zero = NULL,
group_var = NULL,
p_adj_method = "BH",
alpha = 0.05,
adj_formula = NULL,
rand_formula = NULL,
w_cutoff = 0.7,
...)
```

### 13.2.6 XMAS2 Syntax

```
run_ancom(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
p_adjust = "BH",
pvalue_cutoff = 0.05,
w_cutoff = 0.7)
# preprocess_ANCOM(
# feature_table = otu_tab,
# meta_data = sam_tab %>% tibble::rownames_to_column("TempRowNames"),
# sample_var = "TempRowNames",
# group_var = "Compvar",
# out_cut = 0.05,
# zero_cut = 0.90,
# lib_cut = 1000,
# neg_lb = FALSE)
#
# main_ANCOM(
# feature_table = prepro_res$feature_table,
# meta_data = prepro_res$meta_data,
# struc_zero = prepro_res$structure_zeros,
# group_var = "Compvar",
# p_adj_method = p_adjust,
# alpha = pvalue_cutoff,
# rand_formula = NULL,
# w_cutoff = 0.7)
```

## 13.3 corncob

corncob is from (Martin, Witten, and Willis 2020)

### 13.3.2 Input Data Type

All the input data type should be **Raw Counts**:

16s rRNA sequencing (amplicon sequencing);

Metagenomic sequencing (absolute abundance);

RNA-seq.

### 13.3.3 Input Key Arguments

**data**: a data frame containing the OTU table, or phyloseq object containing the variables in the models (counts matrix).**formula**: an object of class formula without the response.**test**: test method, options include: “Wald” and “LRT” for two groups comparison (default: “Wald”).**fdr_cutoff**: cutoff of FDR.

### 13.3.4 Procedures

Taxon count abundance fit to a

**beta-binomial model**, using logit link functions for both the mean and overdispersion.Wald tests for significance testing, and BH for FDR-corrected p-values.

### 13.3.5 Key output

**TaxaID**: taxa ID.**EffectSize**: Effect Size to evaluate the power of pvalue.**AdjustedPvalue**: Adjusted Pvalue by`stats::p.adjust`

.**Occurrence (All/Groups)**: prevalence of taxa.

### 13.3.6 Original Syntax

```
differentialTest(
formula,
phi.formula,
formula_null,
phi.formula_null,
data,
link = "logit",
phi.link = "logit",
test,
boot = FALSE,
B = 1000,
sample_data = NULL,
taxa_are_rows = TRUE,
filter_discriminant = TRUE,
fdr_cutoff = 0.05,
fdr = "fdr",
full_output = FALSE,
inits = NULL,
inits_null = NULL,
try_only = NULL,
...
)
```

### 13.3.7 XMAS2 Syntax

```
run_corncob(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
method = c("Wald", "LRT"),
p_adjust = "BH",
pvalue_cutoff = 0.05)
# corncob::differentialTest(
# data = ps_normed_new,
# formula = my_formula,
# phi.formula = my_formula,
# phi.formula_null = my_formula,
# formula_null = ~ 1,
# test = method,
# boot = F,
# fdr_cutoff = pvalue_cutoff,
# fdr = p_adjust)
```

## 13.4 DESeq2

DESeq2 is from (Love, Huber, and Anders 2014)

### 13.4.2 Input Data Type

All the input data type should be **un-normalized raw counts**:

16s rRNA sequencing (amplicon sequencing);

Metagenomic sequencing (absolute abundance);

RNA-seq.

### 13.4.3 Input Key Arguments

**countData**: Raw counts matrix without normalization (Rows->Features; columns->Samples).**colData**: a DataFrame or data.frame with at least a single column..**design**: a formula or matrix. the formula expresses how the counts for each gene depend on the variables in colData.

### 13.4.4 Procedures

estimation of size factors, which are used to normalize library sizes in a model-based fashion;

estimation of dispersions from the negative binomial likelihood for each feature, and subsequent shrinkage of each dispersion estimate towards the parametric (default) trendline by empirical Bayes;

fitting each feature to the specified class groupings with negative binomial generalized linear models and performing hypothesis testing, for which we chose the default Wald test;

Finally, using the

`results`

function to obtain the resulting BH FDR-corrected p-values.

### 13.4.5 Key output

**TaxaID**: taxa ID.**logFC**: Log2FoldChange by`DESeq2::DESeq`

.**AdjustedPvalue**: Adjusted Pvalue by`stats::p.adjust`

.**Occurrence (All/Groups)**: prevalence of taxa.

### 13.4.7 XMAS2 Syntax

```
run_deseq2(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
p_adjust = "BH",
pvalue_cutoff = 0.05,
...)
# DESeq2::DESeqDataSetFromMatrix(
# countData = otu_tab,
# colData = sam_tab,
# design =~ Compvar)
# DESeq2::DESeq(ddsm)
# DESeq2::results(dds, contrast = c("Compvar", group_names))
```

## 13.5 edgeR

edgeR is from (Robinson and Oshlack 2010)

### 13.5.2 Input Data Type

All the input data type should be **Raw Counts**:

16s rRNA sequencing (amplicon sequencing);

Metagenomic sequencing (absolute abundance);

RNA-seq.

### 13.5.3 Input Key Arguments

**counts**: Raw counts matrix (Rows->Features; columns->Samples) for DEGlist.**design**: A vector of group labels for testing.**contrast**: A string for comparison groups.

### 13.5.4 Procedures

Add a pseudocount of 1 to the non-rarefied feature table and used the function

`calcNormFactors`

from the edgeR package to compute relative log expression normalization factors;Negative binomial dispersion parameters were then estimated using the functions

`estimateCommonDisp`

followed by`estimateTagwiseDisp`

to shrink feature-wise dispersion estimates through an empirical Bayes approach;`exactTest`

for negative binomial data to identify features that differ between the specified groups;The resulting p-values were then corrected for multiple testing with the BH method with the function

`topTags`

.

### 13.5.5 Key output

**TaxaID**: taxa ID.**logFC**: Log2FoldChange by`edgeR::topTags`

.**AdjustedPvalue**: Adjusted Pvalue by`stats::p.adjust`

.**Occurrence (All/Groups)**: prevalence of taxa.

### 13.5.7 XMAS2 Syntax

```
run_edger(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
p_adjust = "BH",
pvalue_cutoff = 0.05,
...)
# # Filter data
# keep <- rowSums(edgeR::cpm(otu_tab) > 100) >= 2
# otu_tab <- otu_tab[keep, ]
#
# # DGEList object: otu_tab -> colnames:Samples; rownames:Taxa
# dge_obj <- edgeR::DGEList(counts = otu_tab)
# dge_obj_factors <- edgeR::calcNormFactors(dge_obj)
# limma::plotMDS(dge_obj_factors, method = "bcv", col = as.numeric(sam_tab$Compvar))
# graphics::legend("bottomleft", group_names, col = 1:2, pch = 20)
#
# # GLM estimates of dispersion
# dge <- edgeR::estimateGLMCommonDisp(dge_obj_factors, design)
# dge <- edgeR::estimateGLMTrendedDisp(dge, design, method = "power")
#
# # You can change method to "auto", "bin.spline", "power", "spline", "bin.loess".
# # The default is "auto" which chooses "bin.spline" when > 200 tags and "power" otherwise.
# dge <- edgeR::estimateGLMTagwiseDisp(dge, design)
#
# # GLM testing for differential expression:
# fit <- edgeR::glmFit(dge, design)
# fit2 <- edgeR::glmLRT(fit, contrast = contrast) # Normal:-1; Tumor:1
# edgeR_res <- edgeR::topTags(fit2, n = nrow(otu_tab)) %>% data.frame()
```

## 13.6 LEfSe

LEfSe is from (Segata et al. 2011)

### 13.6.2 Input Data Type

All the input data type could be **Raw Counts** or **Relative abundance**:

16s rRNA sequencing (amplicon sequencing);

Metagenomic sequencing (absolute abundance);

RNA.

### 13.6.3 Input Key Arguments

**assays**: data.frame or matrix of features (Rows->Features; columns->Samples).**colData**: data.frame or matrix with group information.**kruskal.threshold**: cutoff of KW test pvalue.**wilcox.threshold**: cutoff of wilcox test pvalue.**lda.threshold**: cutoff of LDA scores.

### 13.6.4 Procedures

Normalizing the matrix data using

*CPM: pre-sample normalization of the sum of the values to 1e+06.*;Performing a Kruskal-Wallis (which in our two-group case reduces to the Wilcoxon rank-sum) hypothesis test to identify potential differentially abundant features;

Using differentially abundant features to perform linear discriminant analysis (LDA) of class labels on abundances to estimate the effect sizes for sig- nificant features.

Only features with scaled LDA analysis scores above the threshold (default: 2.0) were called as DA.

### 13.6.7 XMAS2 Syntax

```
run_lefse(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "CPM",
wl.p = 0.05,
Lda = 2)
run_lefse2(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
subgroup = NULL,
transform = c("identity", "log10", "log10p"),
norm = "CPM",
kw_cutoff = 0.05,
lda_cutoff = 2,
bootstrap_n = 30,
bootstrap_fraction = 2 / 3,
wilcoxon_cutoff = 0.05,
multigrp_strat = FALSE,
strict = c("0", "1", "2"),
sample_min = 10,
only_same_subgrp = FALSE,
curv = FALSE)
# se_object <- SummarizedExperiment::SummarizedExperiment(
# assays = list(counts = otu_tab),
# colData = sam_tab,
# metadata = "Profile")
#
# lefse_res <- lefser(se_object,
# kruskal.threshold = 0.05,
# wilcox.threshold = wl.p,
# lda.threshold = Lda,
# groupCol = "Compvar",
# blockCol = NULL,
# assay = 1L,
# trim.names = TRUE)
```

## 13.7 limma voom

limma-voom is from (Law et al. 2014)

### 13.7.2 Input Data Type

All the input data type should be **Raw Counts**:

16s rRNA sequencing (amplicon sequencing);

Metagenomic sequencing (absolute abundance);

RNA-seq.

### 13.7.3 Input Key Arguments

**counts**: a numeric matrix containing raw counts (Rows->Features; columns->Samples).**design**: design matrix with rows corresponding to samples and columns to coefficients to be estimated.**lib.size**: numeric vector containing total library sizes for each sample.**span**: width of the smoothing window used for the lowess mean-variance trend.

### 13.7.4 Procedures

Normalizing table by the trimmed mean of M-values (TMM) or TMM with singleton pairing (TMMwsp) option via

`calcNormFactors function`

(highly sparse data);After normalization, using the limma R package function

`voom`

to convert normalized counts to log2-counts-per-million and assign precision weights to each observation based on the mean-variance trend;Using the functions

`lmFit`

,`eBayes`

, and`topTable`

in the limma R package to fit weighted linear regression models, perform tests based on an empirical Bayes moderated t-statistic76 and obtain BH FDR-corrected p-values.

### 13.7.5 Key output

**TaxaID**: taxa ID.**EffectSize**and**logFC**: both results equals to Log2FoldChange.**AdjustedPvalue**: Adjusted Pvalue by`stats::p.adjust`

.**Occurrence (All/Groups)**: prevalence of taxa.

### 13.7.7 XMAS2 Syntax

```
run_limma_voom(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
voom_span = 0.5,
p_adjust = "BH",
pvalue_cutoff = 0.05,
...)
# # voom fitting
# voom_out <- limma::voom(
# otu_tab,
# design = design,
# lib.size = lib_size,
# span = voom_span)
#
# # linear model
# fit_out <- limma::lmFit(voom_out, design = design)
# test_out <- limma::eBayes(fit_out)
# test_df <- limma::topTable(
# test_out,
# number = nrow(otu_tab),
# adjust.method = p_adjust)
```

## 13.8 MaAsLin2

MaAsLin2 is from (Mallick et al. 2021)

### 13.8.1 Vignette

Multivariable association discovery in population-scale meta-omics studies: MaAsLin2

### 13.8.2 Input Data Type

All the input data type could be **Raw Counts** or **Relative abundance**:

16s rRNA sequencing (amplicon sequencing);

Metagenomic sequencing (absolute abundance);

RNA-seq.

### 13.8.3 Input Key Arguments

**input_data**: The tab-delimited input file of features (Rows->Features; columns->Samples).**input_metadata**: The tab-delimited input file of metadata.**normalization**: The normalization method to apply.**transform**: The transform to apply.**analysis_method**: The analysis method for linear regression to apply.**random_effects**: The random effects for the model.**fixed_effects**: The fixed effects for the model, such as the groups to compare.

### 13.8.4 Procedures

Using

*arcsine square-root transformation (AST)*for transformation and*total sum scaling normalization (TSS)*for normalization;The function fit a linear model (without random effects chosen) to each feature’s transformed abundance on the specified sample grouping, tested significance using a Wald test, and output BH FDR-corrected p-values;

### 13.8.5 Key output

**TaxaID**: taxa ID.**AdjustedPvalue**: Adjusted Pvalue by`stats::p.adjust`

.**Occurrence (All/Groups)**: prevalence of taxa.

### 13.8.6 Original Syntax

```
Maaslin2(
input_data,
input_metadata,
output,
min_abundance = 0.0,
min_prevalence = 0.1,
min_variance = 0.0,
normalization = "TSS",
transform = "LOG",
analysis_method = "LM",
max_significance = 0.25,
random_effects = NULL,
fixed_effects = NULL,
correction = "BH",
standardize = TRUE,
cores = 1,
plot_heatmap = TRUE,
plot_scatter = TRUE,
heatmap_first_n = 50,
reference = NULL
)
```

### 13.8.7 XMAS2 Syntax

```
run_maaslin2(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "LOG",
norm = "TSS",
outdir = "./demo_output",
method = "LM",
p_adjust = "BH",
pvalue_cutoff = 0.05)
# Maaslin2(input_data = otu_tab %>% t() %>% data.frame(), # row->smapleID; col->Taxa
# input_metadata = sam_tab,
# output = outdir,
# normalization = norm,
# transform = transform,
# analysis_method = method,
# max_significance = pvalue_cutoff,
# fixed_effects = "Compvar",
# correction = p_adjust,
# plot_heatmap = FALSE,
# plot_scatter = FALSE,
# cores = 2)
```

## 13.9 metagenomeSeq

metagenomeSeq is from (Paulson et al. 2013)

### 13.9.2 Input Data Type

All the input data type should be **Raw Counts**:

16s rRNA sequencing (amplicon sequencing);

Metagenomic sequencing (absolute abundance);

RNA-seq.

### 13.9.3 Input Key Arguments

**reads**: Raw counts matrix (Rows->Features; columns->Samples).**conditions**: A vector of group labels for testing.**mc.samples**: The number of Monte Carlo samples, default is 128.**denom**: The methods for Geometric Mean calculation, default is “all”.**test**: The test to perform, default is “t.test”.

### 13.9.4 Procedures

Converting the counts and sample information into newMRexperiment object by the function

`newMRexperiment`

from the metagenomeSeq R package;Using

`cumNormStat`

and`cumNorm`

to apply**cumulative sum-scaling normalization (CSS)**, which attempts to normalize sequence counts based on the lower-quartile abundance of features;Using

`fitFeatureModel`

to fit normalized feature counts with zero-inflated log-normal models (with pseudo-counts of 1 added prior to log2 transformation) and perform empirical Bayes moderated t-tests, and`MRfulltable`

to obtain BH FDR-corrected p-values.

### 13.9.7 XMAS2 Syntax

```
run_metagenomeseq(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "CSS",
method = c(
"ZILN", "ZIG"
),
p_adjust = "BH",
pvalue_cutoff = 0.05,
...)
# # run DA
# if (method == "ZILN") {
# fit <- metagenomeSeq::fitFeatureModel(mgs_summarized, mod)
# } else {
# fit <- metagenomeSeq::fitZig(mgs_summarized, mod)
# }
#
# # metagenomeSeq vignette: We recommend the user remove features based on
# # the number of estimated effective samples, please see
# # calculateEffectiveSamples. We recommend removing features with less
# # than the average number of effective samples in all features. In
# # essence, setting eff = .5 when using MRcoefs, MRfulltable, or MRtable.
# res <- metagenomeSeq::MRcoefs(
# fit,
# number = ntaxa(ps_normed_new),
# adjustMethod = p_adjust,
# group = 3,
# eff = 0.5)
# res <- dplyr::rename(
# res,
# Pvalue = .data$pvalues,
# AdjustedPvalue = .data$adjPvalues)
```

## 13.10 t-test

### 13.10.1 Input Data Type

All the input data type could be **Raw Counts** or **relative abundance**:

16s rRNA sequencing (amplicon sequencing);

Metagenomic sequencing (absolute abundance);

RNA-seq.

### 13.10.2 Input Key Arguments

**x**: Raw counts matrix (Rows->Features; columns->Samples).**y**: A vector of group labels for testing.

### 13.10.3 Procedures

Applying

*total sum scaling normalization (TSS)*or not to the feature table;Performing an unpaired Welch’s t-test for each feature to compare the specified groups;

Correcting the resulting p-values with the BH method.

### 13.10.4 Key output

**TaxaID**: taxa ID.**AdjustedPvalue**: Adjusted Pvalue by`stats::p.adjust`

.**Occurrence (All/Groups)**: prevalence of taxa.

### 13.10.6 XMAS2 Syntax

```
run_ttest(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
p_adjust = "BH",
pvalue_cutoff = 0.05,
paired = FALSE,
paired_column = NULL)
# # run wilcox rank sum test
# if (paired) {
# if (!is.null(paired_column)) {
# res <- paired_t(x = otu_tab,
# y = sam_tab,
# group = "Compvar",
# gnames = group_names,
# paired = paired,
# PID = paired_column,
# p_ad = p_adjust)
# } else {
# stop("Please provide the paired column for paired test")
# }
#
# } else {
# res <- apply(otu_tab, 1, function(x, y) {
# dat <- data.frame(value = as.numeric(x), group = y)
# rest <- t.test(data = dat, value ~ group)
# res <- c(rest$statistic, rest$p.value)
# return(res)
# }, sam_tab$Compvar) %>%
# t() %>% data.frame() %>%
# stats::setNames(c("Statistic", "Pvalue")) %>%
# tibble::rownames_to_column("TaxaID") %>%
# dplyr::mutate(AdjustedPvalue = p.adjust(as.numeric(Pvalue), method = p_adjust))
# }
```

## 13.11 Wilcoxon test

### 13.11.1 Input Data Type

All the input data type could be **Raw Counts** or **relative abundance**:

16s rRNA sequencing (amplicon sequencing);

Metagenomic sequencing (absolute abundance);

RNA-seq.

### 13.11.2 Input Key Arguments

**x**: Raw counts matrix (Rows->Features; columns->Samples).**y**: A vector of group labels for testing.

### 13.11.3 Procedures

CLR (after applying a pseudocount of 1) transforms or not to abundances;

Performing Wilcoxon rank-sum tests for each feature to compare the specified sample groupings;

Correcting the resulting p-values with the BH method.

### 13.11.4 Key output

**TaxaID**: taxa ID.**AdjustedPvalue**: Adjusted Pvalue by`stats::p.adjust`

.**Occurrence (All/Groups)**: prevalence of taxa.

### 13.11.6 XMAS2 Syntax

```
run_wilcox(
ps = NULL,
taxa_level = NULL,
data_otu = NULL,
data_sam = NULL,
group,
group_names = NULL,
transform = "identity",
norm = "none",
p_adjust = "BH",
pvalue_cutoff = 0.05,
paired = FALSE,
paired_column = NULL)
# # run wilcox rank sum test
# if (paired) {
# if (!is.null(paired_column)) {
# res <- paired_wilcox(x = otu_tab,
# y = sam_tab,
# group = "Compvar",
# gnames = group_names,
# paired = paired,
# PID = paired_column,
# p_ad = p_adjust)
# } else {
# stop("Please provide the paired column for paired test")
# }
#
# } else {
# res <- apply(otu_tab, 1, function(x, y) {
# dat <- data.frame(value = as.numeric(x), group = y)
# rest <- wilcox.test(data = dat, value ~ group)
# res <- c(rest$statistic, rest$p.value)
# return(res)
# }, sam_tab$Compvar) %>%
# t() %>% data.frame() %>%
# stats::setNames(c("Statistic", "Pvalue")) %>%
# tibble::rownames_to_column("TaxaID") %>%
# dplyr::mutate(AdjustedPvalue = p.adjust(as.numeric(Pvalue), method = p_adjust))
# }
```

## 13.12 RAIDA

RAIDA is from (Sohn, Du, and An 2015)

### 13.12.1 Vignette

A robust approach for identifying differentially abundant features in metagenomic samples: RAIDA

### 13.12.2 Input Data Type

All the input data type should be **Raw Counts**:

16s rRNA sequencing (amplicon sequencing);

Metagenomic sequencing (absolute abundance);

RNA-seq.

### 13.12.3 Input Key Arguments

**c.data**: a data frame containing counts for both conditions.**n.lib**: a vector containing the numbers of samples for both conditions.

### 13.12.4 Procedures

**Approach for Identifying Differential Abundance (RAIDA)** - is a robust approach for identifying differentially abundant features in metagenomic samples across different conditions. It utilizes the ratio between features in a modified zero-inflated log-normal model.

Utilizing the ratios between the counts of features in each sample, eliminating possible problems associated with counts on different scales within and between conditions;

To fit ratios with zeros by EM algorithm, using a modified ZIL (zero-inflated log normal: sparsity);

Constructing a moderated t-statistics for the log ratio of each feature \(y_{ij}\) using the estimated mean \(\mu\) and variance \(r^2\) and obtain \(P\) values for the null hypotheses for all features;

Adjust \(P\) values using a multiple testing correction method

### 13.12.5 Key output

**TaxaID**: taxa ID.**AdjustedPvalue**: Adjusted Pvalue by`stats::p.adjust`

.**Occurrence (All/Groups)**: prevalence of taxa.

## 13.13 Summary

- The most commonly assumed data distribution was the negative binomial distribution (DESeq2, edgeR, omnibus, mbzinb).

No data transformations were performed for negative binomial methods, or metagenomeSeq methods, to try and bring the data to normality as non-normality of data is taken into account in their statistical models.

The remaining parametric methods (ALDEx2 t-test, t-test, limma-voom) all used statistical tests that assumed a Gaussian distribution of the data, therefore, transformations were needed before analysis, which here included a log transform of some kind.

Three methods (ALDEx2 Wilcoxon, ANCOM-II, LEfSe) were considered non-parametric (assumes no underlying distribution of data) as they used statistical tests that transformed data to ranks.

Two of the four negative binomial methods (DESeq2, edgeR) calculated scaling factors for each sample to account for uneven sequence count.

Cumulative sum scaling (CSS) was used for both metagenomeSeq and MaAsLin2. Total sum scaling (TSS; also referred to as relative abundance) was performed for LEfSe, and for method that did not have a built-in normalization function (t-test) as this strategy is commonly used in the literature. Log-ratio based transformations were used for ALDEx2 methods, ANCOM, and, in addition to TSS, were applied to methods without a built-in normalization function.

## 13.14 Advantages and disadvantages

limma-voom, edgeR, Wilcoxon (CLR), and LEfSe output a high number of significant ASVs on average, while ALDEx2 and ANCOM-II tended to identify only a relatively small number of ASVs as significant;

limma-voom, edgeR, Wilcoxon (CLR), and LEfSe methods have been previously found to exhibit a high FDR, in contrast, ANCOM appropriately controls the FDR;

We can clearly recommend that users avoid using edgeR (a tool primarily intended for RNA-seq data) as well as LEfSe (without p-value correction) for conducting DA testing with 16S rRNA gene data.

Users should also be aware that limma voom and the Wilcoxon (CLR) approaches may perform poorly on unfiltered data that is highly sparse.

More generally, we recommend that users employ several methods and focus on significant features identified by most tools.

### References

*Microbiome*2 (1): 1–13.

*Genome Biology*15 (2): 1–17.

*Genome Biology*15 (12): 1–21.

*PLoS Computational Biology*17 (11): e1009442.

*Microbial Ecology in Health and Disease*26 (1): 27663.

*The Annals of Applied Statistics*14 (1): 94.

*Nature Communications*13 (1): 1–16.

*Nature Methods*10 (12): 1200–1202.

*Genome Biology*11 (3): 1–9.

*Genome Biology*12 (6): 1–18.

*Bioinformatics*31 (14): 2269–75.