Chapter 3 Quality Control

Loading packages


Quality control of the DADA2 results helps us make better-informed decisions in the downstream data analysis.

3.1 Read tracking by DADA2

plot_Dada2Track(data = dada2_res$reads_track)

Figure 3.1: Read tracking by DADA2

About 70% of the input reads remain after the final DADA2 step, so this loss should be taken into account when planning the sequencing depth during library construction.
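The retained percentage can be checked numerically as well as read off the plot. The following is a minimal sketch on a toy table; the column names `input` and `nonchim` follow the usual DADA2 tracking table and are an assumption here, since the actual layout of dada2_res$reads_track is not shown in this chapter.

```r
# Toy read-tracking table (hypothetical values, one row per sample).
# `input` = raw reads, `nonchim` = reads left after chimera removal.
track <- data.frame(
  input   = c(100000, 80000, 120000),
  nonchim = c(71000, 54400, 84000),
  row.names = c("A", "B", "C")
)

# Percentage of input reads that survive to the final step, per sample.
retained_pct <- round(100 * track$nonchim / track$input, 1)
names(retained_pct) <- rownames(track)

print(retained_pct)
print(mean(retained_pct))
```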

3.2 Spike-in sample (BRS) assessment

The bacteria in the spike-in sample are identified at the genus level. First, use summarize_taxa to obtain a genus-level phyloseq object, then identify the BRS_ID from the sample data (here S8005, the only sample in the QC group).

dada2_ps_genus <- summarize_taxa(ps = dada2_ps, 
                                 taxa_level = "Genus")

##       Group
## S6030    BB
## S6032    BB
## S6033    BB
## S6035    AA
## S6036    BB
## S6037    AA
## S6040    BB
## S6043    AA
## S6045    BB
## S6046    BB
## S6048    BB
## S6049    AA
## S6050    BB
## S6054    BB
## S6055    BB
## S6058    BB
## S6059    AA
## S6060    AA
## S6061    AA
## S6063    BB
## S6065    AA
## S6066    AA
## S6068    BB
## S8005    QC
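Conceptually, summarizing to the genus level means summing the counts of all features that share a genus label. The sketch below illustrates that idea with base R on a toy matrix; it is not the implementation of summarize_taxa, and the feature names and counts are hypothetical.

```r
# Toy feature-by-sample count matrix (hypothetical values).
counts <- matrix(
  c(5, 1, 0,
    2, 3, 4),
  nrow = 3,
  dimnames = list(c("ASV1", "ASV2", "ASV3"), c("S1", "S2"))
)

# Genus assignment of each feature (hypothetical).
genus <- c(ASV1 = "g__Bacteroides",
           ASV2 = "g__Bacteroides",
           ASV3 = "g__Dorea")

# Sum the rows that belong to the same genus.
genus_counts <- rowsum(counts, group = genus[rownames(counts)])
print(genus_counts)
```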

Then run run_RefCheck2 with the appropriate parameters:

  • BRS_ID: the ID of BRS sample;

  • Reference: the directory of the latest spike-in sample matrix (default: /share/projects/Analytics/analytics/XMAS/RefCheck/);

  • Save: the directory to save the latest spike-in sample matrix (default: /share/projects/Analytics/analytics/XMAS/RefCheck/).

For more details, see ?run_RefCheck2.

run_RefCheck2(
    ps = dada2_ps_genus,
    BRS_ID = "S8005",
    Ref_type = "16s")
## Noting: the Reference Matrix is for 16s
## S8005 is in the Reference Matrix's samples and remove it to run
## ############Matched baterica of the BRS sample#############
## The number of BRS' bacteria matched the Reference Matrix is [15]
## g__Bifidobacterium
## g__Bacteroides
## g__Faecalibacterium
## g__Lactobacillus
## g__Parabacteroides
## g__Collinsella
## g__Coprococcus_3
## g__Dorea
## g__Streptococcus
## g__Roseburia
## g__Anaerostipes
## g__Escherichia_Shigella
## g__Enterococcus
## g__Prevotella_9
## g__Eggerthella
## The number of the additional bacteria compared to Reference Matrix is [1]
## ###########################################################
## ##################Status of the BRS sample##################
## Whether the BRS has the all bateria of Reference Matrix: TRUE
## Correlation Coefficient of the BRS is: 0.9714
## Bray Curtis of the BRS is: 0.07607
## Impurity of Max additional genus (g__Cutibacterium) of the BRS is: 0.06409
## ###########################################################
## #####Final Evaluation Results of the BRS #######
## The BRS of sequencing dataset passed the cutoff of the Reference Matrix 
## Cutoff of Coefficient is 0.8946
## Cutoff of BrayCurtis is 0.1425
## Cutoff of Impurity is 0.1565
## ###########################################################
##               Gold_Cutoff     BRS
## Coef               0.8946 0.97140
## Bray               0.1425 0.07607
## Impurity(max)      1.0000 0.06409
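The pass/fail decision can be read as three threshold comparisons. The sketch below re-checks the values printed in the log above; the direction of each comparison (correlation at or above its cutoff, Bray-Curtis and impurity at or below theirs) is an assumption based on what the metrics mean, not taken from the source of run_RefCheck2. The `bray_curtis` helper is likewise illustrative and is applied to hypothetical profiles.

```r
# Bray-Curtis dissimilarity between two abundance profiles (illustrative helper).
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Hypothetical relative-abundance profiles of a BRS and a reference.
toy_brs <- c(0.5, 0.3, 0.2)
toy_ref <- c(0.4, 0.4, 0.2)
toy_bc  <- bray_curtis(toy_brs, toy_ref)
print(toy_bc)

# Values and cutoffs copied from the log output above.
brs    <- c(Coef = 0.97140, Bray = 0.07607, Impurity = 0.06409)
cutoff <- c(Coef = 0.8946,  Bray = 0.1425,  Impurity = 0.1565)

# Assumed directions: higher correlation is better, lower dissimilarity
# and lower impurity are better.
passed <- c(
  Coef     = brs[["Coef"]]     >= cutoff[["Coef"]],
  Bray     = brs[["Bray"]]     <= cutoff[["Bray"]],
  Impurity = brs[["Impurity"]] <= cutoff[["Impurity"]]
)
print(passed)
print(all(passed))
```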

3.3 Removing the spike-in sample (BRS)

After evaluating the sequencing quality, we remove the BRS.

dada2_ps_remove_BRS <- get_GroupPhyloseq(
                         ps = dada2_ps,
                         group = "Group",
                         group_names = "QC",
                         discard = TRUE)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 896 taxa and 23 samples ]
## sample_data() Sample Data:       [ 23 samples by 1 sample variables ]
## tax_table()   Taxonomy Table:    [ 896 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 896 tips and 893 internal nodes ]
## refseq()      DNAStringSet:      [ 896 reference sequences ]
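Dropping the QC group amounts to filtering the sample metadata and the matching columns of the count table. The following base R sketch shows that operation on toy data; the sample IDs and groups are taken from the table in section 3.2, while the counts are hypothetical.

```r
# Sample metadata: three samples from the table above, one of them the BRS.
meta <- data.frame(
  Group = c("BB", "AA", "QC"),
  row.names = c("S6030", "S6035", "S8005")
)

# Toy taxa-by-sample counts (hypothetical values).
counts <- matrix(
  c(10, 0,
     4, 7,
     3, 9),
  nrow = 2,
  dimnames = list(c("OTU1", "OTU2"), rownames(meta))
)

# Keep every sample that is not in the QC group.
keep <- rownames(meta)[meta$Group != "QC"]
meta_no_brs   <- meta[keep, , drop = FALSE]
counts_no_brs <- counts[, keep, drop = FALSE]

print(meta_no_brs)
print(counts_no_brs)
```

With phyloseq itself, subset_samples(dada2_ps, Group != "QC") expresses the same filter.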

3.4 Rarefaction curves

plot_RarefCurve(ps = dada2_ps_remove_BRS,
                taxa_level = "OTU",
                step = 100,
                label = "Group",
                color = "Group")
## rarefying sample S6030
## rarefying sample S6032
## rarefying sample S6033
## rarefying sample S6035
## rarefying sample S6036
## rarefying sample S6037
## rarefying sample S6040
## rarefying sample S6043
## rarefying sample S6045
## rarefying sample S6046
## rarefying sample S6048
## rarefying sample S6049
## rarefying sample S6050
## rarefying sample S6054
## rarefying sample S6055
## rarefying sample S6058
## rarefying sample S6059
## rarefying sample S6060
## rarefying sample S6061
## rarefying sample S6063
## rarefying sample S6065
## rarefying sample S6066
## rarefying sample S6068

Figure 3.2: Rarefaction curves

The curves show that the samples differ in sequencing depth, but each sample was sequenced deeply enough to capture its full richness.
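A rarefaction curve plots the expected number of observed taxa against subsampling depth. The sketch below computes that expectation for a single toy sample using the standard hypergeometric formula; it is an illustration of the concept, not the code behind plot_RarefCurve, and the counts are hypothetical.

```r
# Expected number of taxa seen when drawing n reads without replacement
# from a sample with per-taxon counts x.
expected_richness <- function(x, n) {
  x <- x[x > 0]
  N <- sum(x)
  # P(taxon i is missed) = choose(N - x_i, n) / choose(N, n), on the log scale.
  p_missed <- exp(lchoose(N - x, n) - lchoose(N, n))
  sum(1 - p_missed)
}

# Toy sample: 100 reads spread over 5 observed taxa (hypothetical values).
counts <- c(50, 30, 15, 4, 1, 0)
depths <- c(1, seq(10, 100, by = 10))

curve <- vapply(depths, function(n) expected_richness(counts, n), numeric(1))
print(data.frame(depth = depths, richness = round(curve, 2)))
```

A curve that flattens before the full depth is reached suggests that additional reads would reveal few new taxa.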

3.5 Summarize phyloseq-class object

Summarize the phyloseq-class object with summarize_phyloseq, which prints a brief overview of the object.

summarize_phyloseq(ps = dada2_ps_remove_BRS)
## Compositional = NO
## [[1]]
## [1] "1] Min. number of reads = 51181"
## [[2]]
## [1] "2] Max. number of reads = 71667"
## [[3]]
## [1] "3] Total number of reads = 1408915"
## [[4]]
## [1] "4] Average number of reads = 61257.1739130435"
## [[5]]
## [1] "5] Median number of reads = 61435"
## [[6]]
## [1] "7] Sparsity = 0.861024844720497"
## [[7]]
## [1] "6] Any OTU sum to 1 or less? YES"
## [[8]]
## [1] "8] Number of singletons = 4"
## [[9]]
## [1] "9] Percent of OTUs that are singletons\n        (i.e. exactly one read detected across all samples)0"
## [[10]]
## [1] "10] Number of sample variables are: 1"
## [[11]]
## [1] "Group"

The minimum number of reads per sample in the phyloseq object is 51181, which we can use as the rarefaction threshold.

Note the sparsity (0.86): the data contain many zeros, which should be kept in mind in the downstream data analysis. This is a common property of amplicon-based microbiota data generated by sequencing.
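Both quantities are simple functions of the count table: the minimum depth is the smallest column sum, and the sparsity is the fraction of zero cells. The sketch below computes them on a toy matrix and then subsamples every sample to the minimum depth; the data and the `rarefy_sample` helper are hypothetical and only illustrate the idea.

```r
set.seed(1)

# Toy taxa-by-sample count matrix (hypothetical values).
counts <- matrix(
  c(10, 0, 5, 0,
     0, 0, 8, 2,
     3, 0, 0, 0),
  nrow = 4,
  dimnames = list(paste0("OTU", 1:4), c("S1", "S2", "S3"))
)

# Minimum sequencing depth across samples: the rarefaction threshold.
min_depth <- min(colSums(counts))

# Sparsity: proportion of cells in the table that are zero.
sparsity <- mean(counts == 0)

# Subsample one sample to `depth` reads without replacement (illustrative helper).
rarefy_sample <- function(x, depth) {
  reads  <- rep(seq_along(x), times = x)
  picked <- reads[sample.int(length(reads), depth)]
  tabulate(picked, nbins = length(x))
}

rarefied <- apply(counts, 2, rarefy_sample, depth = min_depth)
rownames(rarefied) <- rownames(counts)

print(min_depth)
print(sparsity)
print(rarefied)
```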

