Sample | Sequences | Pairs Joined | Join Rate | Average Insert | Unique | Clean | Assigned Function | Assigned Taxonomy | Assigned Both |
---|---|---|---|---|---|---|---|---|---|
S08 | 200000 | 54710 | 54.710% | 164.9 | 49732 | 38306 | 2633 | 657 | 486 |
S016 | 200000 | 40933 | 40.933% | 184.8 | 38873 | 34094 | 4067 | 1120 | 785 |
S017 | 200000 | 48086 | 48.086% | 183.9 | 44870 | 37230 | 4431 | 1381 | 948 |
S030 | 200000 | 53921 | 53.921% | 188.2 | 47044 | 39957 | 3725 | 1081 | 702 |
Samples are sorted based on their Shannon index calculated from taxonomically annotated sequences. The order is most to least diverse.
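The ordering described above can be illustrated with a minimal sketch. It assumes the Shannon index is computed with the natural logarithm over per-taxon read counts; the report does not state the log base or the taxonomic level used, and the sample names and counts below are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical per-taxon read counts for two samples (not from the report).
samples = {
    "even_sample": [10, 10, 10, 10],
    "skewed_sample": [37, 1, 1, 1],
}

# Sort from most to least diverse, as in the summary table.
ranked = sorted(samples, key=lambda s: shannon_index(samples[s]), reverse=True)
```

An evenly distributed sample ranks above a sample dominated by one taxon, even when both contain the same number of taxa.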
Paired-end sequences were evaluated for quality using VSEARCH [1]. Sequence reads were quality trimmed after successful merging using bbmerge [2]. Sequences were allowed to be extended by up to 300 bp during the merging process to account for non-overlapping R1 and R2 sequences (k=60 extend2=60 iterations=5 qtrim2=t). Merged sequences were deduplicated using the clumpify tool [2] and then, by default, filtered of PhiX and rRNA using bbsplit [2]. An arbitrary number of Name:FASTA pairs may be specified for the decontamination step. Functional annotation and taxonomic classification were performed after decontamination.
The blastx algorithm of DIAMOND [3] was used to align nucleotide sequences to the KEGG protein reference database [4], which consists of non-redundant, family-level fungal eukaryotes and genus-level prokaryotes (--strand=both --evalue 0.00001). The highest-scoring alignment per sequence was used for functional annotation.
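Selecting the highest-scoring alignment per sequence might look like the sketch below. It assumes DIAMOND's default 12-column tabular output and ranks hits by bit score; the report does not say how the pipeline itself picks the top hit, and the rows shown are hypothetical.

```python
import csv
import io

# Column names of the default BLAST-style tabular output.
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return {read_id: row} keeping the row with the highest bit score."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t"):
        read_id = row["qseqid"]
        score = float(row["bitscore"])
        if read_id not in best or score > float(best[read_id]["bitscore"]):
            best[read_id] = row
    return best


# Hypothetical alignments: read1 has two hits, read2 has one.
rows = [
    ["read1", "protA", "80.0", "50", "10", "0", "1", "150", "1", "50", "1e-20", "95.0"],
    ["read1", "protB", "90.0", "50", "5", "0", "1", "150", "1", "50", "1e-25", "110.0"],
    ["read2", "protC", "70.0", "40", "12", "0", "1", "120", "1", "40", "1e-10", "60.0"],
]
example = "\n".join("\t".join(r) for r in rows)
hits = best_hits(io.StringIO(example))
```

Ties are resolved in favor of the first hit encountered, which is a choice made for this sketch only.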
Kmer-based taxonomic classification was performed on the merged reads using Kaiju [5] in greedy mode (-a greedy -E 0.05). NCBI's nr database [6] containing reference sequences for archaea, bacteria, viruses, fungi, and microbial eukaryotes was used as the reference index for Kaiju.
    channels:
      - bioconda
      - conda-forge
      - defaults
    dependencies:
      - python=3.6
      - bbmap=37.99
      - click=6.7
      - diamond=0.9.21
      - kaiju=1.6.2
      - numpy
      - pandas=0.23.1
      - plotly=2.7.0
      - snakemake>=5.1.3
      - vsearch=2.6.0
Per sample classifications in tables/ contain:
Header ID | Definition |
---|---|
aa_alignment_length | The length of the DIAMOND blastx hit |
aa_percent_id | The percent ID of the DIAMOND blastx hit; could be used to increase post-processing stringency |
ec | Enzyme Commission number from KEGG; semicolon delimited where multiple |
ko | KEGG entry ID |
product | KEGG gene ID <semicolon> KEGG product |
read_id | The sequence identifier (unique) |
kaiju_alignment_length | The length of the Kaiju hit |
kaiju_classification | The Kaiju classification in order of superkingdom, phylum, order, class, family, genus, species; "NA" for each taxonomic level not defined |
blastx_lca_classification | The LCA result from the blastx HSPs |
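As a rough illustration of working with the kaiju_classification column, the sketch below splits a value into the ranks listed in the table, in the documented order. The semicolon separator and the optional trailing separator are assumptions, as is the example string; check an actual table before relying on either.

```python
# Rank order as documented for kaiju_classification.
RANKS = ["superkingdom", "phylum", "order", "class", "family", "genus", "species"]


def parse_kaiju_classification(value, sep=";"):
    """Map a delimited classification string to {rank: name or None}."""
    parts = [p.strip() for p in value.split(sep)]
    # Drop empty fields left by a trailing separator.
    while parts and parts[-1] == "":
        parts.pop()
    # Pad short strings so every rank is present.
    parts += ["NA"] * (len(RANKS) - len(parts))
    return {rank: (None if name == "NA" else name) for rank, name in zip(RANKS, parts)}


# Hypothetical value with an undefined species level.
example = "Bacteria; Proteobacteria; Enterobacterales; Gammaproteobacteria; Enterobacteriaceae; Escherichia; NA;"
lineage = parse_kaiju_classification(example)
```

Levels recorded as "NA" are returned as None so they can be filtered out before counting.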
Per taxonomy assignments in tables named summaries/taxonomy/<level>.txt contain:
Header ID | Definition |
---|---|
taxonomy_<level> | taxonomic level into which counts have been summed |
sample names | non-normalized, per sample sum at this taxonomic level |
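A summary of this shape can be sketched as follows: one row per taxon, one column per sample, each cell a raw read count. The function name, the input layout, and the labels are hypothetical and only illustrate the summation; they are not the pipeline's own code.

```python
from collections import Counter


def summarize_level(per_sample_labels, level="phylum"):
    """Build (header, rows) of raw counts from {sample: [taxon label per read]}."""
    counts = {s: Counter(labels) for s, labels in per_sample_labels.items()}
    samples = sorted(counts)
    taxa = sorted(set().union(*counts.values()))
    header = ["taxonomy_" + level] + samples
    rows = [[taxon] + [counts[s][taxon] for s in samples] for taxon in taxa]
    return header, rows


# Hypothetical phylum-level labels for the reads of two samples.
labels = {
    "S1": ["Firmicutes", "Firmicutes", "Proteobacteria"],
    "S2": ["Proteobacteria"],
}
header, rows = summarize_level(labels)
```

A taxon absent from a sample gets a count of 0 rather than a missing cell.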
Per function assignments in tables named summaries/function/<type>.txt contain:
Header ID | Definition |
---|---|
<type> | either KO, EC, or product into which counts have been summed |
sample names | non-normalized, per sample sum for this particular functional group |
level_1 | KEGG hierarchy [level 1] if KO defined in first column |
level_2 | KEGG hierarchy [level 2] if KO defined in first column |
level_3 | KEGG hierarchy [level 3] if KO defined in first column |
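Because the per-sample sums are not normalized, samples with different numbers of annotated reads are not directly comparable. One common option, shown here as a sketch and not as part of the pipeline, is to convert counts to per-sample fractions. The KO identifiers and counts are hypothetical.

```python
def relative_abundance(table):
    """Convert {feature: {sample: count}} to per-sample fractions."""
    totals = {}
    for per_sample in table.values():
        for sample, count in per_sample.items():
            totals[sample] = totals.get(sample, 0) + count
    return {
        feature: {
            sample: (count / totals[sample] if totals[sample] else 0.0)
            for sample, count in per_sample.items()
        }
        for feature, per_sample in table.items()
    }


# Hypothetical KO counts for two samples; S2 has no assigned reads.
ko_counts = {
    "K00001": {"S1": 3, "S2": 0},
    "K00002": {"S1": 1, "S2": 0},
}
fractions = relative_abundance(ko_counts)
```

Samples with a total of zero are given fractions of 0.0 to avoid division by zero.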
Per taxonomy+function assignments in tables named summaries/combined/<type>_<level>.txt contain:
Header ID | Definition |
---|---|
<type> | either KO, EC, or product; counts are summed using <type>+<taxonomy> |
taxonomy_<level> | taxonomic level; counts are summed using <type>+<taxonomy> |
sample names | non-normalized, per sample sum for this particular combination of functional group and taxon |
level_1 | KEGG hierarchy [level 1] if KO defined in first column |
level_2 | KEGG hierarchy [level 2] if KO defined in first column |
level_3 | KEGG hierarchy [level 3] if KO defined in first column |
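Summing on <type>+<taxonomy> amounts to counting reads by a two-part key. The sketch below assumes each read is a dictionary carrying a function label and a taxon label, and that reads missing either label are skipped; both are assumptions made for illustration, and the records are hypothetical.

```python
from collections import Counter


def combined_counts(reads, type_key="ko", level="phylum"):
    """Count reads per (function, taxon) pair, skipping incomplete reads."""
    counts = Counter()
    for read in reads:
        function = read.get(type_key)
        taxon = read.get(level)
        if function and taxon:
            counts[(function, taxon)] += 1
    return counts


# Hypothetical annotated reads from one sample.
reads = [
    {"ko": "K00001", "phylum": "Firmicutes"},
    {"ko": "K00001", "phylum": "Firmicutes"},
    {"ko": "K00001", "phylum": "Proteobacteria"},
    {"ko": None, "phylum": "Firmicutes"},
    {"ko": "K00002", "phylum": None},
]
pairs = combined_counts(reads)
```

Running this once per sample and aligning the keys would give the per-sample columns of a combined table.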