| Sample | Sequences | Pairs Joined | Join Rate | Average Insert | Unique | Clean | Assigned Function | Assigned Taxonomy | Assigned Both |
|---|---|---|---|---|---|---|---|---|---|
| S08 | 200000 | 54710 | 54.710% | 164.9 | 49732 | 38306 | 2633 | 657 | 486 |
| S016 | 200000 | 40933 | 40.933% | 184.8 | 38873 | 34094 | 4067 | 1120 | 785 |
| S017 | 200000 | 48086 | 48.086% | 183.9 | 44870 | 37230 | 4431 | 1381 | 948 |
| S030 | 200000 | 53921 | 53.921% | 188.2 | 47044 | 39957 | 3725 | 1081 | 702 |
Samples are sorted based on their Shannon index calculated from taxonomically annotated sequences. The order is most to least diverse.
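
As a rough illustration of the ordering described above, the sketch below computes a Shannon index from per-taxon read counts and sorts samples from most to least diverse. The natural logarithm, the function name `shannon_index`, and the toy counts are assumptions for illustration; the report does not state which log base or taxonomic level is used.

```python
import math


def shannon_index(counts):
    """Return H = -sum(p_i * ln(p_i)) over the non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h


# Hypothetical per-taxon read counts (not taken from the report).
samples = {
    "even": [25, 25, 25, 25],
    "skewed": [97, 1, 1, 1],
}

# Sort most to least diverse, mirroring the ordering described above.
ranked = sorted(samples, key=lambda name: shannon_index(samples[name]), reverse=True)
print(ranked)
```

A perfectly even community of four taxa reaches the maximum value ln(4), while a community dominated by one taxon scores close to zero.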
Paired-end sequences were evaluated for quality using VSEARCH [1]. Sequence reads were quality trimmed after successful merging using bbmerge [2]. During merging, sequences were allowed to be extended by up to 300 bp to account for non-overlapping R1 and R2 sequences (k=60 extend2=60 iterations=5 qtrim2=t). Merged sequences were deduplicated using the clumpify tool [2] and then, by default, filtered of PhiX and rRNA using bbsplit [2]. An arbitrary number of Name:FASTA pairs may be specified for the decontamination step. Functional annotation and taxonomic classification were performed following the decontamination step.
The blastx algorithm of DIAMOND [3] was used to align nucleotide sequences to the KEGG protein reference database [4] consisting of non-redundant, family level fungal eukaryotes and genus level prokaryotes (--strand=both --evalue 0.00001). The highest scoring alignment per sequence was used for functional annotation.
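
Selecting the highest scoring alignment per sequence can be sketched as follows. This is a minimal example, assuming the alignments are in DIAMOND's default 12-column tabular format and that "highest scoring" means the largest bit score; the function name `best_hits` and the toy rows are hypothetical and not part of the pipeline.

```python
import csv
import io

# Column order of DIAMOND's default tabular output (BLAST outfmt 6 style).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Keep, for each query read, the alignment with the largest bit score."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        hit = dict(zip(COLUMNS, row))
        score = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or score > float(current["bitscore"]):
            best[hit["qseqid"]] = hit
    return best


# Hypothetical alignments: read1 has two hits, read2 has one.
rows = [
    ["read1", "protA", "80.0", "50", "10", "0", "1", "150", "1", "50", "1e-20", "95.5"],
    ["read1", "protB", "90.0", "48", "5", "0", "1", "144", "3", "50", "1e-25", "110.2"],
    ["read2", "protC", "70.0", "40", "12", "0", "4", "123", "10", "49", "1e-10", "60.1"],
]
example = io.StringIO("\n".join("\t".join(r) for r in rows))

top = best_hits(example)
print({read: hit["sseqid"] for read, hit in top.items()})
```

Ties are resolved in favor of the first alignment encountered, which is a choice made for this sketch only.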
Kmer-based taxonomic classification was performed on the merged reads using Kaiju [5] in greedy mode (-a greedy -E 0.05). NCBI's nr database [6] containing reference sequences for archaea, bacteria, viruses, fungi, and microbial eukaryotes was used as the reference index for Kaiju.
channels:
- bioconda
- conda-forge
- defaults
dependencies:
- python=3.6
- bbmap=37.99
- click=6.7
- diamond=0.9.21
- kaiju=1.6.2
- numpy
- pandas=0.23.1
- plotly=2.7.0
- snakemake>=5.1.3
- vsearch=2.6.0
Per-sample classification tables in tables/ contain:
| Header ID | Definition |
|---|---|
| aa_alignment_length | The length of the DIAMOND blastx hit |
| aa_percent_id | The percent ID of the DIAMOND blastx hit; could be used to increase post-processing stringency |
| ec | Enzyme Commission number from KEGG; semicolon delimited where multiple |
| ko | KEGG entry ID |
| product | KEGG gene ID <semicolon> KEGG product |
| read_id | The sequence identifier (unique) |
| kaiju_alignment_length | The length of the Kaiju hit |
| kaiju_classification | The Kaiju classification in order of superkingdom, phylum, order, class, family, genus, species; "NA" for each taxonomic level not defined |
| blastx_lca_classification | The LCA result from the blastx HSPs |
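
The aa_percent_id and aa_alignment_length columns can be used to tighten results after the run. The sketch below is one way to do that, assuming the per-sample tables are tab-delimited with the headers listed above; the thresholds, the function name `filter_hits`, and the toy rows are hypothetical.

```python
import csv
import io


def filter_hits(handle, min_percent_id=60.0, min_length=30.0):
    """Return rows whose blastx hit meets both (hypothetical) thresholds."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            percent_id = float(row["aa_percent_id"])
            length = float(row["aa_alignment_length"])
        except (TypeError, ValueError):
            # Reads without a functional hit have no numeric values here.
            continue
        if percent_id >= min_percent_id and length >= min_length:
            kept.append(row)
    return kept


# Toy table with placeholder values; r3 has no blastx hit.
table = io.StringIO(
    "read_id\taa_alignment_length\taa_percent_id\tko\n"
    "r1\t52\t91.3\tK00001\n"
    "r2\t45\t48.0\tK00002\n"
    "r3\t\t\t\n"
)

kept = filter_hits(table)
print([row["read_id"] for row in kept])
```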
Per-taxonomy assignment tables, named summaries/taxonomy/<level>.txt, contain:
| Header ID | Definition |
|---|---|
| taxonomy_<level> | taxonomic level into which counts have been summed |
| sample names | non-normalized, per sample sum at this taxonomic level |
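
Because the counts in these tables are not normalized, samples with different numbers of assigned reads are not directly comparable. A minimal sketch of converting counts to per-sample relative abundance is shown below; the function name `relative_abundance` and the counts are hypothetical, and other normalization schemes may suit a given analysis better.

```python
def relative_abundance(counts):
    """Convert {taxon: {sample: count}} into per-sample fractions."""
    totals = {}
    for per_sample in counts.values():
        for sample, n in per_sample.items():
            totals[sample] = totals.get(sample, 0) + n
    return {
        taxon: {
            sample: (n / totals[sample] if totals[sample] else 0.0)
            for sample, n in per_sample.items()
        }
        for taxon, per_sample in counts.items()
    }


# Hypothetical phylum-level counts for two samples (values are made up).
counts = {
    "Proteobacteria": {"S08": 30, "S016": 10},
    "Firmicutes": {"S08": 10, "S016": 30},
}

rel = relative_abundance(counts)
print(rel["Proteobacteria"]["S08"])
```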
Per-function assignment tables, named summaries/function/<type>.txt, contain:
| Header ID | Definition |
|---|---|
| <type> | either KO, EC, or product into which counts have been summed |
| sample names | non-normalized, per sample sum for this particular functional group |
| level_1 | KEGG hierarchy [level 1] if KO defined in first column |
| level_2 | KEGG hierarchy [level 2] if KO defined in first column |
| level_3 | KEGG hierarchy [level 3] if KO defined in first column |
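
The KEGG hierarchy columns allow KO-level counts to be rolled up into broader categories. The sketch below sums sample columns by level_1, assuming a tab-delimited table whose first column is named `ko` and whose remaining non-hierarchy columns are samples; the column name, the function name `sum_by_level`, and the toy rows are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict

HIERARCHY = ("level_1", "level_2", "level_3")


def sum_by_level(handle, level="level_1", id_column="ko"):
    """Sum per-sample counts over all rows sharing the same hierarchy value."""
    totals = defaultdict(lambda: defaultdict(int))
    reader = csv.DictReader(handle, delimiter="\t")
    # Every column that is neither the ID nor a hierarchy level is a sample.
    sample_columns = [
        c for c in reader.fieldnames if c != id_column and c not in HIERARCHY
    ]
    for row in reader:
        group = row[level] or "unassigned"
        for sample in sample_columns:
            totals[group][sample] += int(row[sample])
    return {group: dict(per_sample) for group, per_sample in totals.items()}


# Toy table with made-up counts; the last row has no hierarchy defined.
table = io.StringIO(
    "ko\tS08\tS016\tlevel_1\tlevel_2\tlevel_3\n"
    "K00001\t5\t2\tMetabolism\tCarbohydrate metabolism\tGlycolysis / Gluconeogenesis\n"
    "K00002\t3\t4\tMetabolism\tCarbohydrate metabolism\tPentose and glucuronate interconversions\n"
    "K99999\t1\t0\t\t\t\n"
)

by_level = sum_by_level(table)
print(by_level)
```

Rows without a hierarchy value are grouped under "unassigned" here so that column totals are preserved.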
Per taxonomy+function assignment tables, named summaries/combined/<type>_<level>.txt, contain:
| Header ID | Definition |
|---|---|
| <type> | either KO, EC, or product; counts are summed using <type>+<taxonomy> |
| taxonomy_<level> | taxonomic level; counts are summed using <type>+<taxonomy> |
| sample names | non-normalized, per sample sum for this function and taxonomy combination |
| level_1 | KEGG hierarchy [level 1] if KO defined in first column |
| level_2 | KEGG hierarchy [level 2] if KO defined in first column |
| level_3 | KEGG hierarchy [level 3] if KO defined in first column |