| Sample | Sequences | Pairs Joined | Join Rate | Average Insert | Unique | Clean | Assigned Function | Assigned Taxonomy | Assigned Both | 
|---|---|---|---|---|---|---|---|---|---|---|
| S08 | 200000 | 54710 | 54.710% | 164.9 | 49732 | 38306 | 2633 | 657 | 486 | 
| S016 | 200000 | 40933 | 40.933% | 184.8 | 38873 | 34094 | 4067 | 1120 | 785 | 
| S017 | 200000 | 48086 | 48.086% | 183.9 | 44870 | 37230 | 4431 | 1381 | 948 | 
| S030 | 200000 | 53921 | 53.921% | 188.2 | 47044 | 39957 | 3725 | 1081 | 702 | 
Samples are sorted based on their Shannon index calculated from taxonomically annotated sequences. The order is most to least diverse.
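As a rough illustration of the ordering described above, the following minimal sketch computes a Shannon index from per-taxon counts and ranks samples from most to least diverse. The natural logarithm, the taxonomic level, and the `taxon_counts` values are assumptions made for the example; the report does not state which of these the pipeline uses.

```python
import math

def shannon_index(counts):
    """Return Shannon's H' (natural log) for a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h

# Hypothetical per-sample taxon counts, for illustration only.
taxon_counts = {
    "sample_a": [10, 10, 10, 10],  # even community, higher diversity
    "sample_b": [37, 1, 1, 1],     # dominated by one taxon, lower diversity
}

# Most diverse first, matching the order used in the summary table.
ranked = sorted(taxon_counts, key=lambda s: shannon_index(taxon_counts[s]), reverse=True)
```

An evenly distributed community of four taxa gives the maximum value for that richness, ln(4), which is why `sample_a` ranks ahead of `sample_b` here.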
Paired-end sequences were evaluated for quality using VSEARCH [1]. Reads were merged using bbmerge [2] and quality trimmed after successful merging. To account for non-overlapping R1 and R2 sequences, reads were allowed to be extended by up to 300 bp during the merging process (k=60 extend2=60 iterations=5 qtrim2=t). Merged sequences were deduplicated using the clumpify tool [2] and then, by default, filtered of PhiX and rRNA using bbsplit [2]. An arbitrary number of Name:FASTA pairs may be specified for the decontamination step. Functional annotation and taxonomic classification were performed after decontamination.
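The Join Rate column of the summary table appears to be the number of joined pairs divided by the number of read pairs, where the Sequences column counts R1 and R2 reads together. That reading is an inference from the reported numbers rather than something the pipeline documents, and the helper `join_rate` below is hypothetical.

```python
def join_rate(sequences, pairs_joined):
    """Percent of read pairs merged.

    Assumes `sequences` counts R1 and R2 reads together, so the number
    of read pairs is half of it (an inference from the summary table).
    """
    read_pairs = sequences / 2
    return 100.0 * pairs_joined / read_pairs

# Values for sample S08 taken from the summary table.
rate_s08 = join_rate(200000, 54710)
print(f"{rate_s08:.3f}%")
```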
The blastx algorithm of DIAMOND [3] was used to align nucleotide sequences to the KEGG protein reference database [4], which consists of non-redundant, family-level fungal eukaryotes and genus-level prokaryotes (--strand=both --evalue 0.00001). The highest-scoring alignment per sequence was used for functional annotation.
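Selecting the highest-scoring alignment per sequence can be sketched as follows. The sketch assumes the standard 12-column BLAST tabular layout and uses bit score as the ranking criterion; the pipeline's actual intermediate format and tie-breaking are not described here, and the read and subject IDs in the example are made up.

```python
import csv
import io

# Standard 12-column BLAST tabular layout.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hits(handle, max_evalue=1e-5):
    """Keep the highest-bitscore alignment per query at or below max_evalue."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t"):
        if float(row["evalue"]) > max_evalue:
            continue
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best

# Hypothetical alignments for illustration only.
example = io.StringIO(
    "read1\tgeneA\t90.0\t50\t5\t0\t1\t150\t1\t50\t1e-20\t100.0\n"
    "read1\tgeneB\t95.0\t50\t2\t0\t1\t150\t1\t50\t1e-25\t120.0\n"
    "read2\tgeneC\t60.0\t30\t12\t0\t1\t90\t1\t30\t1e-3\t35.0\n"
)
hits = best_hits(example)
```

Here `read1` keeps only its `geneB` alignment, and `read2` is dropped because its e-value exceeds the 0.00001 threshold.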
Kmer-based taxonomic classification was performed on the merged reads using Kaiju [5] in greedy mode (-a greedy -E 0.05). NCBI's nr database [6] containing reference sequences for archaea, bacteria, viruses, fungi, and microbial eukaryotes was used as the reference index for Kaiju.
channels:
    - bioconda
    - conda-forge
    - defaults
dependencies:
    - python=3.6
    - bbmap=37.99
    - click=6.7
    - diamond=0.9.21
    - kaiju=1.6.2
    - numpy
    - pandas=0.23.1
    - plotly=2.7.0
    - snakemake>=5.1.3
    - vsearch=2.6.0
Per-sample classification tables in tables/ contain:
| Header ID | Definition | 
|---|---|
| aa_alignment_length | The length of the DIAMOND blastx hit | 
| aa_percent_id | The percent ID of the DIAMOND blastx hit; could be used to increase post-processing stringency | 
| ec | Enzyme Commission number from KEGG; semicolon delimited where multiple | 
| ko | KEGG entry ID | 
| product | KEGG gene ID <semicolon> KEGG product | 
| read_id | The sequence identifier (unique) | 
| kaiju_alignment_length | The length of the Kaiju hit | 
| kaiju_classification | The Kaiju classification in order of superkingdom, phylum, order, class, family, genus, species; "NA" for each taxonomic level not defined | 
| blastx_lca_classification | The LCA result from the blastx HSPs | 
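The `aa_percent_id` column can be used to tighten stringency after the run. The sketch below assumes the per-sample tables are tab-delimited with a header row and that reads without a DIAMOND hit carry a non-numeric placeholder; neither detail is confirmed here. The 60% cutoff and the example rows are arbitrary.

```python
import csv
import io

def filter_by_identity(handle, min_percent_id=60.0):
    """Yield rows whose aa_percent_id is at least min_percent_id."""
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            identity = float(row.get("aa_percent_id"))
        except (TypeError, ValueError):
            # Missing or non-numeric value: treated as "no functional hit".
            continue
        if identity >= min_percent_id:
            yield row

# Hypothetical table excerpt for illustration only.
example = io.StringIO(
    "read_id\taa_percent_id\tko\n"
    "read1\t95.2\tK00001\n"
    "read2\t41.0\tK00002\n"
    "read3\tNA\tNA\n"
)
kept = [row["read_id"] for row in filter_by_identity(example, min_percent_id=60.0)]
```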
Per-taxonomy assignment tables named summaries/taxonomy/<level>.txt contain:
| Header ID | Definition | 
|---|---|
| taxonomy_<level> | taxonomic level into which counts have been summed | 
| sample names | non-normalized, per sample sum at this taxonomic level | 
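Because these counts are not normalized, comparing samples usually calls for converting them to proportions first. A minimal sketch, assuming a tab-delimited file whose first column is `taxonomy_<level>` and whose remaining columns are samples; the counts in the example are invented.

```python
import csv
import io

def relative_abundance(handle, taxonomy_column):
    """Convert per-sample raw counts to fractions of each sample's column total."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    if not rows:
        return {}
    samples = [name for name in rows[0] if name != taxonomy_column]
    totals = {s: sum(float(r[s]) for r in rows) for s in samples}
    result = {}
    for r in rows:
        result[r[taxonomy_column]] = {
            s: (float(r[s]) / totals[s] if totals[s] else 0.0) for s in samples
        }
    return result

# Hypothetical phylum-level counts for two samples, for illustration only.
example = io.StringIO(
    "taxonomy_phylum\tS08\tS016\n"
    "Proteobacteria\t300\t100\n"
    "Firmicutes\t100\t300\n"
)
fractions = relative_abundance(example, "taxonomy_phylum")
```

Total-sum scaling is only one option; other normalization schemes may suit a given analysis better.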
Per-function assignment tables named summaries/function/<type>.txt contain:
| Header ID | Definition | 
|---|---|
| <type> | either KO, EC, or product into which counts have been summed | 
| sample names | non-normalized, per sample sum for this particular functional group | 
| level_1 | KEGG hierarchy [level 1] if KO defined in first column | 
| level_2 | KEGG hierarchy [level 2] if KO defined in first column | 
| level_3 | KEGG hierarchy [level 3] if KO defined in first column | 
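The hierarchy columns make it possible to roll KO counts up to broader categories. The sketch below sums sample columns per `level_1` value, assuming a tab-delimited file and a first-column header of `ko`; the actual header name, the example KO rows, and the placeholder hierarchy labels are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict

def sum_by_level(handle, type_column="ko", level="level_1"):
    """Sum per-sample counts for each value of the chosen KEGG hierarchy level."""
    meta = {type_column, "level_1", "level_2", "level_3"}
    totals = defaultdict(lambda: defaultdict(float))
    for row in csv.DictReader(handle, delimiter="\t"):
        # Rows without a defined hierarchy are grouped under "unassigned".
        group = row[level] or "unassigned"
        for name, value in row.items():
            if name not in meta:
                totals[group][name] += float(value)
    return {group: dict(per_sample) for group, per_sample in totals.items()}

# Hypothetical rows; counts and hierarchy labels are illustrative only.
example = io.StringIO(
    "ko\tS08\tS016\tlevel_1\tlevel_2\tlevel_3\n"
    "K00001\t5\t7\tMetabolism\tl2_placeholder\tl3_placeholder\n"
    "K00002\t3\t1\tMetabolism\tl2_placeholder\tl3_placeholder\n"
    "K99999\t2\t4\t\t\t\n"
)
by_level = sum_by_level(example)
```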
Per-taxonomy+function assignment tables named summaries/combined/<type>_<level>.txt contain:
| Header ID | Definition | 
|---|---|
| <type> | either KO, EC, or product; counts are summed using <type>+<taxonomy> | 
| taxonomy_<level> | taxonomic level; counts are summed using <type>+<taxonomy> | 
| sample names | non-normalized, per sample sum for this particular <type>+<taxonomy> combination | 
| level_1 | KEGG hierarchy [level 1] if KO defined in first column | 
| level_2 | KEGG hierarchy [level 2] if KO defined in first column | 
| level_3 | KEGG hierarchy [level 3] if KO defined in first column |