Summary

Sequence Counts

Sample	Sequences	Pairs Joined	Join Rate	Average Insert	Unique	Clean	Assigned Function	Assigned Taxonomy	Assigned Both
S08	200000	54710	54.710%	164.9	49732	38306	2633	657	486
S016	200000	40933	40.933%	184.8	38873	34094	4067	1120	785
S017	200000	48086	48.086%	183.9	44870	37230	4431	1381	948
S030	200000	53921	53.921%	188.2	47044	39957	3725	1081	702
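
The Join Rate column appears to be Pairs Joined divided by the number of read pairs, assuming that Sequences counts both R1 and R2 reads (so 200000 sequences correspond to 100000 pairs). The report does not state this definition; the sketch below only shows that the tabulated values are consistent with it. The function name `join_rate` is illustrative.

```python
def join_rate(sequences: int, pairs_joined: int) -> float:
    """Percent of read pairs merged.

    Assumption: `sequences` counts R1 and R2 reads together,
    so the number of read pairs is sequences / 2.
    """
    read_pairs = sequences / 2
    return 100.0 * pairs_joined / read_pairs


# (Sequences, Pairs Joined) copied from the table above.
rows = {
    "S08": (200000, 54710),
    "S016": (200000, 40933),
    "S017": (200000, 48086),
    "S030": (200000, 53921),
}

for sample, (sequences, joined) in rows.items():
    print(f"{sample}\t{join_rate(sequences, joined):.3f}%")
```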

Sequence Quality

Taxonomy by Count

Samples are sorted by Shannon index, calculated from taxonomically annotated sequences, from most to least diverse.
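
As a minimal sketch of this ordering, the Shannon index can be computed as H = -sum(p_i * ln(p_i)) over the proportions p_i of each taxon. The natural logarithm and the per-sample count lists are assumptions; the report does not state which log base or taxonomic level is used. The names `shannon_index` and `sort_by_diversity` are illustrative.

```python
import math


def shannon_index(counts):
    """Shannon index H = -sum(p * ln p) over non-zero taxon counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def sort_by_diversity(sample_counts):
    """Return sample names ordered from most to least diverse."""
    return sorted(
        sample_counts,
        key=lambda sample: shannon_index(sample_counts[sample]),
        reverse=True,
    )


# Made-up counts for illustration only; not taken from this report.
example = {"even": [25, 25, 25, 25], "skewed": [97, 1, 1, 1]}
print(sort_by_diversity(example))
```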

Taxonomy by Percent

Methods

Paired-end sequences were evaluated for quality using VSEARCH [1]. Sequence reads were quality trimmed after successful merging using bbmerge [2]. During merging, sequences were allowed to be extended by up to 300 bp to account for non-overlapping R1 and R2 sequences (k=60 extend2=60 iterations=5 qtrim2=t). Merged sequences were deduplicated using the clumpify tool [2] and then, by default, filtered of PhiX and rRNA using bbsplit [2]. An arbitrary number of Name:FASTA pairs may be specified during the decontamination process. Functional annotation and taxonomic classification were performed following the decontamination step.
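
The merging parameters quoted above could be assembled into a command line as sketched below. Only the four options in parentheses come from this report; the file names are hypothetical, and the pipeline's actual invocation may include further options. The 300 bp limit matches 5 iterations of 60 bp extension. The command is only built here, not executed.

```python
def bbmerge_command(r1, r2, merged):
    """Build a bbmerge argument list using the parameters quoted in Methods."""
    return [
        "bbmerge.sh",
        f"in1={r1}",        # forward reads (hypothetical path)
        f"in2={r2}",        # reverse reads (hypothetical path)
        f"out={merged}",    # merged output (hypothetical path)
        "k=60",
        "extend2=60",       # extend by 60 bp per iteration on failed merges
        "iterations=5",     # 5 x 60 bp = up to 300 bp of extension
        "qtrim2=t",
    ]


command = bbmerge_command("sample_R1.fastq", "sample_R2.fastq", "sample_merged.fastq")
print(" ".join(command))
```

To run it, the list could be passed to `subprocess.run(command, check=True)` on a system where BBTools is installed.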

Functional Annotation

The blastx algorithm of DIAMOND [3] was used to align nucleotide sequences to the KEGG protein reference database [4] consisting of non-redundant, family level fungal eukaryotes and genus level prokaryotes (--strand=both --evalue 0.00001). The highest scoring alignment per sequence was used for functional annotation.
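
Selecting the highest scoring alignment per sequence can be sketched as follows. This assumes DIAMOND's default 12-column tabular output (query ID in the first column, bit score in the twelfth) and that "highest scoring" means highest bit score; the report does not confirm either detail. The rows in `example` are made up.

```python
import csv
import io


def best_hits(handle):
    """Keep the row with the highest bit score for each query sequence."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        query, bitscore = row[0], float(row[11])
        # Replace the stored hit only when this one scores strictly higher.
        if query not in best or bitscore > float(best[query][11]):
            best[query] = row
    return best


# Made-up alignments for illustration only.
example = io.StringIO(
    "read1\tgeneA\t90.0\t50\t5\t0\t1\t150\t1\t50\t1e-20\t95.0\n"
    "read1\tgeneB\t98.0\t50\t1\t0\t1\t150\t1\t50\t1e-25\t110.0\n"
    "read2\tgeneC\t75.0\t40\t10\t0\t1\t120\t1\t40\t1e-10\t60.0\n"
)
for query, hit in best_hits(example).items():
    print(query, hit[1], hit[11])
```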

Taxonomic Annotation

Kmer-based taxonomic classification was performed on the merged reads using Kaiju [5] in greedy mode (-a greedy -E 0.05). NCBI's nr database [6] containing reference sequences for archaea, bacteria, viruses, fungi, and microbial eukaryotes was used as the reference index for Kaiju.

References

[1] Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: a versatile open source tool for metagenomics. PeerJ. PeerJ Inc; 2016;4:e2584.
[2] Bushnell B. BBTools [Internet]. Available from: https://sourceforge.net/projects/bbmap/
[3] Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. Nature Publishing Group; 2015;12:59–60.
[4] Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 2016;44:D457–62.
[5] Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. Nature Publishing Group; 2016;7:11257.
[6] NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2018;46:D8–D13.

Execution Environment

channels:
    - bioconda
    - conda-forge
    - defaults
dependencies:
    - python=3.6
    - bbmap=37.99
    - click=6.7
    - diamond=0.9.21
    - kaiju=1.6.2
    - numpy
    - pandas=0.23.1
    - plotly=2.7.0
    - snakemake>=5.1.3
    - vsearch=2.6.0

Output

Classification Tables

Per sample classifications in tables/ contain:

Header ID	Definition
aa_alignment_length	The length of the DIAMOND blastx hit
aa_percent_id	The percent ID of the DIAMOND blastx hit; could be used to increase post-processing stringency
ec	Enzyme Commission number from KEGG; semicolon delimited where multiple
ko	KEGG entry ID
product	KEGG gene ID <semicolon> KEGG product
read_id	The sequence identifier (unique)
kaiju_alignment_length	The length of the Kaiju hit
kaiju_classification	The Kaiju classification in order of superkingdom, phylum, order, class, family, genus, species; "NA" for each taxonomic level not defined
blastx_lca_classification	The LCA result from the blastx HSPs
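
The aa_percent_id column can be used to tighten results after the run. The sketch below assumes the per-sample tables are tab-delimited with a header row using the Header IDs listed above; the 80% cutoff, the function name `filter_by_identity`, and the example rows are arbitrary illustrations. Rows without a numeric aa_percent_id are dropped.

```python
import csv
import io


def filter_by_identity(handle, min_percent_id):
    """Return rows whose aa_percent_id is at least `min_percent_id`."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            percent_id = float(row.get("aa_percent_id", ""))
        except (TypeError, ValueError):
            continue  # no functional hit recorded for this read
        if percent_id >= min_percent_id:
            kept.append(row)
    return kept


# Made-up rows for illustration only.
example = io.StringIO(
    "read_id\taa_percent_id\tko\n"
    "read1\t95.2\tK00001\n"
    "read2\t61.0\tK00002\n"
    "read3\t\t\n"
)
for row in filter_by_identity(example, 80.0):
    print(row["read_id"], row["aa_percent_id"], row["ko"])
```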

Summary Tables

Taxonomy

Per taxonomy assignments in tables named summaries/taxonomy/<level>.txt contain:

Header ID	Definition
taxonomy_<level>	taxonomic level into which counts have been summed
sample names	non-normalized, per sample sum at this taxonomic level
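
Because these counts are not normalized, comparing samples usually means converting each column to percentages. A minimal sketch, assuming a tab-delimited table whose first column is the taxon and whose remaining columns are per-sample counts; the function name `to_percent` and the example values are illustrative and not taken from this report.

```python
import csv
import io


def to_percent(handle):
    """Convert per-sample counts to percent of each sample's column total."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    rows = [(r[0], [float(x) for x in r[1:]]) for r in reader if r]
    # Column totals, one per sample.
    totals = [sum(column) for column in zip(*(values for _, values in rows))]
    percents = {
        name: [100.0 * v / t if t else 0.0 for v, t in zip(values, totals)]
        for name, values in rows
    }
    return header[1:], percents


# Made-up counts for illustration only.
example = io.StringIO(
    "taxonomy_phylum\tsampleA\tsampleB\n"
    "Proteobacteria\t30\t10\n"
    "Firmicutes\t10\t10\n"
)
samples, percents = to_percent(example)
print(samples)
for taxon, values in percents.items():
    print(taxon, values)
```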

Function

Per function assignments in tables named summaries/function/<type>.txt contain:

Header ID	Definition
<type>	either KO, EC, or product into which counts have been summed
sample names	non-normalized, per sample sum for this particular functional group
level_1	KEGG hierarchy [level 1] if KO defined in first column
level_2	KEGG hierarchy [level 2] if KO defined in first column
level_3	KEGG hierarchy [level 3] if KO defined in first column
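
The hierarchy columns make it possible to roll counts up to a broader KEGG category. The sketch below sums per-sample counts by level_1, assuming a tab-delimited table with a header row in which every column other than the first and level_1 to level_3 is a sample. Rows with an empty hierarchy value are skipped. The function name `sum_by_level` and the example rows are illustrative.

```python
import csv
import io
from collections import defaultdict

HIERARCHY = ("level_1", "level_2", "level_3")


def sum_by_level(handle, level="level_1"):
    """Sum per-sample counts for each value of the chosen hierarchy level."""
    reader = csv.DictReader(handle, delimiter="\t")
    fields = reader.fieldnames or []
    # Sample columns: everything after the first column except the hierarchy.
    samples = [f for f in fields[1:] if f not in HIERARCHY]
    totals = defaultdict(lambda: dict.fromkeys(samples, 0.0))
    for row in reader:
        group = row.get(level)
        if not group:
            continue  # hierarchy not defined for this row
        for sample in samples:
            totals[group][sample] += float(row[sample])
    return dict(totals)


# Made-up rows for illustration only.
example = io.StringIO(
    "ko\tsampleA\tsampleB\tlevel_1\tlevel_2\tlevel_3\n"
    "K00001\t5\t2\tMetabolism\tCarbohydrate metabolism\tGlycolysis\n"
    "K00002\t3\t1\tMetabolism\tLipid metabolism\tFatty acid degradation\n"
    "K99999\t4\t4\t\t\t\n"
)
print(sum_by_level(example))
```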

Combined

Per taxonomy+function assignments in tables named summaries/combined/<type>_<level>.txt contain:

Header ID	Definition
<type>	either KO, EC, or product; counts are summed using <type>+<taxonomy>
taxonomy_<level>	taxonomic level; counts are summed using <type>+<taxonomy>
sample names	non-normalized, per sample sum for this particular combination of function and taxonomy
level_1	KEGG hierarchy [level 1] if KO defined in first column
level_2	KEGG hierarchy [level 2] if KO defined in first column
level_3	KEGG hierarchy [level 3] if KO defined in first column

Downloads

ko.txt

phylum.txt

ko_phylum.txt

2018-06-28

PerSeq - Per sequence functional and taxonomic assignments