PerSeq - Per sequence functional and taxonomic assignments

Summary

Sequence Counts

Sample  Sequences  Pairs Joined  Join Rate  Average Insert  Unique  Clean  Assigned Function  Assigned Taxonomy  Assigned Both
S08        200000         54710    54.710%           164.9   49732  38306               2633                657            486
S016       200000         40933    40.933%           184.8   38873  34094               4067               1120            785
S017       200000         48086    48.086%           183.9   44870  37230               4431               1381            948
S030       200000         53921    53.921%           188.2   47044  39957               3725               1081            702
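
The Join Rate column can be reproduced from the Sequences and Pairs Joined columns if Sequences is read as counting R1 and R2 reads separately. That reading is an assumption inferred from the numbers in the table, not something the report states, and the function name `join_rate` is illustrative only. A minimal sketch:

```python
def join_rate(sequences, pairs_joined):
    """Percent of read pairs that were merged.

    Assumption: `sequences` counts R1 and R2 reads separately, so the
    number of read pairs is sequences / 2.
    """
    read_pairs = sequences / 2
    return 100.0 * pairs_joined / read_pairs


# (sample, Sequences, Pairs Joined) copied from the Sequence Counts table
rows = [
    ("S08", 200000, 54710),
    ("S016", 200000, 40933),
    ("S017", 200000, 48086),
    ("S030", 200000, 53921),
]

for sample, sequences, joined in rows:
    # Print the recomputed rate next to the sample name
    print(f"{sample}\t{join_rate(sequences, joined):.3f}%")
```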

Sequence Quality

Taxonomy by Count

Samples are sorted based on their Shannon index calculated from taxonomically annotated sequences. The order is most to least diverse.
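
The ordering described above can be sketched as follows. This is an illustration rather than the pipeline's own code: the function names and the input layout (a dict of per-sample taxon counts) are assumptions, and the natural logarithm is assumed for the index. The choice of log base does not change the sort order.

```python
import math


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h


def sort_by_diversity(sample_counts):
    """Return sample names ordered from most to least diverse.

    sample_counts: {sample_name: {taxon: count}} (hypothetical layout).
    """
    return sorted(
        sample_counts,
        key=lambda s: shannon_index(sample_counts[s].values()),
        reverse=True,
    )
```

A sample dominated by a single taxon has an index of 0 and sorts last; a sample with counts spread evenly over many taxa sorts first.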

Taxonomy by Percent

Methods

Paired-end sequences were evaluated for quality using VSEARCH [1]. Sequence reads were quality trimmed after successful merging using bbmerge [2]. Sequences were allowed to be extended by up to 300 bp during the merging process to account for non-overlapping R1 and R2 sequences (k=60 extend2=60 iterations=5 qtrim2=t). Merged sequences were deduplicated using the clumpify tool [2] and then, by default, filtered of PhiX and rRNA using bbsplit [2]. An arbitrary number of Name:FASTA pairs may be specified for the decontamination process. Functional annotation and taxonomic classification were performed following the decontamination step.

Functional Annotation

The blastx algorithm of DIAMOND [3] was used to align nucleotide sequences to the KEGG protein reference database [4], which consists of non-redundant, family-level fungal eukaryotes and genus-level prokaryotes (--strand=both --evalue 0.00001). The highest-scoring alignment per sequence was used for functional annotation.
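
Selecting the highest-scoring alignment per sequence can be sketched as below. The sketch assumes DIAMOND's default 12-column tabular output (query ID in the first column, bit score in the twelfth) and that "highest scoring" means highest bit score. Neither the output format nor the selection code used by the pipeline is stated in this report, and the function name `best_hits` is illustrative.

```python
import csv
import io


def best_hits(handle):
    """Keep the row with the highest bit score for each query sequence.

    Assumes tab-delimited BLAST-style tabular rows:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        query, bitscore = row[0], float(row[11])
        # Replace the stored hit only when this one scores strictly higher
        if query not in best or bitscore > best[query][1]:
            best[query] = (row, bitscore)
    return {query: row for query, (row, _) in best.items()}


# Made-up alignments for two reads; r1 has two candidate hits
example = (
    "r1\tgeneA\t90.0\t50\t5\t0\t1\t150\t1\t50\t1e-20\t95.0\n"
    "r1\tgeneB\t98.0\t50\t1\t0\t1\t150\t1\t50\t1e-28\t120.5\n"
    "r2\tgeneC\t75.0\t40\t10\t0\t1\t120\t1\t40\t1e-08\t60.0\n"
)
hits = best_hits(io.StringIO(example))
```

Ties are resolved in favor of the first row seen, which is an arbitrary choice made for this sketch.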

Taxonomic Annotation

K-mer-based taxonomic classification was performed on the merged reads using Kaiju [5] in greedy mode (-a greedy -E 0.05). NCBI's nr database [6], containing reference sequences for archaea, bacteria, viruses, fungi, and microbial eukaryotes, was used as the reference index for Kaiju.

References

  1. Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: a versatile open source tool for metagenomics. PeerJ. PeerJ Inc; 2016;4:e2584.
  2. Bushnell B. BBTools [Internet]. Available from: https://sourceforge.net/projects/bbmap/
  3. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. Nature Publishing Group; 2015;12:59–60.
  4. Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 2016;44:D457–62.
  5. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. Nature Publishing Group; 2016;7:11257.
  6. NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2018;46:D8–D13.

Execution Environment

channels:
    - bioconda
    - conda-forge
    - defaults
dependencies:
    - python=3.6
    - bbmap=37.99
    - click=6.7
    - diamond=0.9.21
    - kaiju=1.6.2
    - numpy
    - pandas=0.23.1
    - plotly=2.7.0
    - snakemake>=5.1.3
    - vsearch=2.6.0

Output

Classification Tables

Per sample classifications in tables/ contain:

Header ID                  Definition
aa_alignment_length        The length of the DIAMOND blastx hit
aa_percent_id              The percent ID of the DIAMOND blastx hit; could be used to increase post-processing stringency
ec                         Enzyme Commission number from KEGG; semicolon-delimited where there are multiple
ko                         KEGG entry ID
product                    KEGG gene ID and KEGG product, separated by a semicolon
read_id                    The sequence identifier (unique)
kaiju_alignment_length     The length of the Kaiju hit
kaiju_classification       The Kaiju classification in order of superkingdom, phylum, order, class, family, genus, species; "NA" for each taxonomic level not defined
blastx_lca_classification  The lowest common ancestor (LCA) result from the blastx HSPs
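
The aa_percent_id column can be used to raise stringency after the run. A minimal sketch follows, assuming the per-sample tables are tab-delimited with a header row; the delimiter, the 80% threshold, and the function name `filter_by_identity` are all assumptions chosen for illustration.

```python
import csv
import io


def filter_by_identity(handle, min_percent_id=80.0):
    """Yield table rows whose DIAMOND hit meets a minimum percent identity.

    Rows without a numeric aa_percent_id (for example, reads with no
    functional hit) are skipped, so taxonomy-only reads are dropped too.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        try:
            percent_id = float(row["aa_percent_id"])
        except (KeyError, ValueError, TypeError):
            continue
        if percent_id >= min_percent_id:
            yield row


# Made-up table with a subset of the columns listed above
example = (
    "read_id\taa_percent_id\tko\n"
    "r1\t95.2\tK00001\n"
    "r2\t61.0\tK00002\n"
    "r3\tNA\tNA\n"
)
kept = list(filter_by_identity(io.StringIO(example), min_percent_id=80.0))
```

Because the filter operates on per-read rows, the summary tables would need to be recomputed from the filtered rows to reflect the stricter threshold.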

Summary Tables

Taxonomy

Per taxonomy assignments in tables named summaries/taxonomy/<level>.txt contain:

Header ID         Definition
taxonomy_<level>  taxonomic level into which counts have been summed
sample names      non-normalized, per-sample sum at this taxonomic level
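
Because the per-sample sums are not normalized, comparing samples of different depth usually means converting counts to percentages first. A minimal sketch, assuming the table has already been loaded into a nested dict; the layout, the function name `to_percent`, and the taxon counts in the example are all made up for illustration.

```python
def to_percent(table):
    """Convert {taxon: {sample: count}} into {taxon: {sample: percent}}.

    Each sample's counts are divided by that sample's column total.
    A sample with a total of zero is reported as 0.0 throughout.
    """
    totals = {}
    for counts in table.values():
        for sample, n in counts.items():
            totals[sample] = totals.get(sample, 0) + n

    return {
        taxon: {
            sample: (100.0 * n / totals[sample]) if totals[sample] else 0.0
            for sample, n in counts.items()
        }
        for taxon, counts in table.items()
    }


# Hypothetical phylum-level counts for two samples
phylum_counts = {
    "Firmicutes": {"S08": 30, "S016": 0},
    "Proteobacteria": {"S08": 10, "S016": 0},
}
phylum_percent = to_percent(phylum_counts)
```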

Function

Per function assignments in tables named summaries/function/<type>.txt contain:

Header ID     Definition
<type>        either KO, EC, or product into which counts have been summed
sample names  non-normalized, per-sample sum for this particular functional group
level_1       KEGG hierarchy [level 1] if KO defined in first column
level_2       KEGG hierarchy [level 2] if KO defined in first column
level_3       KEGG hierarchy [level 3] if KO defined in first column

Combined

Per taxonomy+function assignments in tables named summaries/combined/<type>_<level>.txt contain:

Header ID         Definition
<type>            either KO, EC, or product; counts are summed using <type>+<taxonomy>
taxonomy_<level>  taxonomic level; counts are summed using <type>+<taxonomy>
sample names      non-normalized, per-sample sum for this particular combination of function and taxonomy
level_1           KEGG hierarchy [level 1] if KO defined in first column
level_2           KEGG hierarchy [level 2] if KO defined in first column
level_3           KEGG hierarchy [level 3] if KO defined in first column
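
Summing on <type>+<taxonomy> amounts to counting reads per (function, taxon) pair. The sketch below does this for one sample's classification rows. It assumes the kaiju_classification value is a semicolon-separated lineage in the rank order listed under Classification Tables, and that reads lacking either assignment are excluded. Both points, along with the function name `combined_counts`, are assumptions rather than documented behavior.

```python
from collections import Counter


def combined_counts(rows, function_key="ko", level_index=1):
    """Count reads per (function, taxon) pair for one sample.

    rows: iterable of dicts with `function_key` and "kaiju_classification".
    level_index: position of the rank in the lineage (0 = superkingdom,
    1 = phylum, following the order given for kaiju_classification).
    """
    counts = Counter()
    for row in rows:
        function = row.get(function_key)
        lineage = row.get("kaiju_classification")
        if not function or function == "NA" or not lineage:
            continue
        ranks = [rank.strip() for rank in lineage.split(";")]
        # Skip reads that are unclassified at the requested rank
        if level_index >= len(ranks) or ranks[level_index] in ("", "NA"):
            continue
        counts[(function, ranks[level_index])] += 1
    return counts


# Made-up rows for illustration
example_rows = [
    {"ko": "K00001", "kaiju_classification": "Bacteria; Firmicutes; NA"},
    {"ko": "K00001", "kaiju_classification": "Bacteria; Firmicutes; Bacillales"},
    {"ko": "K00002", "kaiju_classification": "Bacteria; NA; NA"},
    {"ko": "NA", "kaiju_classification": "Bacteria; Firmicutes; NA"},
]
ko_phylum = combined_counts(example_rows)
```

Repeating this per sample and joining the resulting counters on the (function, taxon) key would give one column per sample, matching the layout described above.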

Downloads

file1: ko.txt
file2: phylum.txt
file3: ko_phylum.txt
2018-06-28