Chapter 2. Upstream analysis

Chapter 2. Upstream analysis#

(From raw sequencing data to Feature counts matrix and Pathway counts matrix)

2.1 Basic Pipeline#

This chapter aims to guide you through the general pipeline of 16S analysis.

QIIME2 has provided a quite user-friendly documents, check:

https://docs.qiime2.org/2024.10/tutorials/overview/

Note

The tutorial’s command will depends on version: qiime2-2023.2. The code might be changed slightly when you use different version of QIIME2

2.1.1 Upstream: Importing data amd Examine the quality#

Import your sequencing data using qiime tools import. I hope you have already noticed that we should use manifest to input multiple sequencing data into qiime2.

Select the correct input type based on the tutorial. Check here.
Mind that your sample id or name should be the same as that in sample metadata (mentioned in chapter 1).

qiime tools import   --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest.tsv  \
  --output-path paired-end-demux.qza  \
  --input-format PairedEndFastqManifestPhred33V2

Tip

check the example of the manifest and sample-metadata from NEEN project in the link here

After importation, Examine the sequencing quality Interactive Quality Plot with qiime demux summarize:

# make a report to judge the base length to trim
qiime demux summarize --i-data paired-end-demux.qza --o-visualization paired-end-demux.qzv

2.1.2 Upstream: Adapter/Primer/Barcodes trimming#

The artificial sequence (Adapter/Primer/Barcodes) will significantly influence the results of the classifiers, therefore it is essential to check and trim the sequence. We will apply cutadapt’s QIIM2 plugin for this purpose.

qiime cutadapt trim-paired \                  # again, if you are single end data, change the command
    --p-cores 4 \
    --i-demultiplexed-sequences paired-end-demux.qza \
    --p-front-f GTGCCAGCMGCCGCGGTAA \         # refers to the primer sequence of forward
    --p-front-r GGACTACNNGGGTATCTAAT \        # refers to the primer sequence of reverse
    --p-error-rate 0.1 \                      # We only 0.1 rate of error
    --p-overlap 3 \
    --p-match-read-wildcards \
    --p-match-adapter-wildcards \             
    --p-discard-untrimmed \                   # If no primer detected, meanings there might be some errors
    --o-trimmed-sequences Primer_trimmed-seqs.qza

The barcode will also be trimed, as long as the primer was detected. Check the trimmed clean data again!

You can also check the quality again!

# make a report to judge the base length to trim
qiime demux summarize --i-data Primer_trimmed-seqs.qza --o-visualization Primer_trimmed-seqs.qzv

This summary offers visual insights into the distribution of sequence qualities at each position within the sequence data for the subsequent step in the pipeline. The sequence qualities guide decisions regarding certain sequence-processing parameters, including the truncation settings for the DADA2 denoising step.

2.1.3 Upstream: Denoise sequences, selecting sequence variants#

QIIME2 provided two different types of denoising strategy: ASVs and OTUs. Check chapter 3 if you interested in this strategy. DADA2 is generally suggested for ASVs denosing, while QIIME2 also provides debular for the same purpose. You can search for the comparison’s between these two tools.

Through qiime dada2 denoise-xx, we can use dada2 within the QIIME2 platform (xx depends whether your input file is paired-end or single-end).

This command enables us to eliminate low-quality regions from the sequences. It also facilitates the removal of primers prior to denoising. DADA2 necessitates the removal of primers to avoid false positive detection of chimeras due to primer degeneracy.

Important parameter:

–p-trunc-len-f : n truncates each forward read sequence at position n

–p-trunc-len-r : n truncates each reverse read sequence at position n

–p-trim-left-f : m trims off the first m bases of each forward read sequence

–p-trim-left-r : m trims off the first m bases of each reverse read sequence

Click demux.qzv for an example

Question

Based on the plots you see in demux.qzv, what values would you choose for --p-trunc-len and --p-trim-left in this case?

Answer

In this example, we see that the quality of the initial bases seems to be high, so we won’t trim any bases from the beginning of the sequences. The quality seems to drop off around position 120, so we’ll truncate our sequences at 120 bases.

Check the result from last step to set the parameter used in this step.

“TBD”: To be determined by your own results. In NEEN project, we set “TBD” as zero, as the quality of the sequence was generally good.
Mind that you can also trim the primer sequence through --p-trim-left. But it is not generally recommanded

qiime dada2 denoise-paired \
    --i-demultiplexed-seqs Primer_trimmed-seqs.qza \
    --p-trunc-len-f "TBD" \
    --p-trunc-len-r "TBD" \
    --p-trim-left-f "TBD" \
    --p-trim-left-r "TBD" \
    --p-n-threads 4 \
    --o-representative-sequences rep-seqs.qza \
    --o-table table.qza \
    --o-denoising-stats denoising-stats.qza # --o referes to output

After the output, you will get the feature table (ASVs table) and their corresponding sequence (.fasta). But it has also been compacted in the QIIME2 (.qza).

Try use qiime tools peek

The feature_table.qza is a FeatureTable[Frequency] QIIME 2 artifact that contains counts (frequencies) of each unique sequence in each sample in the dataset.
The sequence.qza is a FeatureData[Sequence] QIIME 2 artifact, which maps feature identifiers in the FeatureTable to the sequences they represent.

Try use qiime feature-table summarize

You can also mapping the feature’s sequence to NCBI database BLAST through feature-table tabulate-seqs. Have a try on the example here!

2.1.4 Upstream: Building Phylogenetic Tree#

We can build the phylogenetic tree based on the sequence difference among features. Phylogenetic diversity metrics like Faith’s pd, weighted and unweighted Unifrac distance require the phylogenetic tree to calculate.

qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences rep-seqs.qza \ 
--o-alignment aligned-rep-seqs.qza \
--o-masked-alignment masked-aligned-rep-seqs.qza \
--o-tree unrooted-tree.qza \
--o-rooted-tree rooted-tree.qza # rep-seqs.qza is sequence of the feature (ASVs)

2.1.5 Upstream: Taxonomic assignment#

ASVs can be used to estimate the community diversity, while, to gain more biological insights, we should be more interested in what species or genus are presented in our data. So, to do this kind of taxonomic assignment, we should have a (1) a reference database, (2) an algorithm that we used to classify our sequence based on reference database.

SILVA and greengene2 was the main database used in 16S.

In QIIME2, we have different types of pre-trained naive-bayes classifiers (click here ). Those classifiers were mainly trained with (1) full-length of 16S region. (2) V4 region of 16S.

As discussed before, the choice of region to amplify will influence the resolution of your taxonomic assignment. The use of classifiers trained with different region of sequence will also influence the resolution. The more specific, the better.

As NEEN project target the v4 region, therefore, we just use the pre-trained v4 region of greengene2 database

qiime feature-classifier classify-sklearn \
--i-classifier classifier.qza \ 
--i-reads rep-seqs.qza \           # sequence of the feature (ASVs)
--o-classification taxonomy.qza

Note

You can also train your own classifiers using qiime feature-classifier fit-classifier-naive-bayes. You may find more about how to train your own classifiers, here. This is the code that I wrote for another project of V3-V4 REGION using SILVA 138.1 database. It will mainly depends on qiime2 plug-in rescript . Check here for the official documents from the developer.

The developers of greengene2 also claimed in this website that using the de novo phylogeny algorithm will have higher resolution in species level.

More#

Check the official “Moving pictures tutorial”. Read From “Sequence quality control and feature table construction” to the end. You may find that QIIME2 could also be used to do alpha and beta diversity. However, I would still suggest to use R to do the downstream analysis including alpha and beta diversity.

Note

Check Points (End of the week)

Import?
Feature Table?
Taxonomic bar chart?

2.2 Functional esimation based on PICRUST2#

This chapter will introduce how can we use PICRUST2 to do functional estimation based on KEGG and Metacyc pathway.

2.2.1 Prepare for the PICRUST2: Installation#

The installation will vary depends on your computer systems. Check:

You can download picrust2 as the plugin of QIIME2: here
Or download picrust2 as a standalone version: here

Or you can directly use Galaxy for graphical interface, which is a web-based platform for accessible, reproducible, and scalable biomedical data analyses (free).

https://usegalaxy.eu/?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fpicrust2_pipeline%2Fpicrust2_pipeline%2F2.5.1 galaxy0

2.2.2 PICRUST2 full pipeline#

To use Galaxy platform for picrust2 standalone version to do prediction, you need to input

the feature-table.biom (a file format used to store ASV table in QIIME2)
dna-sequences.fasta (a fasta file indicating the sequence of different feature you extracted from previous steps),

While for QIIME2 plugin, it can directly input the

Feature table QIIME2 artifact (table.qza - ASV count table )
Feature sequence QIIME2 artifact (rep-seqs.qza - ASV representative sequences)

To get the dna-sequences.fasta AND feature-table.biom from the QIIME2 artifact

You can use qiime tools export

Note

This turtorial provided code for picrust2 version v2.5.2 or v2.5.3. The developer of picrust2 recently post an update of their software to 2.6.1 check here. Please check section 2.2.6 for additional code to transfer from 2.5 to 2.6 smoothly.

picrust2_pipeline.py -s dna-sequences.fasta \   #input fasta 
-i feature-table.biom \       # input feature table
-o picrust2_out_pipeline \    # output to the directed directory
-p 6 --stratified --remove_intermediate --verbose

check the tutorial here: picrust/picrust2

The key output files are:

EC_metagenome_out

Folder containing unstratified EC number metagenome predictions (pred_metagenome_unstrat.tsv.gz), sequence table normalized by predicted 16S copy number abundances (seqtab_norm.tsv.gz), and the per-sample NSTI values weighted by the abundance of each ASV (weighted_nsti.tsv.gz).

KO_metagenome_out

As EC_metagenome_out above, but for KO metagenomes.

pathways_out

Folder containing predicted pathway abundances and coverages per-sample, based on predicted EC number abundances.

2.2.3 Add description#

Also, they only output the pathway ID and gene ID. You can find the description file which describing the ID, normally under where you installed picrust2: picrust2/default_files/description_mapfiles/.

Or use add_descriptions.py from picrust2 to do this mission: picrust/picrust2

2.2.4 KEGG pathway estimation#

Please check your picrust2 version, if you are using picurst2 version > 2.6. Check section 2.2.6 for additional steps.

To estimate the KEGG pathway abundance from the KO_metagenome_out, we need to rerun the script using the mapfile from KEGG pathway to KO, like this:

pathway_pipeline.py -i picrust2_out_pipeline/KO_metagenome_out/pred_metagenome_contrib.tsv.gz \ 
    -o KEGG_pathways_out --no_regroup \
--map KEGG_pathways_to_KO.tsv  

This map file has been deposited here. Right click to Copy the link to gitzip and download it.

Please note that this mapfile was created through KEGG database, assessed from 2023-09-26.

2.2.5 Key limitations of PICRUST2#

Read Key limitations here

Note

Check Points (End of the week)

PATHWAY TABLE?
Description?

2.2.6 Updated Scripts for PICRUST2-2.6.X#

The author posted an updated version of PICRUST2 (2.6.x), which mainly update the database from PICRUSt2-oldIMG to PICRUSt2-SC.

there are some additional tips in order to transfer smoothly from picrust2.5 to picrust2.6

KEGG pathway estimation#

Before you run the previous code for picurst2-2.5.x, you need to run a small script prior to it. This script was used to fix the format issue for KO output present after the update.

Note

You can find the scripts here

python fix_picrust2_output.py [input_file] [output_file] [column_ind]
# Help:-----------
#    input_file : str
#        Path to input file (tsv or tsv.gz)
#    output_file : str
#        Path to output file (optional). If None, will overwrite input.
#    column_idx : numeric,
#        The column_idx to remove "KO:", for _contrib.tsv.gz, Use column_idx = 1; for _unstrat.tsv.gz, use column_idx = 0

For the easy cases, just run (make sure you have fix_picrust2_output.py in your current working directory, or you can set the directory of the script by yourself)

# step1:
python fix_picrust2_output.py picrust2_out_pipeline/KO_metagenome_out/pred_metagenome_contrib.tsv.gz picrust2_out_pipeline/KO_metagenome_out/pred_metagenome_contrib_fixed.tsv.gz 1

# step2:
pathway_pipeline.py -i picrust2_out_pipeline/KO_metagenome_out/pred_metagenome_contrib_fixed.tsv.gz \ 
    -o KEGG_pathways_out --no_regroup \
--map KEGG_pathways_to_KO.tsv  

Additional GO/CAZY database estimation#

GO is a well-known and comprehensive database for all the basic biological GENOMIC annotation, while CAZY database is specified for carbohydrate-active enzyme.

picrust2_pipeline.py -s dna-sequences.fasta \   #input fasta 
-i feature-table.biom \       # input feature table
-o picrust2_out_pipeline \    # output to the directed directory
--in_traits CAZY,EC,GO,KO \   # By default, it was set to KO and EC.
-p 6 --stratified --remove_intermediate --verbose

Other Analysis Platform: Galaxy#

Here, we offer some possible trouble shootings

Trouble Shooting for Galaxy#

1.Import your data into the Galaxy.

Dataset expects a single file containing the data, where as a collection like you are trying to upload is a directory(folder) of data.

Currently, we do not have a good instructions for data importing, but what I am pretty sure is you can not use manifest to input your data.

Possibility 1 (rule-based builder)

Put your data online first, and then follow the instructions in the link here, And check: the rule builder will be your friend at the moment:

Hands-on: Rule Based Uploader / Rule Based Uploader / Using Galaxy and Managing your Data

Hands-on: Rule Based Uploader: Advanced / Rule Based Uploader: Advanced / Using Galaxy and Managing your Data

Possibility 2 (Collection builder)

First your file should follow the naming format of Casava One Eight Single Lane Per Sample Directory Format, match the following regex: .+_.+_R[12]_001.fastq.gz. Example like: FMT.0093C_46_L001_R2_001.fastq.gz.

Click Upload Data, select Collection, if paired-end data change Collection Type to List of Pairs, and add the local or ftp files. Click Start, and then Build.

Possibility 3 (Local machine builder)

Otherwise, we would still suggest to use local computer with manifest to import the raw data into .qza and then upload it.

Find more:

2.I do not know how to cut the adapter from the single end sequence.

https://forum.qiime2.org/t/cut-adapt-for-single-end-sequences/18648

3.Can not find input biom for picrust2 after using qiime2 export.

There is a mismatch between the biom file name from the newest version of qiime2 (BIOMv2DirFmt) and picrust2.5.3 (BIOMv2). You can first download your exported biom file. And then upload it again to the Galaxy. Galaxy will auto detect the format which would be acceptable for picurst2.

4.Can not find the input KO_metagenome_out/pred_metagenome_contrib.tsv.gz in Galaxy after finish picrust2 full pipeline ?

Use the ... (Browse and upload your datasets) to find the hidden layer

Other Sequencing Platform: Pacbio CCS - Full length 16S#

We will provide some trouble shooting here

Trouble Shooting for Pacbio CCS#

1, Do not know how to input?

For those users sequenced with Pacbio CCS platform, you could import your reads as single end.

qiime tools import
--type 'SampleData[SequencesWithQuality]'
--input-path se-33-manifest
--output-path single-end-demux.qza
--input-format SingleEndFastqManifestPhred33V2

2.How to denoise, do we need to cut adapter/primer before denoising?

No, you do not need to cut adapter/primer before denoising.

Denoise your reads with qiime dada2 denoise-ccs, because the command use Pacbio error model, which is different from previous next generation sequencing platform, to denoise.

You should set the primer information to the command, otherwise the software can not find the target to denoise, and it will cut the adapter for you:

--p-front : Forward primer in 5’ to 3’. Example: 27F/1492R - AGAGTTTGATCMTGGCTCAG

--p-adapter: Reverse primer in 5’ to 3’. Example: 27F/1492R - TACGGYTACCTTGTTAYGACTT

Check the link. And the author has shown their github here

Please also noted that if you still can not get a satisfactory retained reads. It may mean that some of the sequencing providers do not provide a sequence with adequate quality. In that casem we may transfer to OTU clustering with --p-perc-identity 0.99. Also, you need to remove the primer and re-orient the ccs reads using dada2::removePrimers in R first.

Downstream analysis: Suggestions for beginners#

Check:

see qiime2R to input the qiime2 artifacts to phyloseq
see my tutorial (Chapter 3) for basic analysis using phyloseq
MicrobiotaProcess - Easy to use and impressive visualization.
microeco - For a better visualization and comprehensive microbial community analysis.

Metabolomics and integration

MetaboanalystR (https://www.metaboanalyst.ca/docs/RTutorial.xhtml)
Pathway analysis

ASVs#: Amplicon Sequence Variants (ASVs)
Map file#: In bioinformatics, we mainly called some files could be used to represent some relationships between objects as map files.

Chapter 2. Upstream analysis

Contents

Chapter 2. Upstream analysis#

2.1 Basic Pipeline#

2.1.1 Upstream: Importing data amd Examine the quality#

2.1.2 Upstream: Adapter/Primer/Barcodes trimming#

2.1.3 Upstream: Denoise sequences, selecting sequence variants#

2.1.4 Upstream: Building Phylogenetic Tree#

2.1.5 Upstream: Taxonomic assignment#

More#

2.2 Functional esimation based on PICRUST2#

2.2.1 Prepare for the PICRUST2: Installation#

2.2.2 PICRUST2 full pipeline#

2.2.3 Add description#

2.2.4 KEGG pathway estimation#

2.2.5 Key limitations of PICRUST2#

2.2.6 Updated Scripts for PICRUST2-2.6.X#

KEGG pathway estimation#

Additional GO/CAZY database estimation#

Other Analysis Platform: Galaxy#

Trouble Shooting for Galaxy#

Other Sequencing Platform: Pacbio CCS - Full length 16S#

Trouble Shooting for Pacbio CCS#

Downstream analysis: Suggestions for beginners#