The purpose of this post is to show you how to analyze 16S metabarcoding datasets (Illumina 16S V3-V4 region) from the command line with FROGS on the migale server and how to explore data in a BIOM
file with phyloseq.
FROGS [1] is a tool dedicated to metabarcoding data analysis, available on a Galaxy server and from command line. Phyloseq [2] is the R package of reference for dealing with metabarcoding data.
The following analyses have been performed on the Migale cluster migale.jouy.inrae.fr
and rstudio.migale.inrae.fr
.
You can easily reproduce the analyses if you have got an account on our infrastructure. If you are not familiar with the Migale infrastructure, you can read the dedicated post.
Here are the informations about tools, packages and databanks we will use in this tutorial:
Tool, package or databank | Version |
---|---|
sra-tools [3] | 2.10.1 |
FROGS [1] | 3.2.0 |
fastqc [4] | 0.11.8 |
multiqc [5] | 1.8 |
phyloseq [2] | 1.32.0 |
phyloseq.extended [6] | 0.1.1.3003 |
ggplot2 [7] | 3.3.2 |
Silva [8] | 138 |
The dataset we will analyze in this post is a part of the samples used in this publication. These are 16S rRNA amplicons of meat and seafood products, as well as synthetic communities, sequenced with the Illumina MiSeq sequencing technology.
The following table gives informations on samples, commonly refered to as metadata and stored in a metadata file:
Sample | Replicate | Run | BioSample | Amplicon | Experiment | EnvType |
---|---|---|---|---|---|---|
CS2_16S | 2 | SRR7127609 | SAMN09070431 | 16S V3-V4 | SRX4048655 | Poultry sausage |
CS3_16S | 3 | SRR7127610 | SAMN09070432 | 16S V3-V4 | SRX4048654 | Poultry sausage |
SF1_16S | 1 | SRR7127612 | SAMN09070433 | 16S V3-V4 | SRX4048653 | Salmon fillet |
SF2_16S | 2 | SRR7127613 | SAMN09070434 | 16S V3-V4 | SRX4048652 | Salmon fillet |
PS1_16S | 1 | SRR7127614 | SAMN09070427 | 16S V3-V4 | SRX4048651 | Pork sausage |
PS2_16S | 2 | SRR7127615 | SAMN09070428 | 16S V3-V4 | SRX4048650 | Pork sausage |
PS3_16S | 3 | SRR7127616 | SAMN09070429 | 16S V3-V4 | SRX4048649 | Pork sausage |
CS1_16S | 1 | SRR7127617 | SAMN09070430 | 16S V3-V4 | SRX4048648 | Poultry sausage |
SF3_16S | 3 | SRR7127619 | SAMN09070435 | 16S V3-V4 | SRX4048646 | Salmon fillet |
CF1_16S | 1 | SRR7127620 | SAMN09070436 | 16S V3-V4 | SRX4048645 | Cod fillet |
CF2_16S | 2 | SRR7127628 | SAMN09070437 | 16S V3-V4 | SRX4048637 | Cod fillet |
CF3_16S | 3 | SRR7127629 | SAMN09070438 | 16S V3-V4 | SRX4048636 | Cod fillet |
GB1_16S | 1 | SRR7127630 | SAMN09070439 | 16S V3-V4 | SRX4048635 | Ground beef |
GB2_16S | 2 | SRR7127631 | SAMN09070440 | 16S V3-V4 | SRX4048634 | Ground beef |
GB3_16S | 3 | SRR7127632 | SAMN09070441 | 16S V3-V4 | SRX4048633 | Ground beef |
MC1_16S | 1 | SRR7127633 | SAMN09070442 | 16S V3-V4 | SRX4048632 | Mock community |
MC2_16S | 2 | SRR7127634 | SAMN09070443 | 16S V3-V4 | SRX4048631 | Mock community |
MC3_16S | 3 | SRR7127635 | SAMN09070444 | 16S V3-V4 | SRX4048630 | Mock community |
MC4_16S | 4 | SRR7127636 | SAMN09070445 | 16S V3-V4 | SRX4048629 | Mock community |
MC5_16S | 5 | SRR7127637 | SAMN09070446 | 16S V3-V4 | SRX4048628 | Mock community |
The first step is to get the FASTQ
files, containing the sequencing data. In our case they are available on a public repository and we will need to download them thanks to their accession ID with sra-tools [3].
cd ~/work
mkdir -p FROGS_16S/LOGS
cd FROGS_16S
conda activate sra-tools-2.10.1
awk '{print "fasterq-dump --split-files --progress --force --outdir . --threads 1", $3}' <(grep SRR metadata.tsv) >> download.sh
bash download.sh
conda deactivate
Some steps are needed to use these FASTQ files as FROGS inputs. FROGS needs to know which files belong to the same samples. FROGS will search the patterns _R1.fastq
and _R2.fastq
. Moreoever, sample names are the characters preceeding _R1.fastq
and _R2.fastq
.
We have to rename files from: SRR7127616_1.fastq
and SRR7127616_2.fastq
to PS3_16S_R1.fastq
and PS3_16S_R2.fastq
.
Finally, we can compress them to save disk space.
The following commands will compress and add the expected tag to all files:
for i in *.fastq ; do gzip $i ; mv $i.gz $(echo $i | sed "s/_/_R/" ).gz ; done
The following command will rename files from informations present in the metadata file:
awk -F $'\t' '{id = $1 ; oldr1 = $3"_R1.fastq.gz" ; oldr2 = $3"_R2.fastq.gz" ; r1 = id"_R1.fastq.gz" ; r2 = id"_R2.fastq.gz" ; system("mv " oldr1 " " r1 ) ; system("mv " oldr2 " " r2 )}' <(grep SRR metadata.tsv)
We can check rapidly if R1 and R2 files have the same number of reads. If not, the files may be corrupted during the download process.
for i in *.fastq.gz ; do echo $i $(zcat $i | echo $((`wc -l`/4))) ; done
The number of reads varies from 18 890 reads to 112 853.
It is useful to keep track of the initial number of reads. We can add it to the metadata file:
head -n 1 metadata.tsv | tr -d "\n" > header.txt
echo -e "\tReads" >> header.txt
grep SRR metadata.tsv | sort -k1,1 > file1
awk -F $'\t' '{system("zcat " $1"_R1.fastq.gz | echo $((`wc -l`/4))" )}' file1 >> reads
cat header.txt <(paste file1 reads) >> metadata2.txt
rm file1 header.txt reads
We now use FastQC [4] and then MultiQC [5] to aggregate all reports into one.
mkdir FASTQC
for i in *.fastq.gz ; do echo "conda activate fastqc-0.11.8 && fastqc $i -o FASTQC && conda deactivate" >> fastqc.sh ; done
qarray -cwd -V -N fastqc -o LOGS -e LOGS fastqc.sh
qsub -cwd -V -N multiqc -o LOGS -e LOGS -b y "conda activate multiqc-1.8 && multiqc FASTQC -o MULTIQC && conda deactivate"
Let look at the HTML file produced by MultiQC. Some characteristics are important to note for metabarcoding data:
A good practice is to create an archive containing all FASTQ files. It is easier to manipulate than the 40 individual files.
tar zcvf data.tar.gz *.fastq.gz
Now FASTQ files can be deleted because they are stored in the archive.
rm -f *.fastq.gz
# To extract files:
# tar xzvf data.tar.gz
The knowledge of your data is essential. You have to answer the following questions to choose the parameters:
Here are the answers for this dataset:
Sequencing technology
Type of data
Amplicon expected length
Primers sequences
Reads size
During the preprocess, paired-end reads are merged, filtered on length (according to min and max) and removed if they contain ambigous bases. Finally sequences are dereplicated to keep only one copy of each sequence. Counts per sample of each unique sequence are stored in the count matrix.
mkdir FROGS
qsub -cwd -V -N preprocess -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-3.2.0 && preprocess.py illumina --input-archive data.tar.gz --min-amplicon-size 200 --max-amplicon-size 490 --merge-software pear --five-prim-primer ACGGRAGGCWGCAGT --three-prim-primer AGGATTAGATACCCTGGTA --R1-size 250 --R2-size 250 --nb-cpus 8 --output-dereplicated FROGS/preprocess.fasta --output-count FROGS/preprocess.tsv --summary FROGS/preprocess.html --log-file FROGS/preprocess.log && conda deactivate"
Let look at the HTML file produced by FROGS preprocess to check what happened.
qsub -cwd -V -N preprocess -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-3.2.0 && preprocess.py illumina --input-archive data.tar.gz --min-amplicon-size 420 --max-amplicon-size 470 --merge-software pear --five-prim-primer ACGGRAGGCWGCAGT --three-prim-primer AGGATTAGATACCCTGGTA --R1-size 250 --R2-size 250 --nb-cpus 8 --output-dereplicated FROGS/preprocess.fasta --output-count FROGS/preprocess.tsv --summary FROGS/preprocess.html --log-file FROGS/preprocess.log && conda deactivate"
Differences can be seen in the second HTML report.
Following FROGS guidelines, swarm [9] is used with d=1.
qsub -cwd -V -N clustering -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-3.2.0 && clustering.py --input-fasta FROGS/preprocess.fasta --input-count FROGS/preprocess.tsv --distance 1 --fastidious --nb-cpus 8 --log-file FROGS/clustering.log --output-biom FROGS/clustering.biom --output-fasta FROGS/clustering.fasta --output-compo FROGS/clustering_otu_compositions.tsv && conda deactivate"
qsub -cwd -V -N clusters_stats -o LOGS -e LOGS -b y "conda activate frogs-3.2.0 && clusters_stat.py --input-biom FROGS/clustering.biom --output-file FROGS/clusters_stats.html --log-file FROGS/clusters_stats.log && conda deactivate"
This report shows classical characteristics of OTUs built with swarm:
The chimera detection is performed with vsearch [10].
qsub -cwd -V -N chimera -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-3.2.0 && remove_chimera.py --input-fasta FROGS/clustering.fasta --input-biom FROGS/clustering.biom --non-chimera FROGS/remove_chimera.fasta --nb-cpus 8 --log-file FROGS/remove_chimera.log --out-abundance FROGS/remove_chimera.biom --summary FROGS/remove_chimera.html && conda deactivate"
This report shows classical results for chimera detection in 16S data:
We now apply filters to remove low-abundant OTUs that are likely to be chimeras or artifacts. We check also if some phiX sequences are still present. Low-abundant OTUs are difficult to estimate. Following FROGS guidelines, we choose 0.005% of overall abundance. More stringent filters, including filters based on the prevalence across samples, can be made later if needed.
qsub -cwd -V -N filters -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-3.2.0 && otu_filters.py --input-fasta FROGS/remove_chimera.fasta --input-biom FROGS/remove_chimera.biom --output-fasta FROGS/filters.fasta --nb-cpus 8 --log-file FROGS/filters.log --output-biom FROGS/filters.biom --summary FROGS/filters.html --excluded FROGS/filters_excluded.tsv --contaminant /db/frogs_databanks/contaminants/phi.fa --min-sample-presence 1 --min-abundance 0.00005 && conda deactivate"
This report allows to show the impact of our filters:
It is now time to give our OTUs a taxonomic affiliation. We use the latest available version of Silva [8] (v.138) among all databanks available in the dedicated repository: /db/frogs_databanks/assignation/
.
qsub -cwd -V -N affiliation -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-3.2.0 && affiliation_OTU.py --input-fasta FROGS/filters.fasta --input-biom FROGS/filters.biom --nb-cpus 8 --log-file FROGS/affiliation.log --output-biom FROGS/affiliation.biom --summary FROGS/affiliation.html --reference /db/frogs_databanks/assignation/silva_138_16S/silva_138_16S.fasta && conda deactivate"
This report shows that all OTUs were affiliated.
qsub -cwd -V -N affiliations_stats -o LOGS -e LOGS -b y "conda activate frogs-3.2.0 && affiliations_stat.py --input-biom FROGS/affiliation.biom --output-file FROGS/affiliations_stats.html --log-file FROGS/affiliations_stats.log --multiple-tag blast_affiliations --tax-consensus-tag blast_taxonomy --identity-tag perc_identity --coverage-tag perc_query_coverage && conda deactivate"
You can use the Krona output to explore the affiliation.
This report shows complementary informations about how OTUs were affiliated. We can see that the 175 OTUs have a blast coverage of 100% and a blast percentage identity > 99%. It can give you indications on supplementary filters to perform (remove OTUs with too low coverage…).
qsub -cwd -V -N biom_to_tsv -o LOGS -e LOGS -b y "conda activate frogs-3.2.0 && biom_to_tsv.py --input-biom FROGS/affiliation.biom --input-fasta FROGS/filters.fasta --output-tsv FROGS/affiliation.tsv --output-multi-affi FROGS/multi_aff.tsv --log-file FROGS/biom_to_tsv.log && conda deactivate"
Here are the results for our OTUs, quite a few of them are multi-affiliated:
FROGS uses blast tool against a reference databank to assign OTUs. Particularly with 16S amplicon data, different species can harbor a similar, or even identical, 16S sequence in the targeted region. This is a very common phenomenon which explains why 16S analyses often do not discriminate between species within the same Genus. FROGS gives you the ability to view the conflicting affiliations of a given OTU. These are called multi-affiliations. Here is an example of a multi-affiliation:
observation_name | blast_taxonomy | blast_subject | blast_perc_identity | blast_perc_query_coverage | blast_evalue | blast_aln_length |
---|---|---|---|---|---|---|
Cluster_4 | Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;Lactobacillus sakei | JX275803.1.1516 | 100.0 | 100.0 | 0.0 | 425 |
Cluster_4 | Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;Lactobacillus curvatus | KT351722.1.1500 | 100.0 | 100.0 | 0.0 | 425 |
Sometimes it is useful to modify a multi-affiliation:
observation_name | blast_taxonomy | blast_subject | blast_perc_identity | blast_perc_query_coverage | blast_evalue | blast_aln_length |
---|---|---|---|---|---|---|
Cluster_2 | Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Photobacterium;unknown species | FJ456537.1.1524 | 100.0 | 100.0 | 0.0 | 425 |
Cluster_2 | Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Photobacterium;unknown species | FJ456356.1.1570 | 100.0 | 100.0 | 0.0 | 425 |
Cluster_2 | Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Photobacterium;Photobacterium phosphoreum | AB681911.1.1470 | 100.0 | 100.0 | 0.0 | 425 |
Cluster_2 | Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Photobacterium;Photobacterium phosphoreum | AY341437.1.1467 | 100.0 | 100.0 | 0.0 | 425 |
Cluster_2 | Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Photobacterium;Photobacterium phosphoreum | X74687.1.1457 | 100.0 | 100.0 | 0.0 | 425 |
You can use a dedicated Shiny application to do this easily through a nice interface: https://shiny.migale.inrae.fr/app/affiliationexplorer
Here is a video example illustrating the use of the app on our dataset:
Phyloseq [2] is a R package dedicated to diversity analyses. It must be loaded in your R session prior to any analysis. You can do so using the following commands on the Migale Rstudio server: https://rstudio.migale.inrae.fr/.
To create a phyloseq object, we need the BIOM file, the metadata file and eventually a tree file (not generated here).
Go to your work directory:
library(phyloseq)
library(phyloseq.extended)
setwd("~/work/FROGS_16S")
biomfile <- "FROGS/affiliation.biom"
frogs <- import_frogs(biomfile, taxMethod = "blast")
metadata <- read.table("metadata2.txt", row.names = 1, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
sample_data(frogs) <- metadata
frogs
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 195 taxa and 20 samples ]
## sample_data() Sample Data: [ 20 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 195 taxa by 7 taxonomic ranks ]
Here are the number of sequences before (red) and after (blue) the bioinformatics analyses, without additional curation or normalization.
samples <- rownames(sample_data(frogs))
final <- sample_sums(frogs)
initial <- metadata$Reads
final <- as.vector(t(final))
df <- data.frame(initial,final,samples)
df <- melt(df, id.vars='samples')
ggplot(df, aes(x=samples, y=value, fill=variable)) +
geom_bar(stat='identity', position='dodge') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
From this object, you can apply a lot of functions to explore it. The full documentation is available here.
We can plot the composition of our samples at Phylum level with the function plot_bar()
, ordered by EnvType metadata:
plot_bar(frogs, fill="Phylum") + facet_wrap(~EnvType, scales= "free_x", nrow=1)
plot_composition()
function allows to plot relative abundances:
plot_composition(frogs, taxaRank1 = NULL, taxaSet1 = NULL, taxaRank2 = "Phylum", numberOfTaxa = 10) +
scale_fill_brewer(palette = "Paired") +
facet_grid(~EnvType, scales = "free_x", space = "free_x")
We can rarefy samples to get equal depths for all samples before computing diversity indices with function rarefy_even_depth()
.
frogs_rare <- rarefy_even_depth(frogs, rngseed = 20200831)
samples <- rownames(sample_data(frogs_rare))
non_rarefied <- sample_sums(frogs)
rarefied <- sample_sums(frogs_rare)
initial <- metadata$Reads
df <- data.frame(initial,non_rarefied,rarefied,samples)
df <- melt(df, id.vars='samples')
p <- ggplot(df, aes(x=samples, y=value, fill=variable)) +
geom_bar(stat='identity', position='dodge') +
ylab("sequences")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p
The rarefaction curves allow to see it too:
p <- ggrare(physeq = frogs, step = 100, se = FALSE, plot = FALSE) +
ggtitle("Rarefaction curves") +
aes(color = factor(EnvType)) +
facet_wrap(~EnvType, scales = "free_y")
## rarefying sample CF1_16S
## rarefying sample CF2_16S
## rarefying sample CF3_16S
## rarefying sample CS1_16S
## rarefying sample CS2_16S
## rarefying sample CS3_16S
## rarefying sample GB1_16S
## rarefying sample GB2_16S
## rarefying sample GB3_16S
## rarefying sample MC1_16S
## rarefying sample MC2_16S
## rarefying sample MC3_16S
## rarefying sample MC4_16S
## rarefying sample MC5_16S
## rarefying sample PS1_16S
## rarefying sample PS2_16S
## rarefying sample PS3_16S
## rarefying sample SF1_16S
## rarefying sample SF2_16S
## rarefying sample SF3_16S
p
You can use a dedicated Shiny dedicated called easy16S to explore your phyloseq object easily and rapidly.
Here is a demonstration on this dataset:
The R commands used to generate the figures are available thanks to the button Show code
. Once your figure is ready, you can copy the R instructions in your report for reproductibility and tweak it to adapt it to your needs.
You have learned how to run FROGS on the migale server and to explore the results with easy16S. A small fraction of FROGS tools are presented here and some can be usefull for your own data (demultiplexing, phylogenetic tree, additional filters…). If you have any questions, you can contact us at help-migale@inrae.fr or at frogs-support@inrae.fr for FROGS specific questions.
1. Escudié F, Auer L, Bernard M, Mariadassou M, Cauquil L, Vidal K, et al. FROGS: Find, Rapidly, OTUs with Galaxy Solution. Bioinformatics. 2018;34:1287–94. doi:10.1093/bioinformatics/btx791.
2. McMurdie PJ, Holmes S. Phyloseq: An r package for reproducible interactive analysis and graphics of microbiome census data. PloS one. 2013;8.
3. Team STD. SRA tools. http://ncbi.github.io/sra-tools/.
4. Andrews S. FastQC a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
5. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32:3047–8.
6. Mariadassou M. Phyloseq-extended: Various customs functions written to enhance the base functions of phyloseq. Most of them are used in the formation "métagénomique 16S" provided by the platforms migale and genotoul. https://github.com/mahendra-mariadassou/phyloseq-extended.
7. Wickham H. Ggplot2: Elegant graphics for data analysis. Springer-Verlag New York; 2016. https://ggplot2.tidyverse.org.
8. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, et al. SILVA: A comprehensive online resource for quality checked and aligned ribosomal rna sequence data compatible with arb. Nucleic acids research. 2007;35:7188–96.
9. Mahé F, Rognes T, Quince C, Vargas C de, Dunthorn M. Swarm v2: Highly-scalable and high-resolution amplicon clustering. PeerJ. 2015;3:e1420.
10. Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: A versatile open source tool for metagenomics. PeerJ. 2016;4:e2584.