The purpose of this post is to show you how to analyze 16S metabarcoding datasets (Illumina 16S V3-V4 region) from the command line with FROGS on the migale server and how to explore data in a BIOM file with phyloseq.
Introduction
FROGS is a tool dedicated to metabarcoding data analysis, available on a Galaxy
server and from command line. Phyloseq is the R package of reference for dealing with metabarcoding data.
The following analyses have been performed on the Migale infrastructure (front.migale.inrae.fr and rstudio.migale.inrae.fr). You can easily reproduce the analyses if you have got an account on our infrastructure. If you are not familiar with the Migale infrastructure, you can read the dedicated post.
This post is intended neither to provide an in-depth analysis of the dataset nor to answer biological questions. It is rather an illustration of the technical possibilities and various tools offered by the Migale infrastructure for this kind of data. Please be aware that the parameters of the tools used here are tailored to this particular dataset and should be adapted to your own needs.
The dataset we will analyze in this post is a part of the samples used in this publication . These are 16S rRNA amplicons of meat and seafood products, as well as synthetic communities, sequenced with the Illumina MiSeq sequencing technology.
The following table gives information on samples, commonly refered to as metadata and stored in a text file:
Sample |
Replicate |
Run |
BioSample |
Amplicon |
Experiment |
EnvType |
CS2_16S |
2 |
SRR7127609 |
SAMN09070431 |
16S V3-V4 |
SRX4048655 |
Poultry sausage |
CS3_16S |
3 |
SRR7127610 |
SAMN09070432 |
16S V3-V4 |
SRX4048654 |
Poultry sausage |
SF1_16S |
1 |
SRR7127612 |
SAMN09070433 |
16S V3-V4 |
SRX4048653 |
Salmon fillet |
SF2_16S |
2 |
SRR7127613 |
SAMN09070434 |
16S V3-V4 |
SRX4048652 |
Salmon fillet |
PS1_16S |
1 |
SRR7127614 |
SAMN09070427 |
16S V3-V4 |
SRX4048651 |
Pork sausage |
PS2_16S |
2 |
SRR7127615 |
SAMN09070428 |
16S V3-V4 |
SRX4048650 |
Pork sausage |
PS3_16S |
3 |
SRR7127616 |
SAMN09070429 |
16S V3-V4 |
SRX4048649 |
Pork sausage |
CS1_16S |
1 |
SRR7127617 |
SAMN09070430 |
16S V3-V4 |
SRX4048648 |
Poultry sausage |
SF3_16S |
3 |
SRR7127619 |
SAMN09070435 |
16S V3-V4 |
SRX4048646 |
Salmon fillet |
CF1_16S |
1 |
SRR7127620 |
SAMN09070436 |
16S V3-V4 |
SRX4048645 |
Cod fillet |
CF2_16S |
2 |
SRR7127628 |
SAMN09070437 |
16S V3-V4 |
SRX4048637 |
Cod fillet |
CF3_16S |
3 |
SRR7127629 |
SAMN09070438 |
16S V3-V4 |
SRX4048636 |
Cod fillet |
GB1_16S |
1 |
SRR7127630 |
SAMN09070439 |
16S V3-V4 |
SRX4048635 |
Ground beef |
GB2_16S |
2 |
SRR7127631 |
SAMN09070440 |
16S V3-V4 |
SRX4048634 |
Ground beef |
GB3_16S |
3 |
SRR7127632 |
SAMN09070441 |
16S V3-V4 |
SRX4048633 |
Ground beef |
MC1_16S |
1 |
SRR7127633 |
SAMN09070442 |
16S V3-V4 |
SRX4048632 |
Mock community |
MC2_16S |
2 |
SRR7127634 |
SAMN09070443 |
16S V3-V4 |
SRX4048631 |
Mock community |
MC3_16S |
3 |
SRR7127635 |
SAMN09070444 |
16S V3-V4 |
SRX4048630 |
Mock community |
MC4_16S |
4 |
SRR7127636 |
SAMN09070445 |
16S V3-V4 |
SRX4048629 |
Mock community |
MC5_16S |
5 |
SRR7127637 |
SAMN09070446 |
16S V3-V4 |
SRX4048628 |
Mock community |
Get and prepare data
The first step is to get the FASTQ
files, containing the sequencing data. In our case they are available on a public repository and we will need to download them thanks to their accession ID with sra-tools .
cd ~/work
mkdir -p FROGS_16S/LOGS
cd FROGS_16S
echo -e "Sample\tReplicate\tRun\tBioSample\tAmplicon\tExperiment\tEnvType
CS2_16S\t2\tSRR7127609\tSAMN09070431\t16S V3-V4\tSRX4048655\tPoultry sausage
CS3_16S\t3\tSRR7127610\tSAMN09070432\t16S V3-V4\tSRX4048654\tPoultry sausage
SF1_16S\t1\tSRR7127612\tSAMN09070433\t16S V3-V4\tSRX4048653\tSalmon fillet
SF2_16S\t2\tSRR7127613\tSAMN09070434\t16S V3-V4\tSRX4048652\tSalmon fillet
PS1_16S\t1\tSRR7127614\tSAMN09070427\t16S V3-V4\tSRX4048651\tPork sausage
PS2_16S\t2\tSRR7127615\tSAMN09070428\t16S V3-V4\tSRX4048650\tPork sausage
PS3_16S\t3\tSRR7127616\tSAMN09070429\t16S V3-V4\tSRX4048649\tPork sausage
CS1_16S\t1\tSRR7127617\tSAMN09070430\t16S V3-V4\tSRX4048648\tPoultry sausage
SF3_16S\t3\tSRR7127619\tSAMN09070435\t16S V3-V4\tSRX4048646\tSalmon fillet
CF1_16S\t1\tSRR7127620\tSAMN09070436\t16S V3-V4\tSRX4048645\tCod fillet
CF2_16S\t2\tSRR7127628\tSAMN09070437\t16S V3-V4\tSRX4048637\tCod fillet
CF3_16S\t3\tSRR7127629\tSAMN09070438\t16S V3-V4\tSRX4048636\tCod fillet
GB1_16S\t1\tSRR7127630\tSAMN09070439\t16S V3-V4\tSRX4048635\tGround beef
GB2_16S\t2\tSRR7127631\tSAMN09070440\t16S V3-V4\tSRX4048634\tGround beef
GB3_16S\t3\tSRR7127632\tSAMN09070441\t16S V3-V4\tSRX4048633\tGround beef
MC1_16S\t1\tSRR7127633\tSAMN09070442\t16S V3-V4\tSRX4048632\tMock community
MC2_16S\t2\tSRR7127634\tSAMN09070443\t16S V3-V4\tSRX4048631\tMock community
MC3_16S\t3\tSRR7127635\tSAMN09070444\t16S V3-V4\tSRX4048630\tMock community
MC4_16S\t4\tSRR7127636\tSAMN09070445\t16S V3-V4\tSRX4048629\tMock community
MC5_16S\t5\tSRR7127637\tSAMN09070446\t16S V3-V4\tSRX4048628\tMock community" >> metadata.tsv
conda activate sra-tools-2.11.0
awk '{print "fasterq-dump --split-files --progress --force --outdir . --threads 1", $3}' <(grep SRR metadata.tsv) >> download.sh
bash download.sh
conda deactivate
Some steps are needed to use these FASTQ
files as FROGS inputs. FROGS needs to know which files belong to the which samples. FROGS will search the patterns _R1.fastq and _R2.fastq. Moreoever, sample names are the characters preceeding _R1.fastq and _R2.fastq. We have to rename the files: from SRR7127616_1.fastq
and SRR7127616_2.fastq
to PS3_16S_R1.fastq
and PS3_16S_R2.fastq
. Finally, we can compress them to save disk space.
The following commands will compress and add the expected tag to all files:
for i in *.fastq ; do gzip $i;mv $i.gz $(echo $i | sed "s/_/_R/g").gz ; done
The following command will rename files from the metadata file:
awk -F $'\t' '{id = $1 ; oldr1 = $3"_R1.fastq.gz" ; oldr2 = $3"_R2.fastq.gz" ; r1 = id"_R1.fastq.gz" ; r2 = id"_R2.fastq.gz" ; system("mv " oldr1 " " r1 ) ; system("mv " oldr2 " " r2 )}' <(grep SRR metadata.tsv)
Quality control
This step is crucial. You have to assess the quality of your data to avoid (or at least understand) surprises in results.
We can check rapidly if R1 and R2 files have the same number of reads. If not, the files may be corrupted during the download process.
for i in *.fastq.gz ; do echo $i $(zcat $i | echo $((`wc -l`/4))) ; done
CF1_16S_R1.fastq.gz 40690
CF1_16S_R2.fastq.gz 40690
CF2_16S_R1.fastq.gz 18890
CF2_16S_R2.fastq.gz 18890
CF3_16S_R1.fastq.gz 66972
CF3_16S_R2.fastq.gz 66972
CS1_16S_R1.fastq.gz 87279
CS1_16S_R2.fastq.gz 87279
CS2_16S_R1.fastq.gz 95018
CS2_16S_R2.fastq.gz 95018
CS3_16S_R1.fastq.gz 66144
CS3_16S_R2.fastq.gz 66144
GB1_16S_R1.fastq.gz 58147
GB1_16S_R2.fastq.gz 58147
GB2_16S_R1.fastq.gz 43704
GB2_16S_R2.fastq.gz 43704
GB3_16S_R1.fastq.gz 112853
GB3_16S_R2.fastq.gz 112853
MC1_16S_R1.fastq.gz 40134
MC1_16S_R2.fastq.gz 40134
MC2_16S_R1.fastq.gz 26931
MC2_16S_R2.fastq.gz 26931
MC3_16S_R1.fastq.gz 51451
MC3_16S_R2.fastq.gz 51451
MC4_16S_R1.fastq.gz 79859
MC4_16S_R2.fastq.gz 79859
MC5_16S_R1.fastq.gz 80048
MC5_16S_R2.fastq.gz 80048
PS1_16S_R1.fastq.gz 90288
PS1_16S_R2.fastq.gz 90288
PS2_16S_R1.fastq.gz 106242
PS2_16S_R2.fastq.gz 106242
PS3_16S_R1.fastq.gz 81591
PS3_16S_R2.fastq.gz 81591
SF1_16S_R1.fastq.gz 63966
SF1_16S_R2.fastq.gz 63966
SF2_16S_R1.fastq.gz 81230
SF2_16S_R2.fastq.gz 81230
SF3_16S_R1.fastq.gz 82574
SF3_16S_R2.fastq.gz 82574
The number of reads varies from 18 890 to 112 853 reads.
It is useful to keep track of the initial number of reads. We can add it to the metadata file:
head -n 1 metadata.tsv | tr -d "\n" > header.txt
echo -e "\tReads" >> header.txt
grep SRR metadata.tsv | sort -k1,1 > file1
awk -F $'\t' '{system("zcat " $1"_R1.fastq.gz | echo $((`wc -l`/4))" )}' file1 >> reads
cat header.txt <(paste file1 reads) >> metadata2.txt
rm file1 header.txt reads
We now use FastQC
mkdir FASTQC
for i in *.fastq.gz ; do echo "conda activate fastqc-0.11.9 && fastqc $i -o FASTQC && conda deactivate" >> fastqc.sh ; done
qarray -cwd -V -N fastqc -o LOGS -e LOGS fastqc.sh
and then MultiQC to aggregate all reports into one.
qsub -cwd -V -N multiqc -o LOGS -e LOGS -b y "conda activate multiqc-1.12 && multiqc FASTQC -o MULTIQC && conda deactivate"
Let look at some particular graphics provided in the HTML report:
firefox --no-remote MULTIQC/multiqc_report.html
- Sequence Quality Histograms
- Per Sequence Quality Scores
- Per Base Sequence Content
- Sequence Length Distribution
Sequencing quality is good. Nothing wrong detected at this step !
FROGS
A good practice is now to create an archive containing all FASTQ files. It is easier to manipulate than the 40 individual files.
tar zcvf data.tar.gz *.fastq.gz
FASTQ
files can be deleted because they are stored in the archive.
rm -f *.fastq.gz
Preprocess
The knowledge of your data is essential. You have to answer the following questions to choose the parameters:
- Sequencing technology?
- Targeted region and the expected amplicon length?
- Have reads already been merged?
- Have primers already been deleted?
- What are the primers sequences?
Here are the answers for this dataset:
-
Sequencing technology
-
Type of data
-
Amplicon expected length
-
Primers sequences
-
Reads size
If one information is missing, you will have to find it to your colleagues or to the sequencing facility.
During the preprocess, paired-end reads are merged, filtered on length (according to min and max) and removed if they contain ambigous bases. Finally sequences are dereplicated to keep only one copy of each sequence. Counts per sample of each unique sequence are stored in the count matrix.
mkdir FROGS
qsub -cwd -V -N preprocess -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-4.0.1 && preprocess.py illumina --input-archive data.tar.gz --min-amplicon-size 200 --max-amplicon-size 490 --merge-software pear --five-prim-primer ACGGRAGGCWGCAGT --three-prim-primer AGGATTAGATACCCTGGTA --R1-size 250 --R2-size 250 --nb-cpus 8 --output-dereplicated FROGS/preprocess.fasta --output-count FROGS/preprocess.tsv --summary FROGS/preprocess.html --log-file FROGS/preprocess.log && conda deactivate"
The HTML file gives you very interesting information. You have to check if all is fine:
firefox --no-remote FROGS/preprocess.html
The above graphic shows that 89.48% of raw reads are kept. No overlap between reads was found for ~8% of reads (not dramatic).
The length distribution of sequences show that some sequences do not have the expected size. We can run this tool again by increasing min amplicon size and reducing max amplicon size. The aim is to remove too short sequences.
qsub -cwd -V -N preprocess -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-4.0.1 && preprocess.py illumina --input-archive data.tar.gz --min-amplicon-size 420 --max-amplicon-size 470 --merge-software pear --five-prim-primer ACGGRAGGCWGCAGT --three-prim-primer AGGATTAGATACCCTGGTA --R1-size 250 --R2-size 250 --nb-cpus 8 --output-dereplicated FROGS/preprocess.fasta --output-count FROGS/preprocess.tsv --summary FROGS/preprocess.html --log-file FROGS/preprocess.log && conda deactivate"
Now we are more precise:
Reads clustering
Following FROGS guidelines, swarm is used with d=1
.
qsub -cwd -V -N clustering -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-4.0.1 && clustering.py --input-fasta FROGS/preprocess.fasta --input-count FROGS/preprocess.tsv --distance 1 --fastidious --nb-cpus 8 --log-file FROGS/clustering.log --output-biom FROGS/clustering.biom --output-fasta FROGS/clustering.fasta --output-compo FROGS/clustering_otu_compositions.tsv && conda deactivate"
qsub -cwd -V -N clusters_stats -o LOGS -e LOGS -b y "conda activate frogs-4.0.1 && clusters_stat.py --input-biom FROGS/clustering.biom --output-file FROGS/clusters_stats.html --log-file FROGS/clusters_stats.log && conda deactivate"
firefox --no-remote FROGS/clusters_stats.html &
We see that we have built a lot of OTUs! More than 120,000 !
Surprisingly, it is an expected result. We will reduce this number later. Indeed, we can see that 88% of the OTUs are composed of only 1 sequence (singletons). The abundance filters will remove them.
Chimera removal
The chimera detection is performed with vsearch .
qsub -cwd -V -N chimera -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-4.0.1 && remove_chimera.py --input-fasta FROGS/clustering.fasta --input-biom FROGS/clustering.biom --non-chimera FROGS/remove_chimera.fasta --nb-cpus 8 --log-file FROGS/remove_chimera.log --out-abundance FROGS/remove_chimera.biom --summary FROGS/remove_chimera.html && conda deactivate"
firefox --no-remote FROGS/remove_chimera.html &
The report shows classical results for chimera detection in 16S data. ~10% of sequences (20% of OTUs) are chimeric and chimeric OTUs are made of few sequences.
Abundance and prevalence-based filters
We now apply filters to remove low-abundant OTUs that are likely to be chimeras or artifacts. We check also if some phiX sequences are still present. Low-abundant OTUs are difficult to estimate. Following FROGS guidelines, we choose 0.005% of overall abundance. More stringent filters, including filters based on the prevalence across samples, can be made later if needed.
qsub -cwd -V -N filters -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-4.0.1 && otu_filters.py --input-fasta FROGS/remove_chimera.fasta --input-biom FROGS/remove_chimera.biom --output-fasta FROGS/filters.fasta --nb-cpus 8 --log-file FROGS/filters.log --output-biom FROGS/filters.biom --summary FROGS/filters.html --excluded FROGS/filters_excluded.tsv --contaminant /db/outils/FROGS/contaminants/phi.fa --min-sample-presence 1 --min-abundance 0.00005 && conda deactivate"
firefox --no-remote FROGS/filters.html &
The report allows to show the impact of our filters.
180,483 OTUs are filtered out; 195 OTUs are kept! ~16% of sequences are lost but they mostly correspond to low-abundances OTUs, good news
Taxonomic affiliation
It is now time to give our OTUs a taxonomic affiliation. We use the latest available version of Silva (v.138.1) among all databanks available in the dedicated repository: /db/outils/FROGS/assignation/
.
qsub -cwd -V -N affiliation -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-4.0.1 && affiliation_OTU.py --input-fasta FROGS/filters.fasta --input-biom FROGS/filters.biom --nb-cpus 8 --log-file FROGS/affiliation.log --output-biom FROGS/affiliation.biom --summary FROGS/affiliation.html --reference /db/outils/FROGS/assignation/silva_138.1_16S_pintail100/silva_138.1_16S_pintail100.fasta && conda deactivate"
firefox --no-remote FROGS/affiliation.html &
The report gives important information.
First, all OTUs were affiliated . Nevertheless, ~85% of our sequences (~67 of the OTUs) are multi-affiliated at Species level. It means that there is an ambiguity at Species level (but it’s ok at Genus level). We will check later if you can improve it.
It is expected for 16S amplicon data!
The affiliations stats tool gives complementary information:
qsub -cwd -V -N affiliations_stats -o LOGS -e LOGS -b y "conda activate frogs-4.0.1 && affiliations_stat.py --input-biom FROGS/affiliation.biom --output-file FROGS/affiliations_stats.html --log-file FROGS/affiliations_stats.log --multiple-tag blast_affiliations --tax-consensus-tag blast_taxonomy --identity-tag perc_identity --coverage-tag perc_query_coverage && conda deactivate"
firefox --no-remote FROGS/affiliations_stats.html &
We can see that the 175 OTUs have a blast coverage of 100% and a blast percentage identity > 99%. It can give you indications on supplementary filters to perform (remove OTUs with too low coverage…).
We can also look at the compositions (gloabl or per sample) with the Krona charts:
Now, we can extract information of the BIOM file in a text file:
qsub -cwd -V -N biom_to_tsv -o LOGS -e LOGS -b y "conda activate frogs-4.0.1 && biom_to_tsv.py --input-biom FROGS/affiliation.biom --input-fasta FROGS/filters.fasta --output-tsv FROGS/affiliation.tsv --output-multi-affi FROGS/multi_aff.tsv --log-file FROGS/biom_to_tsv.log && conda deactivate"
We can see details of our OTUs:
head -n 3 FROGS/affiliation.tsv
#comment |
blast_taxonomy |
blast_subject |
blast_perc_identity |
blast_perc_query_coverage |
blast_evalue |
blast_aln_length |
seed_id |
seed_sequence |
observation_name |
observation_sum |
CF1_16S |
CF2_16S |
CF3_16S |
CS1_16S |
CS2_16S |
CS3_16S |
GB1_16S |
GB2_16S |
GB3_16S |
MC1_16S |
MC2_16S |
MC3_16S |
MC4_16S |
MC5_16S |
PS1_16S |
PS2_16S |
PS3_16S |
SF1_16S |
SF2_16S |
SF3_16S |
no data |
Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;Multi-affiliation |
multi-subject |
100.0 |
100.0 |
0.0 |
425 |
SRR7127620.9806 |
AGGGAATCTTCCACAATGGACGAAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAACTCTGTTGTTGGAGAAGAATGTATCTGATAGTAACTGATCAGGTAGTGACGGTATCCAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCTTCGGCTCAACCGAAGAAGTGCATCGGAAACTGGGAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCAAAC |
Cluster_1 |
298483 |
11 |
37 |
14 |
41033 |
35771 |
27580 |
817 |
30 |
467 |
9577 |
3415 |
157 |
19072 |
1310 |
55757 |
61185 |
39991 |
1385 |
18 |
856 |
no data |
Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Photobacterium;Multi-affiliation |
multi-subject |
100.0 |
100.0 |
0.0 |
425 |
SRR7127628.20 |
GGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGTTGTGAGGAAGGCGTTGGAGTTAATAGCTTCAGCGCTTGACGTTAGCAACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCATGCAGGCGGTCTGTTAAGCAAGATGTGAAAGCCCGGGGCTCAACCTCGGAACAGCATTTTGAACTGGCAGACTAGAGTCTTGTAGAGGGGGGTAGAATTTCAGGTGTAGCGGTGAAATGCGTAGAGATCTGAAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGATGCGAAAGCGTGGGGAGCAAAC |
Cluster_2 |
125178 |
1 |
7458 |
16946 |
17373 |
11485 |
1036 |
5 |
3 |
88 |
52 |
1233 |
4920 |
2 |
5562 |
0 |
1 |
0 |
887 |
56225 |
1901 |
We find in this file the taxonomic affiliation, the information about the best hit, the sequence of the OTU and the counts per sample.
Multi-affiliations
FROGS uses blast tool to align OTU sequence against a reference databank to give him a taxonomic affiliation. Particularly with 16S amplicon data, different species can harbor a very closed or even identical 16S sequence. This is a very common phenomenon which explains why 16S analyses often do not discriminate between species within the same Genus. FROGS gives you the ability to view the conflicting affiliations of a given OTU. These are called multi-affiliations. Here is an example of a multi-affiliation:
observation_name |
blast_taxonomy |
blast_subject |
blast_perc_identity |
blast_perc_query_coverage |
blast_evalue |
blast_aln_length |
Cluster_4 |
Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;Lactobacillus sakei |
JX275803.1.1516 |
100.0 |
100.0 |
0.0 |
425 |
Cluster_4 |
Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;Lactobacillus curvatus |
KT351722.1.1500 |
100.0 |
100.0 |
0.0 |
425 |
In this case, we can not choose between Lactobacillus sakei
and Lactobacillus curvatus
. We keep Multi-affiliation
at Species level.
Sometimes it is useful to modify a multi-affiliation:
observation_name |
blast_taxonomy |
blast_subject |
blast_perc_identity |
blast_perc_query_coverage |
blast_evalue |
blast_aln_length |
Cluster_2 |
Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Photobacterium;unknown species |
FJ456537.1.1524 |
100.0 |
100.0 |
0.0 |
425 |
Cluster_2 |
Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Photobacterium;unknown species |
FJ456356.1.1570 |
100.0 |
100.0 |
0.0 |
425 |
Cluster_2 |
Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Photobacterium;Photobacterium phosphoreum |
AB681911.1.1470 |
100.0 |
100.0 |
0.0 |
425 |
Cluster_2 |
Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Photobacterium;Photobacterium phosphoreum |
AY341437.1.1467 |
100.0 |
100.0 |
0.0 |
425 |
Cluster_2 |
Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Photobacterium;Photobacterium phosphoreum |
X74687.1.1457 |
100.0 |
100.0 |
0.0 |
425 |
In this case, we can change the affiliation with Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Photobacterium;Photobacterium phosphoreum
because there is not ambiguity within several species.
You can use a dedicated Shiny application to do this easily through a nice interface: https://shiny.migale.inrae.fr/app/affiliationexplorer
Diversity analyses
Phyloseq is a R package dedicated to diversity analyses. It must be loaded in your R session prior to any analysis. You can do so using the following commands on the Migale Rstudio server: https://rstudio.migale.inrae.fr/.
Make a phyloseq object
To create a phyloseq object, we need the BIOM file, the metadata file and eventually a tree file (not generated here).
Go to your work directory:
library(phyloseq)
library(phyloseq.extended)
setwd("~/work/FROGS_16S")
biomfile <- "FROGS/affiliation.biom"
frogs <- import_frogs(biomfile, taxMethod = "blast")
metadata <- read.table("metadata2.txt", row.names = 1, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
sample_data(frogs) <- metadata
frogs
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 195 taxa and 20 samples ]
## sample_data() Sample Data: [ 20 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 195 taxa by 7 taxonomic ranks ]
Here are the number of sequences before (red) and after (blue) the bioinformatics analyses, without additional curation or normalization.
samples <- rownames(sample_data(frogs))
final <- sample_sums(frogs)
initial <- metadata$Reads
final <- as.vector(t(final))
df <- data.frame(initial,final,samples)
df <- melt(df, id.vars='samples')
ggplot(df, aes(x=samples, y=value, fill=variable)) +
geom_bar(stat='identity', position='dodge') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Phyloseq functions
From this object, you can apply a lot of functions to explore it. The full documentation is available here.
We can plot the composition of our samples at Phylum level with the function plot_bar()
, ordered by EnvType
metadata:
plot_bar(frogs, fill="Phylum") + facet_wrap(~EnvType, scales= "free_x", nrow=1)
plot_composition()
function allows to plot relative abundances:
plot_composition(frogs, taxaRank1 = NULL, taxaSet1 = NULL, taxaRank2 = "Phylum", numberOfTaxa = 10) +
scale_fill_brewer(palette = "Paired") +
facet_grid(~EnvType, scales = "free_x", space = "free_x")
We can rarefy samples to get equal depths for all samples before computing diversity indices with function rarefy_even_depth()
.
frogs_rare <- rarefy_even_depth(frogs, rngseed = 20200831)
Be careful with rarefaction, in this case a lot of sequences are lost because the smallest sample has so few sequences. Always check sample depths before rarefaction! You can remove poorly sequenced samples.
samples <- rownames(sample_data(frogs_rare))
non_rarefied <- sample_sums(frogs)
rarefied <- sample_sums(frogs_rare)
initial <- metadata$Reads
df <- data.frame(initial,non_rarefied,rarefied,samples)
df <- melt(df, id.vars='samples')
p <- ggplot(df, aes(x=samples, y=value, fill=variable)) +
geom_bar(stat='identity', position='dodge') +
ylab("sequences")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p
The rarefaction curves allow to see it too:
p <- ggrare(physeq = frogs, step = 100, se = FALSE, plot = FALSE) +
ggtitle("Rarefaction curves") +
aes(color = factor(EnvType)) +
facet_wrap(~EnvType, scales = "free_y")
p
You can use our dedicated Shiny dedicated called easy16S to explore your phyloseq object easily and rapidly.
The R commands used to generate the figures are available thanks to the button Show code
. Once your figure is ready, you can copy the R instructions in your report for reproductibility and tweak it to adapt it to your needs.
Conclusions
You have learned how to run FROGS on the front
server and to explore the results with R or with easy16S. A small fraction of FROGS tools are presented here and some can be usefull for your own data (demultiplexing, phylogenetic tree, additional filters…). If you have any questions, you can contact us at help-migale@inrae.fr or at frogs-support@inrae.fr for FROGS specific questions.
tags:
tutoriel
Migale
Metabarcoding data analysis (16S rRNA) with FROGS
The purpose of this post is to show you how to analyze 16S metabarcoding datasets (Illumina 16S V3-V4 region) from the command line with FROGS on the migale server and how to explore data in a BIOM file with phyloseq.
Introduction
FROGS [1] is a tool dedicated to metabarcoding data analysis, available on a
Galaxy
server and from command line. Phyloseq [2] is the R package of reference for dealing with metabarcoding data.The following analyses have been performed on the Migale infrastructure (front.migale.inrae.fr and rstudio.migale.inrae.fr). You can easily reproduce the analyses if you have got an account on our infrastructure. If you are not familiar with the Migale infrastructure, you can read the dedicated post.
This post is intended neither to provide an in-depth analysis of the dataset nor to answer biological questions. It is rather an illustration of the technical possibilities and various tools offered by the Migale infrastructure for this kind of data. Please be aware that the parameters of the tools used here are tailored to this particular dataset and should be adapted to your own needs.
FROGS is also available on our Galaxy instance : galaxy.migale.inrae.fr
Metabarcoding data
The dataset we will analyze in this post is a part of the samples used in this publication [3]. These are 16S rRNA amplicons of meat and seafood products, as well as synthetic communities, sequenced with the Illumina MiSeq sequencing technology.
The following table gives information on samples, commonly refered to as metadata and stored in a text file:
Bioinformatics analyses
Get and prepare data
The first step is to get the
FASTQ
files, containing the sequencing data. In our case they are available on a public repository and we will need to download them thanks to their accession ID with sra-tools [4].cd ~/work mkdir -p FROGS_16S/LOGS cd FROGS_16S # Write metadata file echo -e "Sample\tReplicate\tRun\tBioSample\tAmplicon\tExperiment\tEnvType CS2_16S\t2\tSRR7127609\tSAMN09070431\t16S V3-V4\tSRX4048655\tPoultry sausage CS3_16S\t3\tSRR7127610\tSAMN09070432\t16S V3-V4\tSRX4048654\tPoultry sausage SF1_16S\t1\tSRR7127612\tSAMN09070433\t16S V3-V4\tSRX4048653\tSalmon fillet SF2_16S\t2\tSRR7127613\tSAMN09070434\t16S V3-V4\tSRX4048652\tSalmon fillet PS1_16S\t1\tSRR7127614\tSAMN09070427\t16S V3-V4\tSRX4048651\tPork sausage PS2_16S\t2\tSRR7127615\tSAMN09070428\t16S V3-V4\tSRX4048650\tPork sausage PS3_16S\t3\tSRR7127616\tSAMN09070429\t16S V3-V4\tSRX4048649\tPork sausage CS1_16S\t1\tSRR7127617\tSAMN09070430\t16S V3-V4\tSRX4048648\tPoultry sausage SF3_16S\t3\tSRR7127619\tSAMN09070435\t16S V3-V4\tSRX4048646\tSalmon fillet CF1_16S\t1\tSRR7127620\tSAMN09070436\t16S V3-V4\tSRX4048645\tCod fillet CF2_16S\t2\tSRR7127628\tSAMN09070437\t16S V3-V4\tSRX4048637\tCod fillet CF3_16S\t3\tSRR7127629\tSAMN09070438\t16S V3-V4\tSRX4048636\tCod fillet GB1_16S\t1\tSRR7127630\tSAMN09070439\t16S V3-V4\tSRX4048635\tGround beef GB2_16S\t2\tSRR7127631\tSAMN09070440\t16S V3-V4\tSRX4048634\tGround beef GB3_16S\t3\tSRR7127632\tSAMN09070441\t16S V3-V4\tSRX4048633\tGround beef MC1_16S\t1\tSRR7127633\tSAMN09070442\t16S V3-V4\tSRX4048632\tMock community MC2_16S\t2\tSRR7127634\tSAMN09070443\t16S V3-V4\tSRX4048631\tMock community MC3_16S\t3\tSRR7127635\tSAMN09070444\t16S V3-V4\tSRX4048630\tMock community MC4_16S\t4\tSRR7127636\tSAMN09070445\t16S V3-V4\tSRX4048629\tMock community MC5_16S\t5\tSRR7127637\tSAMN09070446\t16S V3-V4\tSRX4048628\tMock community" >> metadata.tsv conda activate sra-tools-2.11.0 awk '{print "fasterq-dump --split-files --progress --force --outdir . --threads 1", $3}' <(grep SRR metadata.tsv) >> download.sh bash download.sh conda deactivate
Some steps are needed to use these
FASTQ
files as FROGS inputs. FROGS needs to know which files belong to the which samples. FROGS will search the patterns _R1.fastq and _R2.fastq. Moreoever, sample names are the characters preceeding _R1.fastq and _R2.fastq. We have to rename the files: fromSRR7127616_1.fastq
andSRR7127616_2.fastq
toPS3_16S_R1.fastq
andPS3_16S_R2.fastq
. Finally, we can compress them to save disk space.The following commands will compress and add the expected tag to all files:
for i in *.fastq ; do gzip $i;mv $i.gz $(echo $i | sed "s/_/_R/g").gz ; done
The following command will rename files from the metadata file:
awk -F $'\t' '{id = $1 ; oldr1 = $3"_R1.fastq.gz" ; oldr2 = $3"_R2.fastq.gz" ; r1 = id"_R1.fastq.gz" ; r2 = id"_R2.fastq.gz" ; system("mv " oldr1 " " r1 ) ; system("mv " oldr2 " " r2 )}' <(grep SRR metadata.tsv)
Quality control
A dedicated tutorial for analyzing the sequencing quality of Illimina data is available here: https://tutorials.migale.inrae.fr/posts/quality-control/
This step is crucial. You have to assess the quality of your data to avoid (or at least understand) surprises in results.
We can check rapidly if R1 and R2 files have the same number of reads. If not, the files may be corrupted during the download process.
for i in *.fastq.gz ; do echo $i $(zcat $i | echo $((`wc -l`/4))) ; done
The number of reads varies from 18 890 to 112 853 reads.
It is useful to keep track of the initial number of reads. We can add it to the metadata file:
head -n 1 metadata.tsv | tr -d "\n" > header.txt echo -e "\tReads" >> header.txt grep SRR metadata.tsv | sort -k1,1 > file1 awk -F $'\t' '{system("zcat " $1"_R1.fastq.gz | echo $((`wc -l`/4))" )}' file1 >> reads cat header.txt <(paste file1 reads) >> metadata2.txt rm file1 header.txt reads
We now use FastQC [5]
mkdir FASTQC for i in *.fastq.gz ; do echo "conda activate fastqc-0.11.9 && fastqc $i -o FASTQC && conda deactivate" >> fastqc.sh ; done qarray -cwd -V -N fastqc -o LOGS -e LOGS fastqc.sh
and then MultiQC [6] to aggregate all reports into one.
qsub -cwd -V -N multiqc -o LOGS -e LOGS -b y "conda activate multiqc-1.12 && multiqc FASTQC -o MULTIQC && conda deactivate"
Let look at some particular graphics provided in the HTML report:
Sequencing quality is good. Nothing wrong detected at this step !
FROGS
A good practice is now to create an archive containing all FASTQ files. It is easier to manipulate than the 40 individual files.
FASTQ
files can be deleted because they are stored in the archive.rm -f *.fastq.gz # To extract files: # tar xzvf data.tar.gz
Preprocess
The knowledge of your data is essential. You have to answer the following questions to choose the parameters:
Here are the answers for this dataset:
Sequencing technology
Type of data
Amplicon expected length
Primers sequences
Reads size
If one information is missing, you will have to find it to your colleagues or to the sequencing facility.
During the preprocess, paired-end reads are merged, filtered on length (according to min and max) and removed if they contain ambigous bases. Finally sequences are dereplicated to keep only one copy of each sequence. Counts per sample of each unique sequence are stored in the count matrix.
mkdir FROGS qsub -cwd -V -N preprocess -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-4.0.1 && preprocess.py illumina --input-archive data.tar.gz --min-amplicon-size 200 --max-amplicon-size 490 --merge-software pear --five-prim-primer ACGGRAGGCWGCAGT --three-prim-primer AGGATTAGATACCCTGGTA --R1-size 250 --R2-size 250 --nb-cpus 8 --output-dereplicated FROGS/preprocess.fasta --output-count FROGS/preprocess.tsv --summary FROGS/preprocess.html --log-file FROGS/preprocess.log && conda deactivate"
The HTML file gives you very interesting information. You have to check if all is fine:
The above graphic shows that 89.48% of raw reads are kept. No overlap between reads was found for ~8% of reads (not dramatic).
The length distribution of sequences show that some sequences do not have the expected size. We can run this tool again by increasing min amplicon size and reducing max amplicon size. The aim is to remove too short sequences.
Now we are more precise:
Reads clustering
Following FROGS guidelines, swarm [7] is used with
d=1
.qsub -cwd -V -N clustering -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-4.0.1 && clustering.py --input-fasta FROGS/preprocess.fasta --input-count FROGS/preprocess.tsv --distance 1 --fastidious --nb-cpus 8 --log-file FROGS/clustering.log --output-biom FROGS/clustering.biom --output-fasta FROGS/clustering.fasta --output-compo FROGS/clustering_otu_compositions.tsv && conda deactivate" qsub -cwd -V -N clusters_stats -o LOGS -e LOGS -b y "conda activate frogs-4.0.1 && clusters_stat.py --input-biom FROGS/clustering.biom --output-file FROGS/clusters_stats.html --log-file FROGS/clusters_stats.log && conda deactivate"
We see that we have built a lot of OTUs! More than 120,000 !
Surprisingly, it is an expected result. We will reduce this number later. Indeed, we can see that 88% of the OTUs are composed of only 1 sequence (singletons). The abundance filters will remove them.
Chimera removal
The chimera detection is performed with vsearch [8].
qsub -cwd -V -N chimera -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-4.0.1 && remove_chimera.py --input-fasta FROGS/clustering.fasta --input-biom FROGS/clustering.biom --non-chimera FROGS/remove_chimera.fasta --nb-cpus 8 --log-file FROGS/remove_chimera.log --out-abundance FROGS/remove_chimera.biom --summary FROGS/remove_chimera.html && conda deactivate"
The report shows classical results for chimera detection in 16S data. ~10% of sequences (20% of OTUs) are chimeric and chimeric OTUs are made of few sequences.
Abundance and prevalence-based filters
We now apply filters to remove low-abundant OTUs that are likely to be chimeras or artifacts. We check also if some phiX sequences are still present. Low-abundant OTUs are difficult to estimate. Following FROGS guidelines, we choose 0.005% of overall abundance. More stringent filters, including filters based on the prevalence across samples, can be made later if needed.
qsub -cwd -V -N filters -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-4.0.1 && otu_filters.py --input-fasta FROGS/remove_chimera.fasta --input-biom FROGS/remove_chimera.biom --output-fasta FROGS/filters.fasta --nb-cpus 8 --log-file FROGS/filters.log --output-biom FROGS/filters.biom --summary FROGS/filters.html --excluded FROGS/filters_excluded.tsv --contaminant /db/outils/FROGS/contaminants/phi.fa --min-sample-presence 1 --min-abundance 0.00005 && conda deactivate"
The report allows to show the impact of our filters.
180,483 OTUs are filtered out; 195 OTUs are kept! ~16% of sequences are lost but they mostly correspond to low-abundances OTUs, good news
Taxonomic affiliation
It is now time to give our OTUs a taxonomic affiliation. We use the latest available version of Silva [9] (v.138.1) among all databanks available in the dedicated repository:
/db/outils/FROGS/assignation/
.qsub -cwd -V -N affiliation -o LOGS -e LOGS -pe thread 8 -R y -b y "conda activate frogs-4.0.1 && affiliation_OTU.py --input-fasta FROGS/filters.fasta --input-biom FROGS/filters.biom --nb-cpus 8 --log-file FROGS/affiliation.log --output-biom FROGS/affiliation.biom --summary FROGS/affiliation.html --reference /db/outils/FROGS/assignation/silva_138.1_16S_pintail100/silva_138.1_16S_pintail100.fasta && conda deactivate"
The report gives important information.
First, all OTUs were affiliated . Nevertheless, ~85% of our sequences (~67 of the OTUs) are multi-affiliated at Species level. It means that there is an ambiguity at Species level (but it’s ok at Genus level). We will check later if you can improve it.
It is expected for 16S amplicon data!
The affiliations stats tool gives complementary information:
qsub -cwd -V -N affiliations_stats -o LOGS -e LOGS -b y "conda activate frogs-4.0.1 && affiliations_stat.py --input-biom FROGS/affiliation.biom --output-file FROGS/affiliations_stats.html --log-file FROGS/affiliations_stats.log --multiple-tag blast_affiliations --tax-consensus-tag blast_taxonomy --identity-tag perc_identity --coverage-tag perc_query_coverage && conda deactivate"
We can see that the 175 OTUs have a blast coverage of 100% and a blast percentage identity > 99%. It can give you indications on supplementary filters to perform (remove OTUs with too low coverage…).
We can also look at the compositions (gloabl or per sample) with the Krona charts:
Now, we can extract information of the BIOM file in a text file:
qsub -cwd -V -N biom_to_tsv -o LOGS -e LOGS -b y "conda activate frogs-4.0.1 && biom_to_tsv.py --input-biom FROGS/affiliation.biom --input-fasta FROGS/filters.fasta --output-tsv FROGS/affiliation.tsv --output-multi-affi FROGS/multi_aff.tsv --log-file FROGS/biom_to_tsv.log && conda deactivate"
We can see details of our OTUs:
head -n 3 FROGS/affiliation.tsv
We find in this file the taxonomic affiliation, the information about the best hit, the sequence of the OTU and the counts per sample.
Multi-affiliations
FROGS uses blast tool to align OTU sequence against a reference databank to give him a taxonomic affiliation. Particularly with 16S amplicon data, different species can harbor a very closed or even identical 16S sequence. This is a very common phenomenon which explains why 16S analyses often do not discriminate between species within the same Genus. FROGS gives you the ability to view the conflicting affiliations of a given OTU. These are called multi-affiliations. Here is an example of a multi-affiliation:
In this case, we can not choose between
Lactobacillus sakei
andLactobacillus curvatus
. We keepMulti-affiliation
at Species level.Sometimes it is useful to modify a multi-affiliation:
In this case, we can change the affiliation with
Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Photobacterium;Photobacterium phosphoreum
because there is not ambiguity within several species.You can use a dedicated Shiny application to do this easily through a nice interface: https://shiny.migale.inrae.fr/app/affiliationexplorer
Diversity analyses
Phyloseq is a R package dedicated to diversity analyses. It must be loaded in your R session prior to any analysis. You can do so using the following commands on the Migale Rstudio server: https://rstudio.migale.inrae.fr/.
Make a phyloseq object
To create a phyloseq object, we need the BIOM file, the metadata file and eventually a tree file (not generated here).
Go to your work directory:
Here are the number of sequences before (red) and after (blue) the bioinformatics analyses, without additional curation or normalization.
Phyloseq functions
From this object, you can apply a lot of functions to explore it. The full documentation is available here.
We can plot the composition of our samples at Phylum level with the function
plot_bar()
, ordered byEnvType
metadata:plot_composition()
function allows to plot relative abundances:We can rarefy samples to get equal depths for all samples before computing diversity indices with function
rarefy_even_depth()
.Be careful with rarefaction, in this case a lot of sequences are lost because the smallest sample has so few sequences. Always check sample depths before rarefaction! You can remove poorly sequenced samples.
The rarefaction curves allow to see it too:
You can use our dedicated Shiny dedicated called easy16S to explore your phyloseq object easily and rapidly.
The R commands used to generate the figures are available thanks to the button
Show code
. Once your figure is ready, you can copy the R instructions in your report for reproductibility and tweak it to adapt it to your needs.Conclusions
You have learned how to run FROGS on the
front
server and to explore the results with R or with easy16S. A small fraction of FROGS tools are presented here and some can be usefull for your own data (demultiplexing, phylogenetic tree, additional filters…). If you have any questions, you can contact us at help-migale@inrae.fr or at frogs-support@inrae.fr for FROGS specific questions.Maria Bernard, Olivier Rué, Mahendra Mariadassou, Géraldine Pascal, FROGS: a powerful tool to analyse the diversity of fungi with special management of internal transcribed spacers, Briefings in Bioinformatics, 2021;, bbab318, https://doi.org/10.1093/bib/bbab318 ↩︎
McMurdie PJ, Holmes S. Phyloseq: An r package for reproducible interactive analysis and graphics of microbiome census data. PloS one. 2013;8. ↩︎
Poirier S, Rué O, Peguilhan R, Coeuret G, Zagorec M, Champomier-Vergès M-C, et al. (2018) Deciphering intra-species bacterial diversity of meat and seafood spoilage microbiota using gyrB amplicon sequencing: A comparative analysis with 16S rDNA V3-V4 amplicon sequencing. PLoS ONE 13(9): e0204629. https://doi.org/10.1371/journal.pone.0204629 ↩︎
Team STD. SRA tools. http://ncbi.github.io/sra-tools/. ↩︎
Andrews S. FastQC a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. ↩︎
Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32:3047–8. ↩︎
Mahé F, Rognes T, Quince C, Vargas C de, Dunthorn M. Swarm v2: Highly-scalable and high-resolution amplicon clustering. PeerJ. 2015;3:e1420. ↩︎
Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: A versatile open source tool for metagenomics. PeerJ. 2016;4:e2584. ↩︎
Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, et al. SILVA: A comprehensive online resource for quality checked and aligned ribosomal rna sequence data compatible with arb. Nucleic acids research. 2007;35:7188–96. ↩︎