This program reads two paired-end `fastq` files as input and filters them according to the following criteria:

- Presence of adapters (technical sequences).
- Contaminations, given in a `fasta` file or in an `idx` file created by `makeTree`, `makeBloom`.
- Low quality nucleotides.
- Presence of N's.

If one of the two reads is discarded, the corresponding paired read is automatically discarded.
Usage

C executable (in folder `bin`):
NOTE: the parameter `-l` or `--length` specifies the length of the reads in the input data. `trimFilterDS` also copes with data holding reads of different lengths. In that case, the length parameter must hold the length of the longest read in the dataset.
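When the reads have different lengths, the value to pass to `-l` has to be found first. The following is a minimal sketch of how that could be done; it is not part of the package. The helper name `max_read_length` is hypothetical, and the sketch assumes a gzipped FASTQ file with strict 4-line records (no wrapped sequence lines).

```python
import gzip


def max_read_length(fastq_gz_path):
    """Return the length of the longest read in a gzipped FASTQ file.

    Assumption: every record spans exactly 4 lines, so the sequence is
    always the 2nd line of a record.
    """
    longest = 0
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # sequence line of the record
                longest = max(longest, len(line.rstrip("\r\n")))
    return longest
```

For paired data, the value for `-l` would be the larger of the two results, one per input file.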
- `[O_PREFIX1 | O_PREFIX2]_good.fq.gz`: contains reads that passed all filters (maybe trimmed).
- `[O_PREFIX1 | O_PREFIX2]_adap.fq.gz`: contains reads discarded due to the presence of adapters.
- `[O_PREFIX1 | O_PREFIX2]_cont.fq.gz`: contains contamination reads.
- `[O_PREFIX1 | O_PREFIX2]_lowQ.fq.gz`: contains reads discarded due to low quality issues.
- `[O_PREFIX1 | O_PREFIX2]_NNNN.fq.gz`: contains reads discarded due to N's issues.
- `[O_PREFIX1 | O_PREFIX2]_summary.bin`: binary file where information about the filtering process is stored. Structure of the file:
  - `4*sizeof(int)` Bytes: array of `int` with entries `i = {ADAP(0), CONT(1), LOWQ(2), NNNN(3)}`. Each entry holds the code of the option applied for the corresponding filter, and 0 if the filter was not applied: `filters[ADAP] = {0,1}`, `filters[CONT] = {NO(0), TREE(1), BLOOM(2)}`, `filters[LOWQ] = {NO(0), ALL(1), ENDS(2), FRAC(3), ENDSFRAC(4), GLOBAL(5)}`, `filters[NNNN] = {NO(0), ALL(1), ENDS(2), STRIP(3)}`.
  - `4*sizeof(int)` Bytes: array of integers with entries `i = {ADAP(0), CONT(1), LOWQ(2), NNNN(3)}`, containing how many reads were trimmed due to the corresponding filter.
  - `4*sizeof(int)` Bytes: array of integers with entries `i = {ADAP(0), CONT(1), LOWQ(2), NNNN(3)}`, containing how many reads were discarded due to the corresponding filter.
  - `sizeof(int)` Bytes: number of accepted reads (maybe trimmed).
  - `sizeof(int)` Bytes: total number of reads.

Technical sequences within the reads are detected if the option `--adapters <ADAPTERS1.fa>:<ADAPTERS2.fa>:<mismatches>:<score>`
is given. The adapter sequences are read from the fasta files and then prepended to their respective reads. Then, a 'seed and extend' approach is used to look for overlaps, following the same rules as in the single-end case. See `README_trimFilter.md` for details on how two matching subsequences are detected and how the score is computed. The paired reads are correspondingly trimmed, removed or left as is. The following figure describes the possible cases:
Contaminations are removed if a fasta file or an index file is given as input. The methods provided to look for contaminations work exactly as they do for single-end data. If one of the reads is discarded, the other read is discarded as well. See `README_trimFilter.md` for more details on how contaminations are handled.
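The pairing rule can be illustrated with a short sketch. This is not the package's implementation: `filter_pairs` and the predicate `is_contamination` are hypothetical names, and the actual detection (TREE or BLOOM based) is not reproduced here. The sketch only shows that a pair is kept or discarded as a unit, so the two output files stay in sync.

```python
def filter_pairs(pairs, is_contamination):
    """Split (read1, read2) pairs into kept and discarded pairs.

    `is_contamination` is a hypothetical predicate standing in for the
    real contamination search; a pair is discarded as soon as either of
    its reads is flagged.
    """
    kept, discarded = [], []
    for read1, read2 in pairs:
        if is_contamination(read1) or is_contamination(read2):
            discarded.append((read1, read2))
        else:
            kept.append((read1, read2))
    return kept, discarded
```

With a toy predicate such as `lambda seq: "GGCC" in seq`, a pair whose second read contains `GGCC` is discarded even if its first read is clean.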
Again, the detection and trimming or removal of reads containing low quality nucleotides follows the same procedure as for single-end data. We list the options below; see `README_trimFilter.md` for more details.
- `--trimQ NO` or flag absent: nothing is done to the reads with low quality.
- `--trimQ ALL`: all reads containing at least one low quality nucleotide are redirected to `*_lowq.fq.gz`.
- `--trimQ ENDS`: look for low quality (below `MINQ`) base callings at the beginning and at the end of the read. Trim them at both ends until the quality is above the threshold. If the length of the remaining part is larger than `MINL`, keep the read in `*_good.fq.gz` and annotate in the fourth line where the read has been trimmed (counting from 0). Redirect the read to `*_lowq.fq.gz` otherwise.
- `--trimQ FRAC [--percent p]`: redirect the read to `*_lowq.fq.gz` if there are more than `p%` nucleotides whose quality lies below the threshold. `p=5` per default.
- `--trimQ ENDSFRAC --percent p`: first trim the ends as in the `ENDS` option. Accept the trimmed read if the number of low quality nucleotides does not exceed `p%` (default `p = 5`). Redirect the read to `*_lowq.fq.gz` otherwise.
- `--trimQ GLOBAL --global n1:n2`: cut all reads globally, `n1` nucleotides from the left and `n2` from the right.

Note: qualities are evaluated assuming the reads follow the L (Illumina 1.8+, Phred+33) convention; see Wikipedia. Adjust the values for a different convention.
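The `ENDS`, `FRAC` and `ENDSFRAC` rules above can be sketched as follows. This is an illustration under stated assumptions, not the C code of the package: all function names are hypothetical, `minq` and `minl` stand for `MINQ` and `MINL`, qualities are decoded as Phred+33, and the exact boundary comparisons (`<` versus `<=`) are a guess where the description leaves them open.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in qual]


def trim_ends(qual, minq, minl):
    """ENDS: return the kept window (start, end), end exclusive, or None.

    Bases with score < minq are trimmed from both ends. Assumption: the
    read is rejected when fewer than `minl` bases remain.
    """
    scores = phred33(qual)
    start, end = 0, len(scores)
    while start < end and scores[start] < minq:
        start += 1
    while end > start and scores[end - 1] < minq:
        end -= 1
    if end - start < minl:
        return None
    return start, end


def passes_frac(qual, minq, percent=5):
    """FRAC: True unless more than `percent`% of the bases are below minq."""
    scores = phred33(qual)
    low = sum(1 for score in scores if score < minq)
    return low * 100 <= percent * len(scores)


def ends_frac(qual, minq, minl, percent=5):
    """ENDSFRAC: trim the ends first, then apply FRAC to what remains."""
    window = trim_ends(qual, minq, minl)
    if window is None:
        return None
    start, end = window
    return window if passes_frac(qual[start:end], minq, percent) else None
```

For the quality string `##IIIIII#I` with `minq=20` and `minl=5`, `trim_ends` keeps positions 2 to 9; one of the eight remaining bases is still low quality, so `ends_frac` rejects the read at the default 5% but accepts it with `percent=20`.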
We allow for the following options for reads containing N's (see `README_trimFilter.md` for examples and more details):

- `--trimN NO` (or flag absent): nothing is done to the reads containing N's.
- `--trimN ALL`: all reads containing at least one N are redirected to `*_NNNN.fq.gz`.
- `--trimN ENDS`: N's are trimmed if found at the ends, left "as is" otherwise. If the trimmed read length is smaller than `MINL`, it is discarded.
- `--trimN STRIP`: obtain the largest N-free subsequence of the read. Accept it if it is longer than half of the original read length, redirect it to `*_NNNN.fq.gz` otherwise.

The examples in folder `examples/trimFilterDS_SReport/` work in the following way:
1. `fa_fq_files`. The files `EColi_rRNA_DS.read1.fq.gz` and `EColi_rRNA_DS.read2.fq.gz` were created with `create_fq.sh` and contain:
   - reads from `EColi_genome.fa` with NO errors.
   - reads from `rRNA_modified.fa` with NO errors (rRNA contaminations).
   - […] (`create_fq.sh`)
   - […] (`create_fq.sh`).
2. `run_example_TREE.sh`: the code was tested with flags: `../../bin/trimFilterDS -l 50 --ifq ../fa_fq_files/EColi_rRNA_DS.read1.fq.gz:../fa_fq_files/EColi_rRNA_DS.read2.fq.gz --method TREE --ifa ../fa_fq_files/rRNA_modified.fa:0.2:30 --trimQ ENDSFRAC --trimN ENDS -o treeDS --adapters ../fa_fq_files/ad_read1.fa:../fa_fq_files/ad_read2.fa:2:40`, i.e., we check for contaminations from rRNA, trim low quality bases at the ends of the reads and require less than 5% low quality bases in the remaining part, and trim N's found at the ends of the reads.
3. `run_example_BLOOM.sh`: `trimFilterDS` is run like in 2., but passing a bloom filter to look for contaminations with `score=0.4` and the `--trimN STRIP` option.
4. See `adapters` for examples on adapter contaminations (and its corresponding README file).

NOTE: `rRNA_modified.fa` is the `rRNA_CRUnit.fa` sequence, where we have removed the lines containing N's for testing purposes.
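After running the examples, the resulting `*_summary.bin` file can be inspected. The following is a minimal sketch based on the file structure described in the output list; it is not shipped with the package. The names `parse_summary` and `Summary` are hypothetical, and the sketch assumes a 4-byte `int` written in the byte order of the machine that reads the file.

```python
import struct
from collections import namedtuple

FILTERS = ("ADAP", "CONT", "LOWQ", "NNNN")
# 4 filter codes + 4 trimmed counts + 4 discarded counts + accepted + total.
# Assumption: sizeof(int) == 4, native byte order, no padding.
LAYOUT = "=14i"

Summary = namedtuple("Summary", "filters trimmed discarded accepted total")


def parse_summary(data):
    """Decode the raw bytes of a *_summary.bin file into a Summary."""
    expected = struct.calcsize(LAYOUT)
    if len(data) < expected:
        raise ValueError(f"expected at least {expected} bytes, got {len(data)}")
    values = struct.unpack(LAYOUT, data[:expected])
    return Summary(
        filters=dict(zip(FILTERS, values[0:4])),
        trimmed=dict(zip(FILTERS, values[4:8])),
        discarded=dict(zip(FILTERS, values[8:12])),
        accepted=values[12],
        total=values[13],
    )
```

Typical use would be `parse_summary(open(path, "rb").read())`, where `path` points to one of the summary files written by a run.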
Paula Pérez Rubio
GPL v3 (see LICENSE.txt)