My Project
|
This program reads a fastq
file as an input, and filters it according to the following criteria:
fasta
file or in an idx
file created by makeTree
, makeBloom
Usage C
executable (in folder bin
):
NOTE: the parameters -l or –length are meant to identify the length of the reads in the input data. Actually, trimFilter
also copes with data holding reads with different lengths. The length parameter must hold the length of the longest read in the dataset.
O_PREFIX_good.fq.gz
: contains reads that passed all filters (maybe trimmed).O_PREFIX_adap.fq.gz
: contains reads discarded due to the presence of adapters.O_PREFIX_cont.fq.gz
: contains contamination reads.O_PREFIX_lowQ.fq.gz
: contains reads discarded due to low quality issues.O_PREFIX_NNNN.fq.gz
: contains reads discarded due to N's issues.O_PREFIX_summary.bin
: binary file where information about the filtering process is stored. Structure of the file.4*sizeof(int) Bytes
: array of int with entries i = {ADAP(0), CONT(1), LOWQ(2), NNNN(3)}
. A given entry takes the value of the filter it was applied to and 0 otherwise. filters[ADAPT] = {0,1}
, filters[CONT] = {NO(0), TREE(1), BLOOM(2)}
, filters[LOWQ] = {NO(0), ALL(1), ENDS(2), FRAC(3), ENDSFRAC(4), GLOBAL(5)}
, filters[trimN] = {NO(0), ALL(1), ENDS(2), STRIPS(2)}
.4*sizeof(int) Bytes
: array of integers with entries i = {ADAP(0), CONT(1), LOWQ(2), NNNN(3)}, containing how many reads were trimmed due to the corresponding filter.4*sizeof(int) Bytes
: array of integers with entries i = {ADAP(0), CONT(1), LOWQ(2), NNNN(3)}, containing how many reads were discarded due to the corresponding filter.sizeof(int) Bytes
: number of accepted reads (maybe trimmed).sizeof(int) Bytes
: total number of reads.Technical sequences within the reads are detected if the option --adapters <ADAPTERS.fa>:<mismatches>:<score>
is given. The adapter(s) sequence(s) are read from the fasta file, and the search is done using an 'seed and extend' approach. It starts by looking for 16-nucleotides long seeds, for which a user defined number of mismatches is allowed (mismatches
). If found, a score is computed. If the score is larger than the user defined threshold (score
) and the number of matched nucleotides exceeds 12, then the read is trimmed if the remaining part is longer than MINL
(user defined) and discarded otherwise. If no 16-nucleotides long seeds are found, we proceed with 8-nucleotides long seeds and apply the same criteria to trim/discard a read. A list of possible situations follows, to illustrate how it works (MINL=25
, mismatches=2
):
The score is calculated as follows:
score += log_10(4)
score -= Q/10
, where Q is the quality score.Contaminations are removed if a fasta or an index file are given as an input. Three methods have been implemented to check for contaminations:
--ifa <INPUT.fa>:<score>:<lmer_len>
: the file INPUT.fa
is read and a tree of depth lmer_len
is constructed on the flight.--idx <INDEX_FILE>:<score>
: in INDEX_FILE
the tree structure was stored with ./makeTree
. A read will be considered to be a contamination if the score is bigger than score
. The score is computed as the proportion of Lmers of a given read found in the tree. This method is very fast, every search is O( 2 * Lmer * (L - Lmer + 1))
(the reverse complement has to be checked also), but has the drawback that is very memory intensive. This is why we have limited the construction of a tree to sequences whose size does not exceed 10 MB. Constructing a tree is very fast, that is why most of the times it might be not worth it to create the index file and then read it for every sample. Nevertheless, both options are provided.makeBloom
, so that it is computed only once for all samples. The index file name and the threshold score have to be specified through the following option:--idx <INDEX_FILE>:<score>
The score is computed as for the previous options.--trimQ NO
or flag absent: nothing is done to the reads with low quality.--trimQ ALL
: all reads containing at least one low quality nucleotide are redirected to *_lowq.fq.gz
--trimQ ENDS
: look for low quality (below MINQ) base callings at the beginning and at the end of the read. Trim them at both ends until the quality is above the threshold. Keep the read in *_good.fq.gz
and annotate in the fourth line where the read has been trimmed (starting to count from 0) if the length of the remaining part at least MINL
. Redirect the read to *_lowq.fq.gz
otherwise. Examples (-q 27 [<] -m 25): --trimQ FRAC [--percent p]
: redirect the read to *_lowq.fq.gz
if there are more than p%
nucleotides whose quality lies below the threshold. p=5
per default. Examples (-q 27 [<] -m 25 -p 5): --trimQ ENDSFRAC --percent p
: first trim the ends as in the ENDS
option. Accept the trimmed read if the number of low quality nucleotides does not exceed p%
(default p = 5
). Redirect the read to *_lowq.fq.gz
otherwise. Examples (-q 27 [<] -m 25 -p 5): Missing reads would land in *_lowq.fq.gz
. --trimQ GLOBAL --global n1:n2
: cut all reads globally n1
nucleotides from the left and n2
from the right.Note: qualities are evaluated assuming the reads to follow the L - Illumina 1.8+ Phred+33, convention, see Wikipedia. Adjust the values for a different convention.
We allow for the following options:
--trimN NO
(or flag absent): Nothing is done to the reads containing N's.--trimN ALL
: All reads containing at least one N are redirected to *_NNNN.fq.gz
--trimN ENDS
: N's are trimmed if found at the ends, left "as is" otherwise. If the trimmed read length is smaller than MINL, it is discarded. Example: --trimN STRIP
: Obtain the largest N free subsequence of the read. Accept it if is at least MINL nucleotides long, redirect it to *_NNNN.fq.gz
otherwise. Example: The examples in folder examples/trimFilter_SReport/
work in the following way:
examples/fa_fq_files
. The file EColi_rRNA.fq
was created with create_fq.sh
and contains: EColi_genome.fa
with NO errors.rRNA_modified.fa
with NO errrors (rRNA contaminations). create_fq.sh
)create_fq.sh
).adapter_even_long.fa
, adapter_odd_long.fa
, adapter_even_short.fa
, adapter_odd_short.fa
. Fasta files containing one adapter sequence each, longer/shorter than 16 nucleotides and with an even/odd length.human_[even/odd]_wad_[even/odd]_[long/short].fq
. Short fastq files where adapters contaminations have been inserted in all possible ways: even/odd positions, at the beginning/middle/end of the reads. Read lengths are even or odd as the first suffix indicates. The adapter contaminations included are suggested by the second even/odd suffix, and the long/short suffix.adapters/run_example.sh
: runs examples of reads containing adapters contaminations. A set of different possibilities is covered. See README file inside the folder adapters
run_example_TREE.sh
: the code was tested with flags: example_TREE*
run_example_BLOOM.sh
: rRNA_modified.fa
with FPR = 0.0075 and kmersize=25
. The output should coincide with rRNA_example.bf*
.score=0.4
.rRNA_modified.fa
is the rRNA_CRUnit.fa
sequence, where we have Paula Pérez Rubio
GPL v3 (see LICENSE.txt)