My Project
trimFilter user manual

This program reads a fastq file as an input, and filters it according to the following criteria:

Running the program

Usage C executable (in folder bin):

Usage: trimFilter --ifq <INPUT_FILE.fq> --length <READ_LENGTH>
q --output [O_PREFIX] --gzip [y|n]
--adapter [<ADAPTERS.fa>:<mismatches>:<score>]
--method [TREE|BLOOM]
(--idx [<INDEX_FILE>:<score>:<lmer_len>] |
--ifa [<INPUT.fa>:<score>:[lmer_len]])
--trimQ [NO|ALL|ENDS|FRAC|ENDSFRAC|GLOBAL]
--minL [MINL] --minQ [MINQ]
(--percent [percent] | --global [n1:n2])
--trimN [NO|ALL|ENDS|STRIP]
Reads in a fq file (gz, bz2, z formats also accepted) and removes:
* low quality reads,
* reads containing N base callings,
* reads representing contaminations, belonging to sequences in INPUT.fa
Outputs 4 [O_PREFIX]_fq.gz files containing: "good" reads, discarded
low Q reads discarded reads containing N's, discarded contaminations.
Options:
-v, --version prints package version.
-h, --help prints help dialog.
-f, --ifq fastq input file [*fq|*fq.gz|*fq.bz2], mandatory option.
-l, --length read length: length of the reads, mandatory option.
-o, --output output prefix (with path), optional (default ./out).
-z, --gzip gzips output files: yes or no (default yes).
-A, --adapter adapter input three fields separated by colons:
<ADAPTERS.fa>: fasta file containing adapters,
<mismatches>: maximum mismatch count allowed,
<score>: score threshold for the aligner.
-x, --idx index input file. To be included with any method. 3 fields
3 fields separated by colons:
<INDEX_FILE>: output of makeTree, makeBloom,
<score>: score threshold to accept a match [0,1],
[lmer_len]: correspond to the length of the lmers to be
looked for in the reads [1,READ_LENGTH].
-a, --ifa fasta input file. To be included only with method TREE
(it excludes the option --idx). Otherwise, an
index file has to be precomputed and given as parameter
(see option --idx). 3 fields separated by colons:
<INPUT.fa>: fasta input file [*fa|*fa.gz|*fa.bz2],
<score>: score threshold to accept a match [0,1],
<lmer_len>: depth of the tree: [1,READ_LENGTH]. It will
correspond to the length of the lmers to be
looked for in the reads.
-C, --method method used to look for contaminations:
TREE: uses a 4-ary tree. Index file optional,
BLOOM: uses a bloom filter. Index file mandatory.
-Q, --trimQ NO: does nothing to low quality reads (default),
ALL: removes all reads containing at least one low
quality nucleotide.
ENDS: trims the ends of the read if their quality is
below the threshold -q,
FRAC: discards a read if the fraction of bases whose
quality lies below
the threshold -q is over 5 percent or a user
defined percentage in -p.
ENDSFRAC: trims the ends and then discards the read if
there are more low quality nucleotides than the
allowed by the option -p.
GLOBAL: removes n1 cycles on the left and n2 on the
right, specified in -g.
All reads are discarded if they are shorter than MINL.
-m, --minL minimum length allowed for a read before it is discarded
(default 25).
-q, --minQ minimum quality allowed (int), optional (default 27).
-p, --percent percentage of low quality bases to be admitted before
discarding a read (default 5),
-g, --global required option if --trimQ GLOBAL is passed. Two int,
n1:n2, have to be passed specifying the number of cycles
to be globally cut from the left and right, respectively.
-N, --trimN NO: does nothing to reads containing N's,
ALL: removes all reads containing N's,
ENDS: trims ends of reads with N's,
STRIPS: looks for the largest substring with no N's.
All reads are discarded if they are shorter than minL.

NOTE: the parameters -l or –length are meant to identify the length of the reads in the input data. Actually, trimFilter also copes with data holding reads with different lengths. The length parameter must hold the length of the longest read in the dataset.

Output description

Filters

Adapters

Technical sequences within the reads are detected if the option --adapters <ADAPTERS.fa>:<mismatches>:<score> is given. The adapter(s) sequence(s) are read from the fasta file, and the search is done using an 'seed and extend' approach. It starts by looking for 16-nucleotides long seeds, for which a user defined number of mismatches is allowed (mismatches). If found, a score is computed. If the score is larger than the user defined threshold (score) and the number of matched nucleotides exceeds 12, then the read is trimmed if the remaining part is longer than MINL (user defined) and discarded otherwise. If no 16-nucleotides long seeds are found, we proceed with 8-nucleotides long seeds and apply the same criteria to trim/discard a read. A list of possible situations follows, to illustrate how it works (MINL=25, mismatches=2):

ADAPTER: GGCATACGAGCTCTTCCGATCT
REV_COM: AGATCGGAAGAGCTCGTATGCC
CASE1A: CACAGTCGATCAGCGAGCAGGCATTCATGCTGAGATCGGAAGAGATCGTATG
||||||||||||X|||----
AGATCGGAAGAGCTCGTATG
- Seed: 16 Nucleotides
- Return: trimmed, TRIMA:0:31
CASE1B: CACATCATCGCTAGCTATCGATCGATCGATGCTATGCAAGATCGGAAGAGCT
||||||||------
AGATCGGAAGAGCT
- Seed: 8 Nucleotides
- Return: trimmed, TRIMA:0:37
CASE1C: CACATCATCGCTAGCTATCGATCGATCGATGCTATGCACGAAGATCGGAAGA
||||||||---
AGATCGGAAGA
- Seed: 8 Nucleotides
- Return: nothing done, reason: Match length < 12
CASE2A: CATACATCACGAGCTAGCTAGAGATCGGAAGAGCTCGTATGCCCAGCATCGA
||||||||||||||||------
AGATCGGAAGAGCTCGTATGCC
- Seed: 16 Nucleotides
- Return: discarded, reason: remaining read too short.
CASE2B: CCACAGTACAATACATCACGAGCTAGCTAGAGATCGGAAGAGCTCGTATGCC
||||||||||||||||||||||
AGATCGGAAGAGCTCGTATGCC
- Seed: 16 Nucleotides
- Return: trimmed, TRIMA:0:28
CASE3A: TATGCCGTCTTCTGCTTGCAGTGCATGCTGATGCATGCTGCATGCTAGCTGC
||||||||||||||||--
TATGCCGTCTTCTGCTTG
- Seed: 16 Nucleotides
- Return: discarded, reason: remaining read too short
CASE3B: CGTCTTCTGCTTGCCGATCGATGCTAGCTACGATCGTCGAGCTAGCTACGTG
||||||||-----
CGTCTTCTGCTTG
- Seed: 8 Nucleotides
- Return: discarded, reason: remaining read too short
CASE3C: TCTTCTGCTTGCCGATCGATGCTAGCTACGATCGTCGAGCTAGCTACGTGCG
||||||||---
TCTTCTGCTTG
- Seed: 8 Nucleotides
- Return: nothing done, reason: Match length < 12

The score is calculated as follows:

Impurities

Contaminations are removed if a fasta or an index file are given as an input. Three methods have been implemented to check for contaminations:

LowQ

Note: qualities are evaluated assuming the reads to follow the L - Illumina 1.8+ Phred+33, convention, see Wikipedia. Adjust the values for a different convention.

N trimming

We allow for the following options:

Test/examples

The examples in folder examples/trimFilter_SReport/ work in the following way:

  1. See folder examples/fa_fq_files. The file EColi_rRNA.fq was created with create_fq.sh and contains:
    * 2e4 reads of length 50 from EColi_genome.fa with NO errors.
    • 5e3 reads of length 50 from rRNA_modified.fa with NO errrors (rRNA contaminations).
      * Artificially generated reads with low quality score (see create_fq.sh)
    • Artificially generated reads with Ns (see create_fq.sh).
    • Adapter files: adapter_even_long.fa, adapter_odd_long.fa, adapter_even_short.fa, adapter_odd_short.fa. Fasta files containing one adapter sequence each, longer/shorter than 16 nucleotides and with an even/odd length.
    • Example files to test the adapter contamination searchs: human_[even/odd]_wad_[even/odd]_[long/short].fq. Short fastq files where adapters contaminations have been inserted in all possible ways: even/odd positions, at the beginning/middle/end of the reads. Read lengths are even or odd as the first suffix indicates. The adapter contaminations included are suggested by the second even/odd suffix, and the long/short suffix.
  2. adapters/run_example.sh: runs examples of reads containing adapters contaminations. A set of different possibilities is covered. See README file inside the folder adapters
  3. run_example_TREE.sh: the code was tested with flags:
    $ ../../bin/trimFilter -l 50 --ifq PATH/TO/EColi_rRNA.fq.gz
    --method TREE --ifa PATH/TO/rRNA_modified.fa:0.9:50
    --trimQ ENDSFRAC --trimN STRIP -o tree
    i.e., we check for contaminations from rRNA, trim reads with lowQ at the ends and less than 5% in the remaining part, and strip reads containing N's. The output should coincide with the files example_TREE*
    4. run_example_BLOOM.sh:
    * bloom filter is generated for rRNA_modified.fa with FPR = 0.0075 and kmersize=25. The output should coincide with rRNA_example.bf*.
    • trimFilter is run like in 2. but passing a bloom filter to look for contaminations with score=0.4.
  4. With this set up, it is possible to run further customized tests.
    NOTE: rRNA_modified.fa is the rRNA_CRUnit.fa sequence, where we have
    removed the lines containing N's for testing purposes.

Contributors

Paula Pérez Rubio

License

GPL v3 (see LICENSE.txt)