FastqPuri
trim.c File Reference

Trims/filters sequences based on quality, N's and contaminations.

#include <string.h>
#include <stdlib.h>
#include "trim.h"
#include "str_manip.h"
#include "defines.h"
#include "config.h"
#include "struct_trimFilter.h"

Macros

#define TRIM_STRING   20
 

Functions

static int no_N (Fq_read *seq)
 checks if a sequence contains any non-standard base callings (N's)

static int Nfree_Lmer (Fq_read *seq, int minL)
 finds the largest N-free sub-seq and keeps it if larger than minL

static int Ntrim_ends (Fq_read *seq, int minL)
 trims a read if N's are at the ends and the remaining sub-seq >= minL

static int no_lowQ (Fq_read *seq, int minQ)
 checks if a sequence contains lowQ nucleotides

static int Qtrim_ends (Fq_read *seq, int minQ, int minL)
 trims a read if lowQs are at the ends and the remaining sub-seq >= minL

static int Qtrim_frac (Fq_read *seq, int minQ, int nlowQ)
 accepts the sequence as is if it contains fewer than nlowQ lowQ nucleotides

static int Qtrim_endsfrac (Fq_read *seq, int minQ, int minL, int nlowQ)

int Qtrim_global (Fq_read *seq, int left, int right, char type)
 trims left nucleotides from the left and right nucleotides from the right

static int align_uint32 (Fq_read *seq, Ad_seq *ptr_adap, bool all)
 alignment search between a fq read and an adapter sequence, with a seed of 8 nucleotides

static int align_uint64 (Fq_read *seq, Ad_seq *ptr_adap)
 alignment search between a fq read and an adapter sequence, with a seed of 16 nucleotides

int trim_adapter (Fq_read *seq, Ad_seq *adap_list)
 trims sequence based on the presence of adapter contamination

int trim_sequenceN (Fq_read *seq)
 trims sequence based on presence of N nucleotides

int trim_sequenceQ (Fq_read *seq)
 trims sequence based on lowQ base callings

bool is_read_inTree (Tree *tree_ptr, Fq_read *seq)
 checks if Lread is contained in tree. It computes the score for the read and its reverse complement; if one of them exceeds the user selected threshold, it returns true. Otherwise, it returns false.

bool is_read_inBloom (Bfilter *ptr_bf, Fq_read *seq, Bfkmer *ptr_bfkmer)
 checks if a read is in the Bloom filter. It computes the score for the read and returns true if it exceeds the user selected threshold. Returns false otherwise.
 

Variables

int Nencode
 
Iparam_trimFilter par_TF
 

Detailed Description

Trims/filters sequences based on quality, N's and contaminations.

Author
Paula Perez paula.nosp@m.pere.nosp@m.zrubi.nosp@m.o@gm.nosp@m.ail.c.nosp@m.om
Date
24.08.2017

Macro Definition Documentation

◆ TRIM_STRING

#define TRIM_STRING   20

maximal length of trimming info string.

Function Documentation

◆ align_uint32()

static int align_uint32 (Fq_read *seq, Ad_seq *ptr_adap, bool all)

alignment search between a fq read, and an adapter sequence, with a seed of 8 nucleotides.

This function checks whether there is adapter contamination in a given read. It works standalone if the adapter is shorter than 16 nucleotides, and is called from align_uint64 when no 16-nucleotide seeds are found. The criteria are the same as in align_uint64, except that the seed is 8 nucleotides long instead of 16. See the align_uint64 documentation for more details.

Parameters
seq       pointer to Fq_read
ptr_adap  pointer to Ad_seq
all       true if the whole read has to be swept, false if only the ends. When this function is called from align_uint64, only the ends need to be considered.
Returns
-1 error, 0 discarded, 1 accepted as is, 2 accepted and trimmed
Note
Global input parameters from par_TF are also used
See also
Adapter
Iparam_trimFilter
align_uint64
pack_adapter
obtain_score

◆ align_uint64()

static int align_uint64 (Fq_read *seq, Ad_seq *ptr_adap)

Alignment search between a fq read and an adapter sequence, with a seed of 16 nucleotides.

Parameters
seq       pointer to Fq_read
ptr_adap  pointer to Ad_seq
Returns
-1 error, 0 discarded, 1 accepted as is, 2 accepted and trimmed
Note
Global input parameters from par_TF are used
See also
Adapter
Iparam_trimFilter
align_uint32
pack_adapter
obtain_score

This function checks whether there is adapter contamination in a given read. It starts by looking for 16-nucleotide seeds, allowing a user-defined number of mismatches. If a seed is found, a score is computed. If the score is larger than the user-defined threshold and the number of matched nucleotides exceeds MIN_NMATCHES (12), then the read is trimmed if the remaining part is longer than minL (user-defined) and discarded otherwise. If no 16-nucleotide seeds are found, the search proceeds with 8-nucleotide seeds (see align_uint32) and applies the same criteria to trim or discard the read. The following list of possible situations illustrates how it works (minL=25, mismatches=2):

ADAPTER: CAAGCAGAAGACGGCATACGAG
REV_COM: AGATCGGAAGAGCTCGTATGCC
CASE1A:  CACAGTCGATCAGCGAGCAGGCATTCATGCTGAGATCGGAAGAGATCGTATG
                                         ||||||||||||X|||----
                                         AGATCGGAAGAGCTCGTATG
- Seed: 16 Nucleotides
- Return: 2, TRIMA:0:31
CASE1B:  CACATCATCGCTAGCTATCGATCGATCGATGCTATGCAAGATCGGAAGAGCT
                                               ||||||||------
                                               AGATCGGAAGAGCT
- Seed: 8 Nucleotides
- Return: 2, TRIMA:0:37
CASE1C:  CACATCATCGCTAGCTATCGATCGATCGATGCTATGCACGAAGATCGGAAGA
                                                  ||||||||---
                                                  AGATCGGAAGA
- Seed: 8 Nucleotides
- Return: 1, reason: Match length < 12
CASE2A:  CATACATCACGAGCTAGCTAGAGATCGGAAGAGCTCGTATGCCCAGCATCGA
                              ||||||||||||||||------
                              AGATCGGAAGAGCTCGTATGCC
- Seed: 16 Nucleotides
- Return: 0, reason: remaining read too short
CASE2B:  CCACAGTACAATACATCACGAGCTAGCTAGAGATCGGAAGAGCTCGTATGCC
                                       ||||||||||||||||||||||
                                       AGATCGGAAGAGCTCGTATGCC
- Seed: 16 Nucleotides
- Return: 2, TRIMA:0:28
CASE3A:  TATGCCGTCTTCTGCTTGCAGTGCATGCTGATGCATGCTGCATGCTAGCTGC
         ||||||||||||||||--
         TATGCCGTCTTCTGCTTG
- Seed: 16 Nucleotides
- Return: 0, reason: remaining read too short
CASE3B:  CGTCTTCTGCTTGCCGATCGATGCTAGCTACGATCGTCGAGCTAGCTACGTG
         ||||||||-----
         CGTCTTCTGCTTG
- Seed: 8 Nucleotides
- Return: 0, reason: remaining read too short
CASE3C:  TCTTCTGCTTGCCGATCGATGCTAGCTACGATCGTCGAGCTAGCTACGTGCG
         ||||||||---
         TCTTCTGCTTG
- Seed: 8 Nucleotides
- Return: 1, reason: Match length < 12
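
The accept/trim/discard rule applied once an alignment has been found can be sketched as below. This is only an illustration under stated assumptions, not the code in trim.c: adapter_decision and its parameters are hypothetical names, the score is treated as an opaque number, and the boundary behaviour (a match of exactly 12 nucleotides, a remaining length of exactly minL) is a guess based on the wording and the cases above.

```c
#define MIN_NMATCHES 12   /* value quoted in the description above */

/* Hypothetical helper (not part of trim.c).
 *   pos       0-based position in the read where the adapter match starts,
 *             which is also the length of the read kept after trimming
 *   nmatches  number of matched nucleotides
 *   score     alignment score (how it is computed is not shown here)
 *   threshold user-defined score threshold
 *   minL      user-defined minimum read length
 * Returns 0 discarded, 1 accepted as is, 2 accepted and trimmed. */
int adapter_decision(int pos, int nmatches, double score,
                     double threshold, int minL)
{
    /* Assumption: matches shorter than MIN_NMATCHES are ignored,
     * following the "Match length < 12" cases. */
    if (score <= threshold || nmatches < MIN_NMATCHES)
        return 1;
    /* Assumption: the remaining part must be strictly longer than minL. */
    if (pos > minL)
        return 2;
    return 0;
}
```

With minL = 25 and a score above the threshold, an adapter starting at position 32 with 16 matched nucleotides (as in CASE1A) gives 2, one starting at position 21 (CASE2A) gives 0, and an 11-nucleotide match (CASE1C) gives 1.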

◆ is_read_inBloom()

bool is_read_inBloom (Bfilter *ptr_bf, Fq_read *seq, Bfkmer *ptr_bfkmer)

checks if a read is in the Bloom filter. It computes the score for the read and returns true if it exceeds the user selected threshold. Returns false otherwise.

Parameters
ptr_bf      pointer to Bfilter
seq         fastq read
ptr_bfkmer  pointer to Procs_kmer structure (will store global)
Returns
true if read was found, false otherwise

◆ is_read_inTree()

bool is_read_inTree (Tree *tree_ptr, Fq_read *seq)

checks if Lread is contained in tree. It computes the score for the read and its reverse complement; if one of them exceeds the user selected threshold, it returns true. Otherwise, it returns false.

Parameters
tree_ptr  pointer to Tree structure
seq       fastq read
Returns
true if read was found, false otherwise

◆ Nfree_Lmer()

static int Nfree_Lmer (Fq_read *seq, int minL)

Finds the largest N-free sub-seq and keeps it if larger than minL.

Parameters
seq   fastq read
minL  minimum accepted trimmed length
Returns
0 if not used, 1 if accepted as is, 2 if accepted and trimmed

◆ no_lowQ()

static int no_lowQ (Fq_read *seq, int minQ)

checks if a sequence contains lowQ nucleotides

Parameters
seq   fastq read
minQ  minimum accepted quality value
Returns
0 if seq contains lowQ nucleotides, 1 otherwise

◆ no_N()

static int no_N (Fq_read *seq)

checks if a sequence contains any non standard base callings (N's)

Returns
1 if no N's found, 0 if N's found

This function checks if any of the base callings in a given fastq read is different from A, C, G, T. Basically, any char different from the former ones is classified as N.
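
The classification rule can be illustrated with a small sketch on a plain character buffer. Everything here is hypothetical and not taken from trim.c: encode_base, count_N and N_CODE are illustrative names, the A/C/G/T to 0..3 order is an assumption, and N_CODE only mirrors the \004 value documented for Nencode. Lower-case letters are treated as N in this sketch, which may differ from the real code.

```c
#include <stddef.h>

#define N_CODE 4   /* mirrors the documented Nencode value (\004) */

/* Hypothetical encoder: A, C, G, T map to 0..3 (the order is an
 * assumption); every other character is classified as N. */
int encode_base(char c)
{
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return N_CODE;
    }
}

/* Counts the base callings in bases[0 .. len-1] that are not A, C, G or T. */
size_t count_N(const char *bases, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (encode_base(bases[i]) == N_CODE)
            n++;
    return n;
}
```

A read passes an "N-free" check in this sketch when count_N returns 0.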

◆ Ntrim_ends()

static int Ntrim_ends (Fq_read *seq, int minL)

trims a read if N's are at the ends and the remaining sub-seq >= minL

Parameters
seq   fastq read
minL  minimum accepted trimmed length
Returns
0 if not used, 1 no N's found, 2 if accepted and trimmed

◆ Qtrim_ends()

static int Qtrim_ends (Fq_read *seq, int minQ, int minL)

trims a read if lowQs are at the ends and the remaining sub-seq >= minL

Parameters
seq   fastq read
minQ  minimum accepted quality value
minL  minimum accepted trimmed length
Returns
0 if not used, 1 if accepted as is, 2 if accepted and trimmed

◆ Qtrim_frac()

static int Qtrim_frac (Fq_read *seq, int minQ, int nlowQ)

accepts the sequence as is if it contains fewer than nlowQ lowQ nucleotides

Parameters
seq    fastq read
minQ   minimum accepted quality value
nlowQ  threshold on the number of lowQ nucleotides (>= nlowQ is NOT allowed)
Returns
0 if not used, 1 if accepted as is

◆ Qtrim_global()

int Qtrim_global (Fq_read *seq, int left, int right, char type)

trims left nucleotides from the left and right nucleotides from the right

Parameters
seq    fastq read
left   number of nucleotides to be trimmed from the left
right  number of nucleotides to be trimmed from the right
type   char indicating the type of trimming (Q, A)
Returns
2, since all reads are accepted and trimmed

◆ trim_adapter()

int trim_adapter (Fq_read *seq, Ad_seq *adap_list)

trims sequence based on the presence of adapter contamination

if (adapter length < 16) -> search for seeds 8 nucleotides long
else -> search for seeds 16 nucleotides long
    if (seed found) -> calculate score
        if (score > threshold) -> adapter found, trim / discard and exit
    else -> search for seeds 8 nucleotides long

Parameters
seq        pointer to Fq_read
adap_list  array of Ad_seq
Returns
-1 error, 0 discarded, 1 accepted as is, 2 accepted and trimmed
Note
Global input parameters from par_TF are also used

◆ trim_sequenceN()

int trim_sequenceN (Fq_read *seq)

trims sequence based on presence of N nucleotides

Parameters
seq  fastq read
Returns
-1 error, 0 discarded, 1 accepted as is, 2 accepted and trimmed

This function calls a different function depending on the method passed as input par_TF.trimN:

  • NO(0): accepts it as is, (1),
  • ALL(1): accepts it as is if NO N's found (1), rejects it otherwise (0),
  • ENDS(2): trims the ends and accepts it if it is longer than minL (2 if trimming, 1 if no trimming), rejects it otherwise (0),
  • STRIP(3): finds the longest N-free subsequence and trims it if it is at least minL nucleotides long (2 if trimming, 1 if no N's are found), rejects it otherwise (0).
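
The ENDS and STRIP strategies listed above can be sketched on a plain, NUL-terminated base string. This is a minimal illustration, not the implementation in trim.c: n_trim_ends and n_strip are hypothetical names, they work on a char string rather than an Fq_read, they only compare against the literal character 'N', and they use "at least minL" as the acceptance condition in both modes, which is an assumption.

```c
#include <string.h>

/* Hypothetical ENDS-like helper: strips N's from both ends only
 * (N's inside the read are left in place, an assumption).
 * The kept range is [*start, *start + *len).
 * Returns 0 rejected, 1 accepted as is, 2 accepted and trimmed. */
int n_trim_ends(const char *s, int minL, int *start, int *len)
{
    int L = (int)strlen(s);
    int lo = 0, hi = L;
    while (lo < hi && s[lo] == 'N') lo++;
    while (hi > lo && s[hi - 1] == 'N') hi--;
    *start = lo;
    *len = hi - lo;
    if (*len < minL) return 0;
    return (*len == L) ? 1 : 2;
}

/* Hypothetical STRIP-like helper: keeps the longest N-free run.
 * Same output convention as n_trim_ends. */
int n_strip(const char *s, int minL, int *start, int *len)
{
    int L = (int)strlen(s);
    int best_start = 0, best_len = 0, run_start = 0;
    for (int i = 0; i <= L; i++) {
        if (i == L || s[i] == 'N') {        /* a run ends here */
            if (i - run_start > best_len) {
                best_len = i - run_start;
                best_start = run_start;
            }
            run_start = i + 1;
        }
    }
    *start = best_start;
    *len = best_len;
    if (best_len < minL) return 0;
    return (best_len == L) ? 1 : 2;
}
```

For example, with minL = 4 the string NNACGTACGTNN is trimmed by n_trim_ends to the 8 bases starting at position 2, and with minL = 5 the string ACGTNACGTACG is reduced by n_strip to the 7 bases starting at position 5.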

◆ trim_sequenceQ()

int trim_sequenceQ (Fq_read *seq)

trims sequence based on lowQ base callings

Parameters
seq  fastq read
Returns
-1 error, 0 discarded, 1 accepted as is, 2 accepted and trimmed

This function calls a different function depending on the method passed as input par_TF.trimQ:

  • NO(0): accepts it as is (1),
  • FRAC(1): accepts it if fewer than par_TF.nlowQ lowQ nucleotides are found (1), rejects it otherwise (0),
  • ENDS(2): trims the ends and accepts it if it is longer than minL (2 if trimming, 1 if no trimming), rejects it otherwise (0),
  • ENDSFRAC(3): trims the ends and accepts it if the remaining sequence is at least minL bases long and contains fewer than nlowQ lowQ nucleotides (2 if trimming, 1 if no trimming), rejects it otherwise (0),
  • GLOBAL(4): trims globleft nucleotides from the left and globright from the right (returns 2).
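
The FRAC, ENDS and ENDSFRAC strategies can be sketched on a plain quality string. This is a hedged illustration rather than the code in trim.c: count_lowQ, q_frac, q_trim_ends and q_trim_endsfrac are hypothetical names, the Phred+33 quality encoding (PHRED_OFFSET) is an assumption, and "at least minL" is used as the length condition throughout.

```c
#include <string.h>

#define PHRED_OFFSET 33   /* assumption: Phred+33 quality encoding */

/* Number of positions in q[start .. start+len-1] with quality below minQ. */
int count_lowQ(const char *q, int start, int len, int minQ)
{
    int n = 0;
    for (int i = start; i < start + len; i++)
        if (q[i] - PHRED_OFFSET < minQ) n++;
    return n;
}

/* FRAC-like: 1 if fewer than nlowQ low-quality positions, 0 otherwise. */
int q_frac(const char *q, int minQ, int nlowQ)
{
    return count_lowQ(q, 0, (int)strlen(q), minQ) < nlowQ ? 1 : 0;
}

/* ENDS-like: strips low-quality positions from both ends.
 * The kept range is [*start, *start + *len).
 * Returns 0 rejected, 1 accepted as is, 2 accepted and trimmed. */
int q_trim_ends(const char *q, int minQ, int minL, int *start, int *len)
{
    int L = (int)strlen(q);
    int lo = 0, hi = L;
    while (lo < hi && q[lo] - PHRED_OFFSET < minQ) lo++;
    while (hi > lo && q[hi - 1] - PHRED_OFFSET < minQ) hi--;
    *start = lo;
    *len = hi - lo;
    if (*len < minL) return 0;
    return (*len == L) ? 1 : 2;
}

/* ENDSFRAC-like: ENDS first, then the FRAC test on what remains. */
int q_trim_endsfrac(const char *q, int minQ, int minL, int nlowQ,
                    int *start, int *len)
{
    int ret = q_trim_ends(q, minQ, minL, start, len);
    if (ret == 0) return 0;
    if (count_lowQ(q, *start, *len, minQ) >= nlowQ) return 0;
    return ret;
}
```

With minQ = 20 and minL = 5, the quality string ##IIIIII#I## is trimmed by q_trim_ends to the 8 positions starting at index 2. One low-quality position remains inside that range, so q_trim_endsfrac rejects the read for nlowQ = 1 and accepts it (trimmed) for nlowQ = 2.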

Variable Documentation

◆ Nencode

int Nencode

global variable: Encoding for N's (\004).

◆ par_TF

Iparam_trimFilter par_TF

global variable: Input parameters of trimFilter.

global variable: Input parameters of makeTree.