FastqPuri
Classes | Typedefs | Functions
bloom.h File Reference

functions that implement the bloom filter More...

#include "city.h"
#include "fa_read.h"
#include "defines.h"
Include dependency graph for bloom.h:
This graph shows which files directly or indirectly include this file:

Go to the source code of this file.

Classes

struct  _bfilter
 Bloom filter structure. More...
 
struct  _bfkmer
 stores a processed kmer (2 bits pro nucleotide) More...
 

Typedefs

typedef struct _bfilter Bfilter
 Bloom filter structure.
 
typedef struct _bfkmer Bfkmer
 stores a processed kmer (2 bits pro nucleotide)
 

Functions

void init_LUTs ()
 look up table initialization More...
 
Bfilterinit_Bfilter (int kmersize, uint64_t bfsizeBits, int hashNum, double falsePosRate, uint64_t nelem)
 initialization of a Bfilter structure More...
 
Bfkmerinit_Bfkmer (int kmersize, int hashNum)
 initializes a Bfkmer structure, given the kmersize and the number of hash functions More...
 
void free_Bfilter (Bfilter *ptr_bf)
 free Bfilter memory
 
void free_Bfkmer (Bfkmer *ptr_bfkmer)
 free Bfkmer
 
int compact_kmer (const unsigned char *sequence, uint64_t position, Bfkmer *ptr_bfkmer)
 compactifies a kmer for insertion in the bloomfilter More...
 
void multiHash (Bfkmer *ptr_bfkmer)
 obtains the hashNum hashvalues for a compactified kmer More...
 
bool insert_and_fetch (Bfilter *pr_bf, Bfkmer *ptr_bfkmer)
 inserts the hashvalues of a kmer in filter More...
 
bool contains (Bfilter *ptr_bf, Bfkmer *ptr_bfkmer)
 check if kmer is contained in the filter More...
 
Bfiltercreate_Bfilter (Fa_data *ptr_fasta, int kmersize, uint64_t bfsizeBits, int hashNum, double falsePosRate, uint64_t nelem)
 creates a bloom filter from a fasta structure. More...
 
void save_Bfilter (Bfilter *ptr_bf, char *filterfile, char *paramfile)
 saves a bloomfilter to disk More...
 
Bfilterread_Bfilter (char *filterfile, char *paramfile)
 reads a bloom filter from a file More...
 

Detailed Description

functions that implement the bloom filter

Author
Paula Perez paula.nosp@m.pere.nosp@m.zrubi.nosp@m.o@gm.nosp@m.ail.c.nosp@m.om
Date
04.09.2017

Function Documentation

◆ compact_kmer()

int compact_kmer ( const unsigned char *  sequence,
uint64_t  position,
Bfkmer ptr_bfkmer 
)

compactifies a kmer for insertion in the bloomfilter

Parameters
sequenceunsigned char DNA sequence (or cDNA)
positionposition in the sequence where the kmer starts
ptr_bfkmerinitialized Bfkmer

The compactified sequence is computed in the following way:

  • We start compactifying both, the forward and backward (reverse complement). The outer loop covers up until half of the sequence.
  • As soon as one of the two is lexicographically smaller, we continue only with it. In that way, the "smaller" sequence is consistently returned.
  • If the sequence is palindromic, we continue with the forward sequence.
  • kmersize should be > 3.

    We illustrate the compactification with an example:

    kmer = TTTT|GGAT
    m_fw = 00000000 | 00000000 // 2 bytes
    m_bw = 00000000 | 00000000 // 2 bytes
    m_fw[0] |= fw0['T'] = 0xC0|0x00; m_bw[0] |= bw0['T'] = 0x00|0x00;
    m_fw[0] |= fw1['T'] = 0xF0|0x00; m_bw[0] |= bw1['A'] = 0x30|0x00;
    m_fw[0] |= fw2['T'] = 0xFC|0x00; m_bw[0] |= bw2['G'] = 0x34|0x00;
    m_fw[0] |= fw3['T'] = 0xFF|0x00; m_bw[0] |= bw3['G'] = 0x35|0x00;
    m_fw[1] |= fw0['G'] = 0xC0|0x80; m_bw[1] |= bw0['T'] = 0x35|0x00;
    m_fw[1] |= fw1['G'] = 0xF0|0xA0; m_bw[1] |= bw1['T'] = 0x35|0x00;
    m_fw[1] |= fw2['A'] = 0xFC|0xA0; m_bw[1] |= bw2['T'] = 0x35|0x00;
    m_fw[1] |= fw3['T'] = 0xFF|0xA3; m_bw[1] |= bw3['T'] = 0x35|0x00;

    (In this case, we would store m_bw)

◆ contains()

bool contains ( Bfilter ptr_bf,
Bfkmer ptr_bfkmer 
)

check if kmer is contained in the filter

Parameters
ptr_bfpointer to a Bfilter structure, where a bloomfilter is stored
ptr_bfkmerpointer to a Bfkmer structure containing the hash values
Returns
true if all corresponding bits were set to 1 in the filter

◆ create_Bfilter()

Bfilter* create_Bfilter ( Fa_data ptr_fasta,
int  kmersize,
uint64_t  bfsizeBits,
int  hashNum,
double  falsePosRate,
uint64_t  nelem 
)

creates a bloom filter from a fasta structure.

Parameters
ptr_fastapointer to fasta structure
kmersizelength of kmers to be inserted in the filter
bfsizeBitssize of Bloom filter in bits
hashNumnumber of hash functions to be used
falsePosRatefalse positive rate
nelemnumber of elemens (kmers in the sequece) contained in the filter
Returns
pointer to Bloom filter structure, where the fasta file was encoded.

◆ init_Bfilter()

Bfilter* init_Bfilter ( int  kmersize,
uint64_t  bfsizeBits,
int  hashNum,
double  falsePosRate,
uint64_t  nelem 
)

initialization of a Bfilter structure

Parameters
kmersizenumber of elements of the kmer
bfsizeBitssize of the bloomfilter (in Bits)
hashNumnumber of hash functions to be computed
falsePosRatefalse positive rate
nelemnumber of elemens (kmers in the sequece) contained in the filter
Returns
pointer to initialized Bfilter structure

Given a kmersize, bfsizeBits, number of hash functions, we assign these values to the struture and the two additional values: kmersizeBytes = (kmersize + BASESINCHAR - 1 )/BASESINCHAR

◆ init_Bfkmer()

Bfkmer* init_Bfkmer ( int  kmersize,
int  hashNum 
)

initializes a Bfkmer structure, given the kmersize and the number of hash functions

Parameters
kmersizenumber of elements of the kmer
hashNumnumber of hash functions to be computed
Returns
pointer to a Bfkmer structure

kmersizeBytes, halfsizeBytes, hangingBases, hasOverhead hashNum are assigned and memory is allocated and set to 0 for compact and hashValues

◆ init_LUTs()

void init_LUTs ( )

look up table initialization

It initializes: fw0, fw1, fw2, fw3, bw0, bw2, bw3, bw4. They are uint8_t arrays with 256 elements. All elements are set to 0xFF excepting the ones corresponding to 'a', 'A', 'c', 'C', 'g', 'G', 't', 'T':

Var a,A c,C g,G t,T Var a,A c,C g,G t,T
fw0 0x00 0x40 0x80 0xC0 bw0 0xC0 0x80 0x40 0x00
fw1 0x00 0x10 0x20 0x30 bw1 0x30 0x20 0x10 0x00
fw2 0x00 0x04 0x08 0x0C bw2 0x0C 0x08 0x04 0x00
fw3 0x00 0x01 0x02 0x03 bw3 0x03 0x02 0x01 0x00

With these variables, we will be able to encode a Sequence using 2 bits per nucleotide.

◆ insert_and_fetch()

bool insert_and_fetch ( Bfilter ptr_bf,
Bfkmer ptr_bfkmer 
)

inserts the hashvalues of a kmer in filter

Parameters
ptr_bfpointer to Bfilter structure, where we will include the new entry
ptr_bfkmerpointer to Bfkmer structure, where the hashvalues are stored
Returns
true if the positions of the hash values were already set to one previously.

The hash values are inserted in the following way.

  • modValue = hashvalue mod(filter size) is calculated.
  • the bit in position modValue of the filter is set to 1.

◆ multiHash()

void multiHash ( Bfkmer ptr_bfkmer)

obtains the hashNum hashvalues for a compactified kmer

The hash values are computed using the CityHash64 hash functions.

◆ read_Bfilter()

Bfilter* read_Bfilter ( char *  filterfile,
char *  paramfile 
)

reads a bloom filter from a file

Parameters
filterfilepath to file containing the filter
paramfilepath to file containing the filter
Returns
a pointer to a filter structure containing the bloomfilter

This function reads two files, the auxiliar inputfile where kmersize, hashNum and bfsizeBits are stored, and the actual filter file. If one of them is missing, the program exits with an error. If successful, a pointer to a Bfilter structure with the bloom filter is return

◆ save_Bfilter()

void save_Bfilter ( Bfilter ptr_bf,
char *  filterfile,
char *  paramfile 
)

saves a bloomfilter to disk

Parameters
ptr_bfpointer to Bfilter structure (contains the filter)
filterfilepath to file where the output will be stored
paramfilepath to file where the prameters will be stored

This function will save the bloomfilter in the path filterfile. The paramfile will store the following data:

  • kmersize
  • hashNum
  • bfsizeBits
  • falsePosRate
  • nelem