FastqPuri
functions that implement the bloom filter
Classes | |
struct | _bfilter |
Bloom filter structure. | |
struct | _bfkmer |
stores a processed kmer (2 bits per nucleotide) | |
Typedefs | |
typedef struct _bfilter | Bfilter |
Bloom filter structure. | |
typedef struct _bfkmer | Bfkmer |
stores a processed kmer (2 bits per nucleotide) | |
Functions | |
void | init_LUTs () |
look up table initialization | |
Bfilter * | init_Bfilter (int kmersize, uint64_t bfsizeBits, int hashNum, double falsePosRate, uint64_t nelem) |
initialization of a Bfilter structure | |
Bfkmer * | init_Bfkmer (int kmersize, int hashNum) |
initializes a Bfkmer structure, given the kmersize and the number of hash functions | |
void | free_Bfilter (Bfilter *ptr_bf) |
free Bfilter memory | |
void | free_Bfkmer (Bfkmer *ptr_bfkmer) |
free Bfkmer | |
int | compact_kmer (const unsigned char *sequence, uint64_t position, Bfkmer *ptr_bfkmer) |
compactifies a kmer for insertion in the bloomfilter | |
void | multiHash (Bfkmer *ptr_bfkmer) |
obtains the hashNum hashvalues for a compactified kmer | |
bool | insert_and_fetch (Bfilter *pr_bf, Bfkmer *ptr_bfkmer) |
inserts the hashvalues of a kmer in filter | |
bool | contains (Bfilter *ptr_bf, Bfkmer *ptr_bfkmer) |
check if kmer is contained in the filter | |
Bfilter * | create_Bfilter (Fa_data *ptr_fasta, int kmersize, uint64_t bfsizeBits, int hashNum, double falsePosRate, uint64_t nelem) |
creates a bloom filter from a fasta structure. | |
void | save_Bfilter (Bfilter *ptr_bf, char *filterfile, char *paramfile) |
saves a bloomfilter to disk | |
Bfilter * | read_Bfilter (char *filterfile, char *paramfile) |
reads a bloom filter from a file | |
functions that implement the bloom filter
int compact_kmer | ( | const unsigned char * | sequence, |
uint64_t | position, | ||
Bfkmer * | ptr_bfkmer | ||
) |
compactifies a kmer for insertion in the bloomfilter
sequence | unsigned char DNA sequence (or cDNA) |
position | position in the sequence where the kmer starts |
ptr_bfkmer | initialized Bfkmer |
The compactified sequence is computed in the following way:
kmersize should be > 3.
We illustrate the compactification with an example:
(In this case, we would store m_bw)
bool contains | ( | Bfilter * | ptr_bf, |
Bfkmer * | ptr_bfkmer | ||
) |
check if kmer is contained in the filter
ptr_bf | pointer to a Bfilter structure, where a bloomfilter is stored |
ptr_bfkmer | pointer to a Bfkmer structure containing the hash values |
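A membership test in a Bloom filter checks that every bit selected by the hash values is set. The following is a minimal sketch of that textbook operation, not the library's implementation: SketchFilter, sketch_contains and sketch_insert_and_fetch are hypothetical names, mapping a hash value to a bit with a modulo is an assumption, and so is the return convention of the insert variant (true when all bits were already set).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for Bfilter: a bit array and its size in bits. */
typedef struct {
    uint8_t *bits;   /* at least (nbits + 7) / 8 bytes */
    uint64_t nbits;  /* plays the role of bfsizeBits */
} SketchFilter;

/* Returns true only if every bit addressed by the hash values is set. */
bool sketch_contains(const SketchFilter *f, const uint64_t *hashes, int hashNum) {
    assert(f != NULL && f->nbits > 0);
    for (int i = 0; i < hashNum; i++) {
        uint64_t pos = hashes[i] % f->nbits;            /* assumed mapping */
        uint8_t mask = (uint8_t)(1u << (pos & 7));
        if (!(f->bits[pos >> 3] & mask)) return false;  /* one clear bit: absent */
    }
    return true;
}

/* Sets every addressed bit; returns true if all of them were set before. */
bool sketch_insert_and_fetch(SketchFilter *f, const uint64_t *hashes, int hashNum) {
    assert(f != NULL && f->nbits > 0);
    bool present = true;
    for (int i = 0; i < hashNum; i++) {
        uint64_t pos = hashes[i] % f->nbits;
        uint8_t mask = (uint8_t)(1u << (pos & 7));
        if (!(f->bits[pos >> 3] & mask)) present = false;
        f->bits[pos >> 3] |= mask;
    }
    return present;
}
```

A true result from the membership test can be a false positive; a false result is always correct.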
Bfilter* create_Bfilter | ( | Fa_data * | ptr_fasta, |
int | kmersize, | ||
uint64_t | bfsizeBits, | ||
int | hashNum, | ||
double | falsePosRate, | ||
uint64_t | nelem | ||
) |
creates a bloom filter from a fasta structure.
ptr_fasta | pointer to fasta structure |
kmersize | length of kmers to be inserted in the filter |
bfsizeBits | size of Bloom filter in bits |
hashNum | number of hash functions to be used |
falsePosRate | false positive rate |
nelem | number of elements (kmers in the sequence) contained in the filter |
Bfilter* init_Bfilter | ( | int | kmersize, |
uint64_t | bfsizeBits, | ||
int | hashNum, | ||
double | falsePosRate, | ||
uint64_t | nelem | ||
) |
initialization of a Bfilter structure
kmersize | number of elements of the kmer |
bfsizeBits | size of the bloomfilter (in Bits) |
hashNum | number of hash functions to be computed |
falsePosRate | false positive rate |
nelem | number of elements (kmers in the sequence) contained in the filter |
Given a kmersize, bfsizeBits, and number of hash functions, we assign these values to the structure, together with two additional values: kmersizeBytes = (kmersize + BASESINCHAR - 1 )/BASESINCHAR
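The kmersizeBytes expression is an integer ceiling division: it gives the number of bytes needed to hold kmersize bases. A minimal sketch, assuming BASESINCHAR is 4 (2 bits per nucleotide in an 8-bit char); kmersize_bytes is a hypothetical helper, not a library function.

```c
#include <assert.h>

#define BASESINCHAR 4  /* assumption: 8 bits per char / 2 bits per base */

/* Number of bytes needed to store kmersize bases, rounded up. */
int kmersize_bytes(int kmersize) {
    assert(kmersize > 0);
    return (kmersize + BASESINCHAR - 1) / BASESINCHAR;
}
```

Under this assumption a 24-mer fits exactly in 6 bytes, while a 25-mer needs 7 bytes, with the last byte holding a single base.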
Bfkmer* init_Bfkmer | ( | int | kmersize, |
int | hashNum | ||
) |
initializes a Bfkmer structure, given the kmersize and the number of hash functions
kmersize | number of elements of the kmer |
hashNum | number of hash functions to be computed |
kmersizeBytes, halfsizeBytes, hangingBases, hasOverhead, and hashNum are assigned, and memory is allocated and set to 0 for compact and hashValues.
void init_LUTs | ( | ) |
look up table initialization
It initializes: fw0, fw1, fw2, fw3, bw0, bw1, bw2, bw3. They are uint8_t arrays with 256 elements. All elements are set to 0xFF except the ones corresponding to 'a', 'A', 'c', 'C', 'g', 'G', 't', 'T':
Var | a,A | c,C | g,G | t,T | Var | a,A | c,C | g,G | t,T |
---|---|---|---|---|---|---|---|---|---|
fw0 | 0x00 | 0x40 | 0x80 | 0xC0 | bw0 | 0xC0 | 0x80 | 0x40 | 0x00 |
fw1 | 0x00 | 0x10 | 0x20 | 0x30 | bw1 | 0x30 | 0x20 | 0x10 | 0x00 |
fw2 | 0x00 | 0x04 | 0x08 | 0x0C | bw2 | 0x0C | 0x08 | 0x04 | 0x00 |
fw3 | 0x00 | 0x01 | 0x02 | 0x03 | bw3 | 0x03 | 0x02 | 0x01 | 0x00 |
With these variables, we will be able to encode a sequence using 2 bits per nucleotide.
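The forward tables can be read as "the 2-bit code of a base, already shifted into slot 0 to 3 of a byte", so four bases are packed by OR-ing one lookup per slot. A minimal sketch under stated assumptions: the two-dimensional array fw, init_fw_luts and pack_forward are hypothetical names, and how compact_kmer actually combines the tables is not shown here. Only the fw values from the table are reproduced; in the table, the bw entries hold the code of the complementary base.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BASESINCHAR 4  /* assumption: 4 bases per byte at 2 bits each */

/* Hypothetical stand-in for fw0..fw3: fw[slot][character]. */
static uint8_t fw[BASESINCHAR][256];

/* Fill the forward tables: 0xFF everywhere except a/A, c/C, g/G, t/T. */
void init_fw_luts(void) {
    const char *upper = "ACGT";  /* 2-bit codes 0, 1, 2, 3 */
    const char *lower = "acgt";
    memset(fw, 0xFF, sizeof fw);
    for (int slot = 0; slot < BASESINCHAR; slot++) {
        int shift = 6 - 2 * slot;  /* fw0 uses bits 7..6, fw3 uses bits 1..0 */
        for (int code = 0; code < 4; code++) {
            uint8_t v = (uint8_t)(code << shift);
            fw[slot][(unsigned char)upper[code]] = v;
            fw[slot][(unsigned char)lower[code]] = v;
        }
    }
}

/* Pack k bases into out, which needs (k + 3) / 4 bytes.
   Returns 1 on success, 0 when a character other than a/c/g/t is met. */
int pack_forward(const unsigned char *seq, size_t k, uint8_t *out) {
    assert(seq != NULL && out != NULL);
    memset(out, 0, (k + BASESINCHAR - 1) / BASESINCHAR);
    for (size_t i = 0; i < k; i++) {
        uint8_t v = fw[i % BASESINCHAR][seq[i]];
        if (v == 0xFF) return 0;       /* 0xFF marks an invalid character */
        out[i / BASESINCHAR] |= v;
    }
    return 1;
}
```

With these tables, the four bases ACGT pack into the single byte 0x1B (0x00 | 0x10 | 0x08 | 0x03). The value 0xFF works as an invalid marker because no valid entry sets more than two bits.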
bool insert_and_fetch | ( | Bfilter * | ptr_bf, |
Bfkmer * | ptr_bfkmer | ||
) |
inserts the hashvalues of a kmer in filter
ptr_bf | pointer to Bfilter structure, where we will include the new entry |
ptr_bfkmer | pointer to Bfkmer structure, where the hashvalues are stored |
The hash values are inserted in the following way.
void multiHash | ( | Bfkmer * | ptr_bfkmer | ) |
obtains the hashNum hashvalues for a compactified kmer
The hash values are computed using the CityHash64 hash functions.
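One common way to obtain several hash values from a single hash function is to run it once per seed. The sketch below illustrates only that pattern. It does not use CityHash: a seeded 64-bit FNV-1a is swapped in so the example stays self-contained, and fnv1a64_seeded and multi_hash_sketch are hypothetical names. Whether multiHash seeds CityHash64 in this way is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Seeded 64-bit FNV-1a, used here only as a stand-in for CityHash64. */
uint64_t fnv1a64_seeded(const uint8_t *data, size_t len, uint64_t seed) {
    uint64_t h = 0xcbf29ce484222325ULL ^ seed;  /* FNV offset basis, mixed with the seed */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 0x100000001b3ULL;                  /* FNV 64-bit prime */
    }
    return h;
}

/* Fill hashValues[0..hashNum-1] with one hash per seed 0, 1, 2, ... */
void multi_hash_sketch(const uint8_t *compact, size_t nbytes,
                       uint64_t *hashValues, int hashNum) {
    assert(compact != NULL && hashValues != NULL && hashNum > 0);
    for (int i = 0; i < hashNum; i++) {
        hashValues[i] = fnv1a64_seeded(compact, nbytes, (uint64_t)i);
    }
}
```

The values are deterministic for a given compactified kmer, which is what lets the same kmer address the same filter bits on insertion and on lookup.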
Bfilter* read_Bfilter | ( | char * | filterfile, |
char * | paramfile | ||
) |
reads a bloom filter from a file
filterfile | path to file containing the filter |
paramfile | path to file containing the filter parameters |
This function reads two files: the auxiliary parameter file, where kmersize, hashNum and bfsizeBits are stored, and the actual filter file. If either of them is missing, the program exits with an error. If successful, a pointer to a Bfilter structure with the bloom filter is returned.
void save_Bfilter | ( | Bfilter * | ptr_bf, |
char * | filterfile, | ||
char * | paramfile | ||
) |
saves a bloomfilter to disk
ptr_bf | pointer to Bfilter structure (contains the filter) |
filterfile | path to file where the output will be stored |
paramfile | path to file where the parameters will be stored |
This function will save the bloomfilter in the path filterfile. The paramfile will store the following data: