
NAME

Sanger::CGP::TraFiC::Formatter

SYNOPSIS

 use Sanger::CGP::TraFiC::Formatter;
 my $formatter = Sanger::CGP::TraFiC::Formatter->new;
 $formatter->set_output($output_dir);
 $formatter->format_rm_results($repeatmasker_files, $fasta_files);

GENERAL

Designed to take one or more RepeatMasker *.out files, along with the fasta files that were given to RepeatMasker as input, and generate two output files (positive anchors, negative anchors). These contain the full information about the mapped end of those reads that were successfully processed by RepeatMasker.

All the data required is encoded in the fasta header for each read:

1. readname
2. chr
3. pos
4. repeat family
5. strand (+/-)

Items 2, 3 and 5 are based on the data for the mapped end of the pair.

This means we only need to hold the readname and repeat family from the RepeatMasker output in memory, as the rest can be read back from the RepeatMasker input files with no memory overhead.
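The memory strategy described above can be sketched as follows. This is only an illustration and not the module's actual code: the hash name, the sample reads, the tab-separated header layout and the split of anchors by strand are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical in-memory map built from the RepeatMasker output:
# readname => repeat family. Names and values are illustrative only.
my %family_of = (
    read_1 => 'L1',
    read_3 => 'Alu',
);

# Hypothetical fasta content. ASSUMPTION: the header carries readname,
# chr, pos and strand separated by tabs; the real layout may differ.
my $fasta = join "\n",
    ">read_1\t1\t1000\t+", 'ACGTACGT',
    ">read_2\t2\t2000\t-", 'TTTTAAAA',
    ">read_3\tX\t3000\t-", 'GGGGCCCC',
    '';

my (@pos_hits, @neg_hits);

# Read from an in-memory filehandle so the sketch is self-contained.
open my $fh, '<', \$fasta or die "Cannot open in-memory fasta: $!";
while (my $line = <$fh>) {
    next unless $line =~ /^>/;      # only header lines carry the data
    chomp $line;
    my ($readname, $chr, $pos, $strand) = split /\t/, substr($line, 1);
    my $family = $family_of{$readname};
    next unless defined $family;    # read was not masked, so skip it
    my $record = join "\t", $readname, $chr, $pos, $family, $strand;
    # ASSUMPTION: anchors are separated by the strand of the mapped end.
    if ($strand eq '+') { push @pos_hits, $record }
    else                { push @neg_hits, $record }
}
close $fh;

print "$_\n" for @pos_hits, @neg_hits;
```

Only the small readname/family hash is kept in memory; the fasta input is streamed one line at a time and each sequence line is discarded immediately.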

METHODS

Constructor/configuration

new

Description

Initialises the object and ensures that unix sort is available in path.

 my $formatter = Sanger::CGP::TraFiC::Formatter->new;

Errors

Will croak if unable to find sort in path with this message:

 Unable to find standard unix 'sort' in path
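
A check of this kind could look like the sketch below. The helper find_in_path is hypothetical and uses only core modules; how the module itself locates sort is not stated here, so treat this as one possible approach.

```perl
use strict;
use warnings;
use Carp qw(croak);
use File::Spec;

# Hypothetical helper: returns the full path of the first executable
# file named $program found in the PATH directories, or undef.
sub find_in_path {
    my ($program) = @_;
    for my $dir (File::Spec->path) {
        my $candidate = File::Spec->catfile($dir, $program);
        return $candidate if -f $candidate && -x _;
    }
    return;
}

# Croak with the documented message when sort cannot be located.
my $sort = find_in_path('sort')
    or croak "Unable to find standard unix 'sort' in path";
print "Using sort at: $sort\n";
```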

set_output

Description

Specify the output folder for collated results.

 $formatter->set_output($output_path);

silence

Description

Allows the user to silence internal messages. Mainly added for the test system.

 $formatter->silence;

unsilence

Description

Allows the user to resume internal messages.

 $formatter->unsilence;

Processing

format_rm_results

Description

Main processing function. set_output must be called prior to this.

 $formatter->format_rm_results(\@rm_out_files, \@rm_in_fasta);

Args

 rm_out_files - \@ of RepeatMasker output files (*.out).
 rm_in_fasta  - \@ of the fasta files presented to RepeatMasker.

Returns

Nothing is returned; results are written to the specified output location as pos/neg_hits.txt.

load_rm_output

Description

Loads the RepeatMasker output file data. Only the best/longest hit for each read is retained. Not really intended for external use, but there is no reason not to use it if you want to parse a RepeatMasker file for the basic data retrieved.

Args

 rm_out_files - \@ of RepeatMasker output files (*.out).

Returns

 \% where key is readname and value is the family that masked this read.
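
The kind of parsing described here can be sketched as below. This is not the module's implementation. It assumes the standard whitespace-separated RepeatMasker *.out columns, takes the family from the class/family column, and treats "best/longest" as the longest query span with ties broken by score; all three are assumptions, and the sample data is made up.

```perl
use strict;
use warnings;

# Sample text in the layout of a RepeatMasker *.out file (three header
# lines, then whitespace-separated columns). Values are made up.
my $rm_out = <<'END_RM';
   SW   perc perc perc  query     position in query    matching  repeat        position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat    class/family  begin end  (left)  ID

  463    1.3  0.6  1.7  read_1        1   40    (60) + L1HS      LINE/L1        5900 5939  (172)   1
  612    2.0  0.0  0.0  read_1       10   90    (10) C AluY      SINE/Alu       (20)  281   201    2
  300    5.1  0.0  0.0  read_2        5   70    (30) + L1PA2     LINE/L1           1   66  (246)   3
END_RM

my %best;    # readname => { family => ..., length => ..., score => ... }

# Read from an in-memory filehandle so the sketch is self-contained.
open my $fh, '<', \$rm_out or die "Cannot open in-memory data: $!";
while (my $line = <$fh>) {
    next unless $line =~ /^\s*\d/;    # skip header and blank lines
    my @f = split ' ', $line;
    my ($score, $readname, $begin, $end, $family) = @f[0, 4, 5, 6, 10];
    my $length = $end - $begin + 1;
    my $seen   = $best{$readname};
    # ASSUMPTION: "best" means longest match, ties broken by score.
    next if $seen
        && ($seen->{length} > $length
            || ($seen->{length} == $length && $seen->{score} >= $score));
    $best{$readname} = { family => $family, length => $length, score => $score };
}
close $fh;

# Reduce to the documented shape: readname => family.
my %family_of = map { ($_ => $best{$_}{family}) } keys %best;
print "$_\t$family_of{$_}\n" for sort keys %family_of;
```

For read_1 the second hit spans more of the read than the first, so it replaces the earlier entry and only one family per read is held.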
