Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

5.4 Drawing Restriction Maps

One
of the most important lessons of scientific programming is the
importance of a good display of a program's results.

In this section, I'm going to add the ability to
output a restriction map by inheriting
Restriction.pm and making a new derived class
Restrictionmap.pm. The restriction map will be
shown very simply as the sequence printed in lines of simple text.
The locations of restriction sites are written over the lines of
text, giving the names of the restriction enzymes at the restriction
sites. In Chapter 8, this simple text-based
graphic output is replaced by a real picture (with colors, different
fonts, and whatever bells and whistles you choose to add). I designed
my base class Restriction.pm to represent the
restriction map as a simple list of recognition site locations
because I wanted my software to be flexible enough to be extended to
accommodate any of the many different graphic formats that might be
desired.

The difference between an unreadable mass of data, and a good clean
graphic display that leads the eye towards an interesting result, is
profound. It's the difference between a successful
program and a dud, between hours or days spent sorting through
columns of data and a quick discovery of a region of interest.
It's even the difference between a scientific
discovery, and none at all.

Graphics programming is a bit of an advanced topic.
(You'll get your feet wet in Chapter 8.) But even if you need a program
that's restricted to text output, you still need to
spend programming time displaying that output in a useful manner. So,
in this section, I'll add a very simple map drawing
capability to the software, drawn as simple text.

5.4.1 Storing Graphics Output in an Attribute

I want to display a graphic. Do I need to add
_graphic and _graphictype
attributes to my object? The question boils down to: shall I compute
and display a graphic whenever needed, or shall I compute a graphic
and store it in an attribute? If you're new to
computer graphics, just think of a graphic as a mass of data, which
can be stored in a variable, or in a file on disk, and can be used to
generate a graphical display on the computer screen by the right
software.

I do already compute the restriction map, by which I mean the actual
locations of the recognition sites for each enzyme requested, and
stored it in the _map attribute. I can also
compute a graphic at the same time and store it in the proposed
_graphic attribute.

Let's
think about the pluses and minuses of storing graphics output in an
attribute.

I'm not sure which graphics system
I'll use in the future; for now, a simple text
output may work as a stored attribute, but there are a lot of
graphics outputs possible. Also, it seems likely that
I'll be able to compute the graphics output very
quickly, so the need for storing it is less compelling. And storing a
large image for possibly hundreds or thousands of objects will be a
strain on the computer system.

However, I may not be able to calculate quickly for fancier, full
color, high resolution graphics that I might want in the future. And
perhaps I'll need to flip between graphics very
quickly, in which case having them precalculated would be a
necessity!

Also, how about multiple digests? Should I just create one graphic
for the entire set of enzymes in an object or add a single digest
graphic for each enzyme?

Maybe I can't answer all these questions yet,
either. At this point, I'm not sure of the kind of
graphics I'll be producing, or of their size, or of
the time to generate them. Perhaps I can decide to make just one
graphic per object (the user can always make a new object with just
one enzyme if desired).

Answering these questions is a judgment call. When the answers to
such questions aren't known, you should at least try
to make the code flexible enough to accommodate whatever decisions
are eventually reached.

For now, let's add a graphics attribute and
precalculate the graphics output. After all, if in the future I
decide that, for instance, I can generate graphics quickly enough on
the fly, but that they're a bit large,
I'll just add a method that can generate the images
when needed.

5.4.2 The Restrictionmap Class

Here's a simple
extension of the Restriction class
I'll call Restrictionmap, which
adds a new _mapgraphics attribute. It
doesn't compute the graphics output until the first
time it's called; it then stores the result for
future lookups:

package Restrictionmap;
use base ( "Restriction" );
#
# A class to find locations of restriction enzyme recognition sites in
#  DNA sequence data, and to display them.
#
use strict;
use warnings;
use Carp;
# Class data and methods
{
# A list of all attributes with default values.
my %_attributes = (
#    key   = restriction enzyme name
#    value = space-separated string of recognition sites => regular expressions
_rebase      => { },    # A Rebase.pm object
_sequence    => '',     # DNA sequence data in raw format (only bases)
_enzyme      => '',     # space separated string of one or more enzyme names
_map         => { },    # hash: enzyme names => arrays of locations
_graphictype => 'text', # one of 'text' or 'png' or some other
_graphic     => '',     # a graphic display of the restriction map
);
# Return a list of all attributes
sub _all_attributes {
keys %_attributes;
}
}

Notice that I've declared a new class
Restrictionmap using the base class
Restriction. Notice also that no
AUTOLOAD mechanism is provided.

Also, in the block containing the class data and methods,
there's a new definition of the hash
%_attributes that adds the new attribute
_graphictype to indicate the type of graphic (at
first, this will be "text"), and the new attribute
_graphic as a simple scalar.

Why am I defining it as a scalar? As you'll see, the
routines that make the text-based graphics that I'm
adding here assemble the graphics output as an array. To store it as
a scalar requires an extra call to the Perl join
function.

The reason I'm storing the
graphics display as a scalar is that
image data comes in a variety of formats. If I plan on just saving an
image as a scalar, it will perhaps cover the most common
possibilities; it's always possible to read a file
into a scalar in Perl. Hopefully, this decision to store the image
data as a scalar gives the most flexibility for future development.

This opening block is similar to the opening block of the base class
Restriction. Notice the only helper method
I'm redefining is the
_all_attributes method, which uses the redefined
%_attributes data structure that appears in the
same block with it. The other methods in this block in the base class
count the number of objects in existence during the running of the
program; those methods and the _count variable
they depend on are inherited by this new derived class
Restrictionmap and will work fine.

5.4.2.1 Adding graphics capability to the class

Next, come the methods that add the graphics capability to the class:

sub get_graphic {
my($self) = @_;
# If the graphic is not stored, calculate and store it
unless($self->{_graphic}) {
unless($self->{_graphictype}) {
croak 'Attribute graphictype not set (default is "text")';
}
# if graphictype is "xyz", method that makes the graphic is "_drawmap_xyz"
my $drawmapfunctionname = "_drawmap_" . $self->{_graphictype};
# Calculate and store the graphic
$self->{_graphic} = $self->$drawmapfunctionname;
}
# Return the stored graphic
return $self->{_graphic};
}

This is the public method that is meant to be called from a program.
Recall that I decided to only calculate the graphics for an object
the first time the graphic is requested; it's then
saved in the _mapgraphic attribute for subsequent
calls.

5.4.2.2 Creation of the graphic

The first step in graphic creation is an initialization routine. This
doesn't do much here, but other graphics I might
create may well benefit from a separate initialization step, so I
plan ahead and add it here as well.

Each enzyme is called in turn, and its matching positions are
retrieved as you've seen before with the
get_enzyme_map method. The appropriate annotation
is added to the @annotation array. The resulting
annotation is then formatted for output and returned:

#
# Methods to output graphics in text format
#
sub _drawmap_text {
my($self) = @_;
my @annotation = ();
push(@annotation, _initialize_annotation_text($self->get_sequence));
foreach my $enzyme ($self->get_enzyme_names) {
_add_annotation_text(annotation, $enzyme, $self->get_enzyme_map($enzyme));
}
# Format the restriction map as sequence and annotation
my @output = _formatmaptext(50, $self->get_sequence, @annotation);
# Return output as a string, not an array of lines
return join('', @output);
}
#  Make a blank string of the same length as the given sequence string
sub _initialize_annotation_text {
my($seq) = @_;
return '' x length($seq);

The idea behind this text-based annotation is as follows. The lines
of sequence output are separated by blank lines just for readability.
Directly above the sequence lines are additional annotation lines
containing the names of enzymes, which appear with their names
starting exactly where the recognition site starts. Sometimes, enzyme
names collidetwo or more enzyme names might require the same
space on an annotation line. In that case, an additional annotation
line is created to print these overlapping enzyme names.
There's an example of this in the program output
following this discussion.

The code is fairly well commented. Check out the exercises: one will
challenge you to improve it or to start from scratch and invent a
better way to display this kind of text-based annotation.
I'll omit a detailed commentary on exactly how this
code accomplishes its job; as they say, it will be left as an
exercise for the reader. Or, as one of my mathematics professors used
to delight in saying, "A moment's
reflection should clear the matter up for you."

#   Add annotation to an annotation string
sub _add_annotation_text {
my($array, $enz, @pos) = @_;
# $array is a reference to an array of annotations
# Put the labels for the enzyme name at the correct positions in the annotation
foreach my $location (@pos) {
# Loop through all the annotation strings as necessary
for( my $i = 0 ; $i < @$array ; ++$i ) {
# If the annotation contains only space characters at that position,
# insert the annotation
if(substr($$array[$i], $location-1, length($enz)) eq (' ' x length($enz))){
substr($$array[$i], $location-1, length($enz)) = $enz;
last;
# If the annotation collides, add it to the next annotation string on the
# next iteration of the "for" loop.
# But first, if there is not another annotation string, make one
}elsif($i == (@$array - 1)) {
push(@$array, _initialize_annotation_text($$array[0]));
}
}
}
}
# Sequence with annotation lines formatted for the page with line breaks
sub _formatmaptext {
my($line_length, $seq, @annotation) = @_;
my(@output) = ();
# Split strings into lines of $line_length
for ( my $pos = 0 ; $pos < length($seq) ; $pos += $line_length ) {
# Print annotation on top of sequence, using reverse
foreach my $string ( reverse ($seq, @annotation) ) {
# Discard blank lines?
# if ( substr($string, $pos, $line_length) !~ /[^ \n]/ ) {
#     next;
# }
# Add line to output
push(@output, substr($string, $pos, $line_length) . "\n");
}
# separate the lines
push(@output,"\n");
}
# Return the merged annotation and sequence
return @output;
}
=head1 Restrictionmap
Restrictionmap: Given a Rebase object, sequence, and list of restriction enzyme
names, return the locations of the recognition sites in the sequence
=head1 Synopsis
use Restrictionmap;
use Rebase;
use strict;
use warnings;
my $rebase = Rebase->new(
dbmfile => 'BIONET',
bionetfile => 'bionet.212'
);
my $restrict = Restrictionmap->new(
rebase => $rebase,
enzyme => 'EcoRI',
enzyme => 'HindIII',
sequence => 'ACGAATTCCGGAATTCG',
graphictype => 'text',
);
print "Locations are ", join ' ', $restrict->get_enzyme_map('EcoRI'), "\n";
print $restrict->get_graphic;
=head1 AUTHOR
James Tisdall
=head1 COPYRIGHT
Copyright (c) 2003, James Tisdall
=cut
1;

5.4.2.3 Running the program

If
you run the short program given in the Synopsis
section of the documentation, you won't get very
interesting output; the input data is too short. The following is a
better example. There's some DNA sequence data that
contains EcoRI and HindIII recognition sites, saved in a file called
sampleecori.dna. The class
SeqFileIO is used to read it in, and then
Restrictionmap makes and displays a restriction
map of that sequence.

Here's the program, an extension, really, of the
short program that appears in the Synopsis section
of the POD documentation. Let's use it to test this
new Restrictionmap class:

use lib "/home/tisdall/MasteringPerlBio/development/lib";
use Restrictionmap;
use Rebase;
use SeqFileIO;
use strict;
use warnings;
my $rebase = Rebase->new(
dbmfile => 'BIONET',
bionetfile => 'bionet.212',
mode => '0666',
);
my $restrict = Restrictionmap->new(
rebase => $rebase,
enzyme => 'EcoRI HindIII',  # GAATTC # AAGCTT
sequence => 'ACGAATTCCGGAATTCG',
graphictype => 'text',
);
print "Locations are ", join ' ', $restrict->get_enzyme_map('EcoRI'), "\n";
print $restrict->get_graphic;
## Some bigger sequence
my $biggerseq = SeqFileIO->new;
#$biggerseq->read(filename => 'map.fasta');
$biggerseq->read(filename => 'sampleecori.dna');
my $restrict2 = Restrictionmap->new(
rebase => $rebase,
enzyme => 'EcoRI HindIII',  # GAATTC # AAGCTT
sequence => $biggerseq->get_sequence,
graphictype => 'text',
);
print "\nHere is the map of the bigger sequence:\n\n";
print $restrict2->get_graphic;

Notice that a use lib directive is added to tell Perl where
to find the modules; you can accomplish the same thing with the
PERL5LIB environmental variable or with
a command-line argument to Perl.

Here's is the output from the test program:

Locations are 3 11 
EcoRI   EcoRI  
ACGAATTCCGGAATTCG
Here is the map of the bigger sequence:
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGG
TTTGCAGCGGAGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCT
EcoRI                        
GGGAGGCGTGACTAGAAGCGGGAATTCAAGTAGTTGTGGGCGCCTTTGCA
ACCGCCTGGGACGCCGCCGAGTGGTCTGTGCAGGTTCGCGGGTCGCTGGC
HindIII                                  
GGGGGTCGTAAGCTTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAG
ATGGTTCAGACCCAGAGCCTCCAGATGCCGGGGAGGACAGCAAGTCCGAG
AATGGGGAGAATGCGCCCATCTACTGCATCTGCCGCAAACCGGACATCAA
HindIII                     
CTGCTTCATGATCGGGTGTGACAAGCTTAACTGCAATGAGTGGTTCCATG
GGGACTGCATCCGGATCAGCGGGATGGCAATGAGCGGGACAGCAGTGAGC
CCCGGGATGAGGGTGGAGGGCGCAAGAGGCCTGTCCCTGATCCAGACCTG
CAGCGCCGGGCAGGGTCAGGGACAGGGGTTGGGGCCATGCTTGCTCGGGG
EcoRI
CTCTGCTTCGCCCCACAAATCCTCTCCGCAGCCCTTGGTGGCCACGAATT                      
CACCCAGCCAGCATCACCAGCAGCAGCAGCAGCAGATCAAACGGTCAGCC
EcoRI                                       
HindIII                                           
AAGCTTGAATTCCGCATGTGTGGTGAGTGTGAGGCACCAGTGACGCCCTC
AGAGTCCCTGCCAAGGCCCCGCCGGCCACTGCCCACCCAACAGCAGCCAC
EcoRI                          
AGCCATCACAGAAGTTAGGGAATTCGCGCATCCGTGAAGATGAGGGGGCA
EcoRI              
GTGGCGTCATCAACAGTCAAGGAGCCTCCTGGAATTCAGGCTACAGCCAC
ACCTGAGCCACTCTCAGATGAGGACCTA

As you see, the sequence is printed out formatted for the page with
the names of restriction enzymes appearing above their recognition
sites.

Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی