Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

5.2 Rebase.pm: A Class Module

Here
is a very simple interface to the Rebase data contained in the
bionet file that is part of its distribution:

package Rebase;
#
# A simple class to provide access to restriction enzyme data from Rebase
#  including regular expression translations of recognition sites
#
use strict;
use warnings;
use Carp;
use DB_File;
# Class data and methods
{
# A hash of all attributes with default values
my %_attributes = (
_rebase      => { },
#    key   = restriction enzyme name
#    value = space-separated string of sites => regular expressions
_bionetfile  => '??',
_dbmfile     => '??',
_mode        => 0444,
);
# Return a list of all attributes
sub _all_attributes {
keys %_attributes;
}
# Return the value of an attribute
sub _attribute_value {
my($self,$attribute) = @_;
$_attributes{$attribute};
}
}

Notice that the opening block is considerably pared down, compared to
earlier classes. For instance, I've tossed the code
that keeps count of all objects. Why? Because it's
unlikely that more than one of these objects will be necessary in a
program: so why bother?

5.2.1 Attributes: Short and Sweet

Notice
that the list of attributes is short:

_rebase

A hash that will be populated to provide the lookup, with enzyme
names for keys, and recognition sites (and their translation to
regular expressions) for values. (Make sure you see how in the hash
%_attributes the value of the key
_rebase is itself an anonymous hash.)

_bionetfile

The name of the datafile from the Rebase
distribution. In my examples, I use the version numbered
bionet.212, and by the time you read this book,
more recent versions will be available (you can get
bionet.212 from this book's web
site).

_dbmfile

The DBM filename that resides on disk and stores the data in the hash
_rebase.[1]

[1] Recall that
DBM
files are tied to hashes in Perl and provide a simple,
easy-to-program database for key/value pairs. They serve as a way to
keep a hash on disk between invocations of a program, and so can help
you avoid the cost of recalculating a hash each time a program is
run. For more information, see
O'Reilly's Programming
Perl, the documentation for the DB_File
module, and the documentation for the dbmopen and
tie functions.

_mode

Contains the permissions with which you will create, or attempt to
read, the DBM file. This is important for security purposes.

With so few attributes, the class methods can easily handle each
attribute individually, without recourse to the use of
AUTOLOAD to define various accessors and mutators,
as seen in previous chapters.

5.2.2 Creating a Rebase Object

Here's how a
Rebase object is created and initialized:

# The constructor method
# Called from class, e.g. $obj = Rebase->new( dbmfile => 'DBMFILE' );
sub new {
my ($class, %arg) = @_;
# Create a new object
my $self = bless {  }, $class;
# DBM file must be given as "dbmfile" argument
unless($arg{dbmfile}) {
croak("No dbm file specified");
}
# Set the attributes for the provided arguments
foreach my $attribute ($self->_all_attributes(  )) {
# E.g. attribute = "_name",  argument = "name"
my($argument) = ($attribute =~ /^_(.*)/);
# Initialize to defaults
$self->{$attribute} = $self->_attribute_value($attribute);
# Override defaults with arguments
if (exists $arg{$argument}) {
if($argument eq 'rebase') {
croak "Cannot set attribute rebase";
}
$self->{$attribute} = $arg{$argument};
}
}
# Open or create the DBM file
unless(tie %{$self->{_rebase}}, 'DB_File', $arg{dbmfile}, O_RDWR|O_CREAT, $self->
{_mode}, $DB_HASH) {
my $permissions = sprintf "%lo", $self->{_mode};
croak "Cannot open DBM file $arg{dbmfile} with mode $permissions";
}
# If "bionetfile" argument given, calculate the hash from the bionet file
if($arg{bionetfile}) {
# Empty the hash
%{$self->{_rebase}} = (  );
# Recalculate the hash
$self->parse_rebase;
}
return $self;
}
# For this simple class I have no AUTOLOAD or DESTROY
# No get_rebase method, I don't want to pass around a huge hash
# No "set" mutators: all initialization done by way of "new" constructor

The new constructor method is short. It requires a
DBM database filename (existing or new), to which the hash data
structure _rebase is tied. If the DBM file
doesn't exist, the pathname of the
bionet Rebase file is also required
(that's where the data comes from that populates the
DBM file). If a bionet datafile is given, the
method calls the parse_rebase method that parses
the bionet file to create the
_rebase hash.

As the comments indicate, my class is so simple I've
even decided to do away with AUTOLOAD and
DESTROY, and I've dispensed with
the _set mutators as well.

5.2.3 Methods for the Rebase Class

Now,
let's continue by looking at the methods for the
Rebase class. Given an enzyme, the following two
methods,
get_recognition_sites and
get_regular_expressions, retrieve the
enzyme's recognition sites and the translations of
the recognition sites into regular expressions.

How do these two methods work? One method returns all recognition
sites for an enzyme as given in the Rebase database; the other
returns all the translations of the recognition sites into regular
expressions. They both work very similarly. First, the enzyme is
looked up in the _rebase hash. The value for each
enzyme in the _rebase hash is a space-separated
string that alternates recognition sites with their
regular-expression translations. In both methods, the space-separated
string is split to get a list of alternating
recognition sites and regular expressions. This list is then assigned
to the hash %sites to populate it with keys as
recognition sites (the data that's actually in the
Rebase bionet file) and values as regular
expressions. The Perl operators keys and
values are then used to generate the list of
recognition sites (keys) or regular expressions (values).

The get_bionetfile, get_dbmfile, and
get_mode methods just report on the arguments that
are set to specific filenames or mode strings when the object was
created:

sub get_regular_expressions {
my($self, $enzyme) = @_;
my(%sites) = split(' ', $self->{_rebase}{$enzyme});
# May have duplicate values
return values %sites;
}
sub get_recognition_sites {
my($self, $enzyme) = @_;
my(%sites) = split(' ', $self->{_rebase}{$enzyme});
return keys %sites;
}
sub get_bionetfile {
my($self) = @_;
return $self->{_bionetfile};
}
sub get_dbmfile {
my($self) = @_;
return $self->{_dbmfile};
}
sub get_mode {
my($self) = @_;
return $self->{_mode};
}

5.2.4 parse_rebase

The workhorse method of the class is
parse_rebase, which reads the
bionet Rebase datafile (with a suffix that
indicates the release version, such as
bionet.212). The bionet input
datafile begins like this:

REBASE version 212                                              bionet.212
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
REBASE, The Restriction Enzyme Database   http://rebase.neb.com
Copyright (c)  Dr. Richard J. Roberts, 2002.   All rights reserved.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rich Roberts                                                    Dec 01 2002
AaaI (XmaIII)                     C^GGCCG
AacI (BamHI)                      GGATCC
AaeI (BamHI)                      GGATCC
AagI (ClaI)                       AT^CGAT
AaqI (ApaLI)                      GTGCAC
AarI                              CACCTGCNNNN^
AarI                              ^NNNNNNNNGCAGGTG
AasI (DrdI)                       GACNNNN^NNGTC
AatI (StuI)                       AGG^CCT
AatII                             GACGT^C
AauI (Bsp1407I)                   T^GTACA

As you can see, the header information ends with a line containing
Rich Roberts. Apart from a
blank line or two, the rest of the file contains records, one per
line. Each record begins with a restriction enzyme name, optionally
followed by another enzyme name in parentheses. The last field of
each line is the recognition site. These are given using IUB codes
for nucleotides. (For the IUB codes, see the comments in the
program.)

They also contain cut sites, indicated by a caret symbol
^. Cut
sites
contribute very important information about a restriction enzyme;
they show where the enzyme makes the break when it cuts the DNA.
Among other things, they are needed to correctly perform restriction
digests in the computer when determining if there are overhangs that
will be useful when inserting vectors or otherwise reassembling the
fragments. However, in the code here we'll ignore
the cut sites taking as a goal the virtual fingerprinting of DNA by
just locating the recognition sites. Cut sites are omitted to
simplify the code. (See the exercises for more on handling cut
sites.)

sub parse_rebase {
my($self) = @_;
# handles multiple definition lines for an enzyme name
# also handles alternate enzyme names on a line
# Read in the bionet(Rebase) file
unless(open(BIONETFH, $self->{_bionetfile})) {
croak("Cannot open bionet file $self->{_bionetfile}");
}
while(<BIONETFH>) {
my @names = (  );
# Discard header lines
next if ( 1 .. /Rich Roberts/ );# discard all lines from the first line
# to the first line containing "Rich Roberts"
# Discard blank lines
next unless /\S/; # discard a line unless it contains something not
# whitespace
# Split the two (or three if includes parenthesized name) fields
my @fields = split;
# Get and store the recognition site
my $site = pop @fields;
# For the purposes of this exercise, I'll ignore cut sites (^).
# This is not something you'd want to do in general, however!
$site =~ s/\^//g;
# Get and store the name and the recognition site.
# Add alternate (parenthesized) names
# from the middle field, if any
foreach my $name (@fields) {
$name =~ tr/)(//d;  # delete parentheses
push @names, $name;
}
# Store the data into the hash, avoiding duplicates (ignoring ^ cut sites)
# and ignoring reverse complements
# Because these values are stored via DBM, I cannot use anything but
#  a scalar string to store the site/regularexpression pairs, space-separated
#  (but see the exercises)
foreach my $name (@names) {
# Add new enzyme definition
unless(exists $self->{_rebase}{$name}) {
$self->{_rebase}{$name} = "$site " . IUB_to_regexp($site);
next;
}
my(%defined_sites) = split(' ', $self->{_rebase}{$name});
# Omit already defined sites
if(exists $defined_sites{$site}) {
next;
# Omit reverse complements of already defined sites
}elsif(exists $defined_sites{revcomIUB($site)}) {
next;
# Add the additional site
}else{
$self->{_rebase}{$name}  .= " $site " . IUB_to_regexp($site);
}
}
}
return 1;
}

This subroutine is a bit complex, corresponding to the nature of the
data that it's processing. For instance, because
enzymes can appear on more than one line, it has to check if an
enzyme was already entered as a key in the hash.

Let me remind you of the range operator that is used here to skip
header lines:

 # Discard header lines
next if ( 1 .. /Rich Roberts/ );  # discard all lines from the first line
# to the first line containing "Rich Roberts"

The expression ( 1
.. /Rich
Roberts/ ) returns
true (and leads to the line being skipped) only
when the line being read is included in the range bordered by the
first line and the first line containing the regular expression
/Rich Roberts/. (See the
perlop section of the Perl manual for all the
details on the range operator.)

The parse_rebase subroutine, after skipping the
header and any blank lines, then processes each data line in a
while loop. Each line is split into either two
fields (name and recognition site) or three fields (name,
parenthesized alternate name, and recognition site). The name or
names are placed in the @names array and looped
through. In the last foreach loop, if the enzyme
name hasn't yet been defined in the DBM-tied hash,
it is added as a key. The value assigned to the key is a string with
recognition site followed by a translation of the recognition site to
a regular expression. The program passes to the
next name.

If the enzyme name has previously been entered as a key, the
previously entered recognition sites are examined, and if the new
site is there, the program passes to the next name. Similarly, if the
reverse complement of the site has been entered, the program passes
to the next name. But otherwise (if the enzyme name was entered, but
neither the site nor its reverse complement were in the list of sites
for that enzyme), the recognition site is added with its translation
to a regular expression.

This method has to handle reverse complements of recognition sites.
Many restriction enzymes are palindromic in the sense that their
reverse complements are the same. (For instance, the reverse
complement of GAATTC is GAATTC.) These biological palindromes
indicate that the enzyme can cut the site on both strands.

Although the code presented here ignores cut sites, the exercises
will ask you to reconsider them; note that if the cut site
isn't exactly in the middle, there will be
"sticky ends" of single stranded
DNA that make it possible to anneal the fragment with a complementary
sticky end. Refer to a standard molecular biology textbook for the
essential biology of restriction enzymes. (The logic used here to
handle reverse complements might not be ideal for all situations: see
the exercises.)

5.2.5 Methods to Translate Nucleotides to Regular Expressions

Finally, the remaining methods translate IUB-coded nucleotide
sequence data to Perl regular expressions. They also perform reverse
complementation on IUB-coded sequence data.

sub revcomIUB {
my($seq) = @_;
my $revcom = reverse complementIUB($seq);
return $revcom;
}
sub complementIUB {
my($seq) = @_;
(my $com = $seq) =~ tr [ACGTRYMKSWBDHVNacgtrymkswbdhvn]
[TGCAYRKMWSVHDBNtgcayrkmwsvhdbn];
return $com;
}
# Translate IUB ambiguity codes to regular expressions 
# IUB_to_regexp
#
# A subroutine that, given a sequence with IUB ambiguity codes,
# outputs a translation with IUB codes changed to regular expressions
#
# These are the IUB ambiguity codes
# (Eur. J. Biochem. 150: 1-5, 1985):
# R = G or A
# Y = C or T
# M = A or C
# K = G or T
# S = G or C
# W = A or T
# B = not A (C or G or T)
# D = not C (A or G or T)
# H = not G (A or C or T)
# V = not T (A or C or G)
# N = A or C or G or T 
sub IUB_to_regexp {
my($iub) = @_;
my $regular_expression = '';
my %iub2character_class = (
A => 'A',
C => 'C',
G => 'G',
T => 'T',
R => '[GA]',
Y => '[CT]',
M => '[AC]',
K => '[GT]',
S => '[GC]',
W => '[AT]',
B => '[CGT]',
D => '[AGT]',
H => '[ACT]',
V => '[ACG]',
N => '[ACGT]',
);
# Remove the ^ signs from the recognition sites
$iub =~ s/\^//g;
# Translate each character in the iub sequence
for ( my $i = 0 ; $i < length($iub) ; ++$i ) {
$regular_expression
.= $iub2character_class{substr($iub, $i, 1)};
}
return $regular_expression;
}
1;
=head1 Rebase
Rebase: A simple interface to recognition sites and translations of them into
regular expressions, from the Restriction Enzyme Database (Rebase)
=head1 Synopsis
use Rebase;
# Use "bionetfile" to create and populate dbm file
my $rebase = Rebase->new(
dbmfile => 'BIONET',
bionetfile => 'bionet.212',
mode => 0644
);
# Use without "bionetfile" to attach to existing dbm file
my $rebase = Rebase->new(
dbmfile => 'BIONET',
mode => 0444
);
my $enzyme = 'EcoRI';
print "Looking up restriction enzyme $enzyme\n";
my @sites = $rebase->get_recognition_sites($enzyme);
print "Sites are @sites\n";
my @res = $rebase->get_regular_expressions($enzyme);
print "Regular expressions are @res\n";
print "DBM file is ", $rebase->get_dbmfile, "\n";
print "Rebase bionet file is ", $rebase->get_bionetfile, "\n";
=head1 AUTHOR
James Tisdall
=head1 COPYRIGHT
Copyright (c) 2003, James Tisdall
=cut

5.2.6 Testing the Module

Ending
the module, as usual, is some POD documentation for the module.
Recall that you can view the output of this documentation in various
ways, as HTML on a web page, as PostScript, etc. However, the
simplest way is to say the following at the command line:

perldoc Rebase.pm

Let's try running the sample code given in the
documentation. Notice that the Rebase.pm module is
available, as is the bionet.212 file from the
Rebase distribution. Also, notice there are two alternate calls to
Rebase->new, so you should comment out the
first one, then the other, in tests. Save the sample code from the
documentation in a file called testRebase, and
when you run it with the command:

perl testRebase

you get the following output:

Looking up restriction enzyme EcoRI
Sites are GAATTC
Regular expressions are GAATTC
DBM file is BIONET
Rebase bionet file is bionet.212

Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی