Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

6.8 A Rebase Database Implementation

In
earlier chapters, I developed an object-oriented interface to Rebase
that stores the data in a simple DBM hash database, and in this
chapter I showed how to interact with a MySQL relational database
management system.

Now, let's put the two things together.

In this section, I'll take the
Rebase.pm module that implements the
Rebase class and modify it to use a relational
database instead of Perl's simple hash-based DBM
database. I'll call the resulting class
RebaseDB.

For a small problem like this, either approach works pretty well. In
fact, on my computer, the DBM approach is considerably faster than
the MySQL version.

The relational database implementation has a slower response due to
the more complicated DBMS system involved and increased overhead in
programming because a greater number of programming statements must
be written. However, the reason for using it is probably clear: the
relational database provides a lot more flexibility in making
queries, organizing the data, and, especially, expanding the
application to handle a greater variety of data.

This flexibility is very important. The Rebase database from
http://www.neb.com/rebase
includes several more datafiles, such as where to get the enzymes. It
wouldn't be difficult to add a table or tables to
store that information in the relational version of my Perl program;
it would be a major pain to keep adding new DBM hashes for various
complex relationships. So, I want a relational database because it
has good scalability as the database application
grows.

6.8.1 RebaseDB Class: Accessing Restriction Enzyme Data

Following is the code for a new class, RebaseDB,
which aims to provide the same functionality as the previous class
Rebase, but with a relational database instead of
a DBM hash data storage.

In the interest of space, the code for the following subroutines
isn't reproduced here; look for them in Chapter 5: revcomIUB,
complementIUB, and
IUB_to_regexp. They are, of course, included in
the RebaseDB.pm module available for downloading
from this book's web page.

package RebaseDB;
#
# A simple class to provide access to restriction enzyme data from Rebase
#  including regular expression translations of recognition sites
# Data is stored in a MySQL database
#
use strict;
use warnings;
use DBI;
use Carp;
# Class data and methods
{
# A hash of all attributes with default values
my %_attributes = (
_rebase      => { },  # unused in this implementation
#    key   = restriction enzyme name
#    value = pairs of sites and regular expressions
_mysql       => '??', #  e.g. mysql => 'rebase:localhost',
_dbh         => '',   # database handle from DBI->connect
_bionetfile  => '??', # source of data from e.g. "bionet.212" file
);
# Return a list of all attributes
sub _all_attributes {
keys %_attributes;
}
}
# The constructor method
# Called from class, e.g. $obj = Rebase->new( mysql => 'localhost:rebase' );
sub new {
my ($class, %arg) = @_;
# Create a new object
my $self = bless {  }, $class;
# Set the attributes for the provided arguments
foreach my $attribute ($self->_all_attributes(  )) {
# E.g. attribute = "_name",  argument = "name"
my($argument) = ($attribute =~ /^_(.*)/);
if (exists $arg{$argument}) {
if($argument eq 'rebase') {
croak "Cannot set attribute rebase";
}
$self->{$attribute} = $arg{$argument};
}
}
# MySQL host:database string must be given as "mysql" argument
unless($arg{mysql}) {
croak("No MySQL host:database specified");
}
# Connect to the Rebase database
my $user = 'tisdall';
my $passwd = 'NOTmyPASSWORD';
my $dbh;
unless($dbh = DBI->connect("dbi:mysql:$arg{mysql}", $user, $passwd)) {
carp "Cannot connect to MySQL database at $arg{dbmfile}";
return;
}
$self->setDBhandle($dbh);
# If "bionetfile" argument given, populate the database from the bionet file
if($arg{bionetfile}) {
$self->parse_rebase(  );
}
return $self;
}
# For this simple class I have no AUTOLOAD or DESTROY
# No "set" mutators: all initialization done by way of "new" constructor
sub get_regular_expressions {
my($self, $enzyme) = @_;
my $dbh = $self->getDBhandle;
my $sth = $dbh->prepare(
'select Regex from REGEXES, ENZYMES where
ENZYMES.EnzId = REGEXES.EnzId and ENZYMES.Enzyme=?'
);
$sth->execute($enzyme);
my @regexes;
while( my $row = $sth->fetchrow_arrayref) {
push(@regexes, $$row[0]);
}
return @regexes;
}
sub getDBhandle {
my($self) = @_;
return $self->{_dbh};
}
sub setDBhandle {
my($self, $dbh) = @_;
return $self->{_dbh} = $dbh;
}
sub get_recognition_sites {
my($self, $enzyme) = @_;
my $dbh = $self->getDBhandle;
my $sth = $dbh->prepare(
'select Site from SITES, ENZYMES
where ENZYMES.EnzId = SITES.EnzId and ENZYMES.Enzyme=?'
);
$sth->execute($enzyme);
my @sites;
while( my $row = $sth->fetchrow_arrayref) {
push(@sites, $$row[0]);
}
return @sites;
}
sub get_bionetfile {
my($self) = @_;
return $self->{_bionetfile};
}
sub parse_rebase {
my($self) = @_;
# handles multiple definition lines for an enzyme name
# also handles alternate enzyme names on a line
# Get database handle
my $dbh = $self->getDBhandle(  );
# Delete existing tables, recreate them
# Prepare statement handles with "bind" variables and autoincrement
# ENZYMES table
my $drop = $dbh->prepare('drop table if exists ENZYMES');
$drop->execute(  );
my $create = $dbh->prepare(
"CREATE TABLE ENZYMES ( EnzId int(11) NOT NULL auto_increment default '0',
Enzyme varchar(255) NOT NULL default '', PRIMARY KEY  (EnzId)) TYPE=MyISAM"
);
$create->execute(  );
# Prepare filehandles outside of "while" loop
my $enzymes_select = $dbh->prepare(
'select EnzId from ENZYMES where Enzyme=?'
);
my $enzymes_insert =  $dbh->prepare(
'insert ENZYMES ( EnzId, Enzyme ) values ( NULL, ? )'
); 
# SITES table
$drop = $dbh->prepare('drop table if exists SITES');
$drop->execute(  );
$create = $dbh->prepare(
"CREATE TABLE SITES ( SiteId int(11) NOT NULL auto_increment default '0',
EnzId int(11) NOT NULL default '0', Site varchar(255) NOT NULL default '',
PRIMARY KEY  (SiteId)) TYPE=MyISAM"
);
$create->execute(  );
# Prepare filehandles outside of "while" loop
my $sites_insert = $dbh->prepare(
'insert SITES ( SiteId, EnzId, Site ) values ( NULL, ?, ? )'
);
my $sites_select = $dbh->prepare(
'select EnzId, Site from SITES where EnzId=? and Site=?'
);
my $sitesrevcom_select = $dbh->prepare(
'select EnzId, Site from SITES where EnzId=? and Site=?'
);
# REGEXES table
$drop = $dbh->prepare('drop table if exists REGEXES');
$drop->execute(  );
$create = $dbh->prepare(
"CREATE TABLE REGEXES ( RegexId int(11) NOT NULL auto_increment default '0',
EnzId int(11) NOT NULL default '0', Regex varchar(255) NOT NULL default '',
PRIMARY KEY  (RegexId)) TYPE=MyISAM"
);
$create->execute(  );
# Prepare filehandles outside of "while" loop
my $regexes_insert = $dbh->prepare(
'insert REGEXES ( RegexId, EnzId, Regex ) values ( NULL, ?, ? )'
);
my $lastid =  $dbh->prepare('select LAST_INSERT_ID(  ) as pk');
# Read in the bionet(Rebase) file
unless(open(BIONETFH, $self->get_bionetfile)) {
croak("Cannot open bionet file " . $self->get_bionetfile);
}
while(<BIONETFH>) {
my @names = (  );
# Discard header lines
( 1 .. /Rich Roberts/ ) and next;
# Discard blank lines
/^\s*$/ and next;
# Split the two (or three if includes parenthesized name) fields
my @fields = split( " ", $_);
# Get and store the recognition site
my $site = pop @fields;
# For the purposes of this exercise, I'll ignore cut sites (^).
# This is not something you'd want to do in general, however!
$site =~ s/\^//g;
# Get and store the name and the recognition site.
# Add alternate (parenthesized) names
# from the middle field, if any
foreach my $name (@fields) {
if($name =~ /\(.*\)/) {
$name =~ s/\((.*)\)/$1/;
}
push @names, $name;
}
# Store the data, avoiding duplicates (ignoring ^ cut sites)
# and ignoring reverse complements
foreach my $name (@names) {
my $pk;
my $row;
# if enzyme exists
$enzymes_select->execute($name);
if($row = $enzymes_select->fetchrow_arrayref) {
# get its "pk"
$pk = $$row[0];
}else{
# Add new enzyme definition
$enzymes_insert->execute($name);
# Get last autoincremented primary id
$lastid->execute(  );
my $pkhash = $lastid->fetchrow_hashref;
$pk = $pkhash->{pk};
}
# if pk,site exist go to top of loop
$sites_select->execute($pk, $site);
if($row = $sites_select->fetchrow_arrayref) {
next;
}
# and if pk,revcomIUB(site) exist go to top of loop
$sitesrevcom_select->execute($pk, revcomIUB($site));
if($row = $sitesrevcom_select->fetchrow_arrayref) {
next;
}
# Add new site definition
#  since neither pk,site nor
#  pk,revcomIUB(site) exists.
$sites_insert->execute($pk, $site);
# Add new regex definition
$regexes_insert->execute($pk, IUB_to_regexp($site));
}
}
return 1;
}
1;
=head1 RebaseDB
Rebase: A simple interface to recognition sites and translations of them into
regular expressions, from the Restriction Enzyme Database (Rebase)
=head1 Synopsis
use RebaseDB;
my $rebase = RebaseDB->new(
mysql => 'rebase:localhost',
bionetfile => 'bionet.212'
);
my $enzyme = 'EcoRI';
print "Looking up restriction enzyme $enzyme\n";
my @sites = $rebase->get_recognition_sites($enzyme);
print "Sites are @sites\n";
my @res = $rebase->get_regular_expressions($enzyme);
print "Regular expressions are @res\n";
my $enzyme = 'HindIII';
print "Looking up restriction enzyme $enzyme\n";
my @sites = $rebase->get_recognition_sites($enzyme);
print "Sites are @sites\n";
my @res = $rebase->get_regular_expressions($enzyme);
print "Regular expressions are @res\n";
print "Rebase bionet file is ", $rebase->get_bionetfile, "\n";
=head1 AUTHOR
James Tisdall
=head1 COPYRIGHT
Copyright (c) 2003, James Tisdall
=cut

6.8.2 testRebaseDB: A Testing Program

The following test program
testRebaseDB is taken from the
RebaseDB.pm documentation and slightly altered.

    use lib "/home/tisdall/MasteringPerlBio/development/lib";
use RebaseDB;
my $rebase = RebaseDB->new(
mysql => 'rebase:localhost',
bionetfile => 'bionet.212'
);
my $enzyme = 'EcoRI';
print "Looking up restriction enzyme $enzyme\n";
my @sites = $rebase->get_recognition_sites($enzyme);
print "Sites are @sites\n";
my @res = $rebase->get_regular_expressions($enzyme);
print "Regular expressions are @res\n";
my $enzyme = 'HindIII';
print "Looking up restriction enzyme $enzyme\n";
my @sites = $rebase->get_recognition_sites($enzyme);
print "Sites are @sites\n";
my @res = $rebase->get_regular_expressions($enzyme);
print "Regular expressions are @res\n";
print "Rebase bionet file is ", $rebase->get_bionetfile, "\n";

Here's the output of testRebaseDB:

Looking up restriction enzyme EcoRI
Sites are GAATTC
Regular expressions are GAATTC
Looking up restriction enzyme HindIII
Sites are AAGCTT
Regular expressions are AAGCTT
Rebase bionet file is bionet.212

6.8.3 Analyzing RebaseDB

Let's take a walk through this
RebaseDB.pm
class module to see how it uses the DBI relational database
interface.

For starters, the %_attributes hash has changed to
reflect the new relational database method. The
_rebase key is no longer needed, but the
bionetfile is again used to load the database; in
the absence of this argument, the program attempts to use a
previously loaded database.

Two new argument keys are present.
_mysql gives the database name
(e.g., rebase) and the user's
computer (localhost if the user is running the
program on the same computer the database is served from).
_dbh holds the DBI object returned
from the DBI->connect call.

The new constructor method has changed in
significant ways. It checks the arguments a little bit differently,
of course, because this version of the class has different arguments
coming in. It then actually connects to the MySQL database using the
DBI calls seen previously in this chapter. In case of failure, it
doesn't carp or die; it prints an
error message and returnsa much better behavior than just
dying. If this were a web script, for instance, you might want to
keep running so you can return more input from the user in case of
failure.

In case of success, the DBI->connect object
reference is saved in the attribute reserved for this purpose,
$self->{_dbh}. (dbh stands
for "data base handle".) Elsewhere
in the module two methods are defined: the accessor method,
getDBhandle and the mutator method,
setDBhandle. They get and
set this MySQL database object handle in the class
attribute _dbh.

The program then calls the
parse_rebase program. Let's look
closely at this method and the workhorse that reads Rebase data in
from a file and populates the MySQL database with it.

After retrieving as $dbh the object that points to
the database, the method sends several SQL statements to the DBMS.
These statements delete each of the three tables in this MySQL
database called rebase, tossing out all their data
in the process. The calls then create the tables anew (and empty).

This section of the code also prepares SQL statements that insert new
rows with the values being passed in as bind
arguments and selects various sets of rows. These SQL statements are
used repeatedly in the while loop that follows,
and the program saves time by having the SQL statements prepared just
once before the loop.

I won't examine the various SQL statements in detail
here. However, I do want to point out the use of autoincrementing on
the ID field of each table. This option, applicable only to
single-field keys that serve as the primary key for a table, has the
effect of always picking the next value for the field, and is perfect
for the unique ID field I usually want in a table as the primary key.
It's just one less thing to worry about in the
module. An SQL statement is also prepared that uses an SQL function
to return the value of the last autoincremented key.

Now that the new database tables are fresh and empty, the program
opens the bionet file and prepares to read it
within a while loop. The beginning of this loop is
unchanged from the previous version Rebase.pm
because it discards the file header and blank lines and extracts the
names of the restriction enzymes and the associated recognition
sites.

First, the program checks to see if there is already a definition for
the enzyme entered in the database; if so, it just retrieves its
unique ID, the primary key $pk. If the enzyme
hasn't been entered, it is now, and the ID that is
automatically created for it is saved.

Next, the program checks if that primary key ID and site are already
paired in the SITES table; if so, it goes back to the top of the
while loop.

Again, the program checks if that primary key ID and the reverse
complement of the site are already paired in the SITES table; if so,
it goes back to the top of the while loop.

However, if the ID and the site are still unknown, they are entered
into the SITES table, and the ID and the regular expression generated
from the site are entered into the REGEXES table.

That wraps it up for the parse_rebase method.
Because of all the database interactions, this is a pretty slow bit
of code (see the exercises).

The only parts of the RebaseDB.pm module code we
haven't looked at are the two similar methods called
get_recognition_sites and
get_regular_expressions. They retrieve all
recognition sites (or regular expressions) associated with a given
enzyme. The SQL statement in each of these methods is an example of a
join of two tables:

select Regex from REGEXES, ENZYMES
where ENZYMES.EnzId = REGEXES.EnzId and ENZYMES.Enzyme=?;

(This is one SQL statement.) The statement asks for the regular
expressions from the REGEXES table that have the same EnzID as the
enzyme given in argument has in the ENZYMES table.

That's the end of my discussion of the
RebaseDB.pm class module. See the exercises for
suggestions on how to improve this module.

Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی