Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

6.7 Perl DBI and DBD Interface Modules

SQL
is a fairly simple and easy-to-learn
language, considered by most to be well-tailored to its task.
However, there are things available in most programming languages,
such as control flow (while,
for, foreach) and conditional
branches (if-else) that aren't
provided in most implementations of the language. The lack of these
abilities severely restricts the use of SQL as a standalone language.

Most applications that use a relational database are written in
another language such as Perl. Perl provides a link between the
application, the programmer, the user, the files, the web server, and
so on. Perl also provides the program logic. The interaction between
the application and the database is typically to execute some
database commands, such as fetching data from the database and
processing it using Perl's capabilities. The logic
of the program may depend on the data found in the database, but it
is Perl, not SQL, that provides this logic (for the most part).

In Perl, a set of modules have been written that allow interaction
with relational databases. The DataBase Independent (DBI) module
handles most of the interaction from the program code; the DataBase Dependent
(or DataBase Driver) (DBD) modules, different for each particular
DBMS, handle communicating with the DBMS.

6.7.1 Installing and Configuring Perl DBI and DBD Modules

To
use a MySQL database from Perl,
you need first to have installed and properly configured MySQL. This
is not a Perl job, but a database administration job; you have to get
MySQL and install it on your system and set up the appropriate user
accounts and permissions.

You have to then install the Perl DBI module (http://www.symbolstone.org/technology/perl/DBI)
using CPAN from the command line:

perl -MCPAN -e shell;

Then type:

install DBI

You can also install it by downloading the module from CPAN, for
example, via a web browser and following the QUICK
START GUIDE instructions shown
here:

QUICK START GUIDE:
The DBI requires one or more 'driver' modules to talk to databases.
Check that a DBD::* module exists for the database you wish to use.
Read the DBI README then Build/test/install the DBI by doing
perl Makefile.PL
make
make test
make install
Then delete the source directory tree since it's no longer needed.
Use the 'perldoc DBI' command to read the DBI documentation.
Fetch the DBD::* driver module you wish to use and unpack it.
http://search.cpan.org/ (or www.activestate.com if on Windows)
It is often important to read the driver README file carefully.
Generally the build/test/install/delete sequence is the same
as for the DBI module.
The DBI.pm file contains the DBI specification and other documentation.
PLEASE READ IT. It'll save you asking questions on the mailing list
which you will be told are already answered in the documentation.
For more information and to keep informed about progress you can join
the a mailing list via mailto:dbi-users-help@perl.org
To help you make the best use of the dbi-users mailing list,
and any other lists or forums you may use, I strongly
recommend that you read "How To Ask Questions The Smart Way"
by Eric Raymond:
http://www.tuxedo.org/~esr/faqs/smart-questionsl
Much useful information and online archives of the mailing lists can be
found at http://dbi.perl.org/
See also http://search.cpan.org/

Finally, you have to install the Perl DBD driver for MySQL, called
DBD::MySQL.
Look in CPAN at http://cpan.org/modules/by-module/DBD/ for
the latest version; at the time of writing, it's
http://cpan.org/modules/by-module/DBD/DBD-mysql-2.1026.tar.gz.

The combination of MySQL (the DBMS), DBD (the particular driver for
your DBMS), and DBI (the Perl interface to the DBI and DBMS), is what
gives the actual connection from Perl to the database and enables you
to send SQL statements to the database and retrieve results.

Getting these components installed is sometimes the most difficult
part of getting involved with database programming. Installing and
configuring MySQL has several steps, and if you are very new to
computers, you may find some of the instructions difficult to follow,
as they may assume that you know more about your computer system than
you do. DBI and DBD are typically much easier to install, but you may
run into snags with them as well. The help of experienced hands,
either directly or by means of the type of mailing list mentioned in
the QUICK START
GUIDE, can make the difference between days of
frustration and a successful installation.

6.7.2 Handling Tab-Delimited Input Files

Let's say you
have the components installed (MySQL, Perl, DBD, DBI), and you want
to write a program that talks to the database. I'll
assume you've implemented a new version of the
homologs database as shown. We'll
now walk through a small Perl example that shows how to read data in
from a file, populate a database, send queries, and retrieve results.

First, here is the data as you might find it in a file. All the
whitespace between the words is the tab character in the file, not
space characters:

TABLE        ORGANISM
OrgId        Organism
1            human
2            worm
3            mouse
TABLE      GENES
GeneId     Gene        Date
118        aging       1984-07-13
9223       wrinkle     1987-08-15
273        hairy       1990-09-30
TABLE    VARIANTS
VarId    OrgId    GeneId
1        1        118
2        2        118
3        1        9223
4        3        9223
5        3        273

There are several ways to find the data in a database.
It's common to have it in a plain file that has
tables represented by lines, one table row on each line, with the
field values separated by tabs or some other character that
doesn't appear in any of the values of any field.

You should bear in mind that this is just one of several
possibilities for the source and format of your input data. See the
interesting book Data Munging with Perl, by
David Cross (Manning) for lots of useful lore about getting data in
and out of various sources.

SQL itself provides a utility
for this purpose, called load, which assumes
that you have a file consisting of only rows of data. You can specify
what columns to load, what delimiter the file uses (tab by default),
and a few other options. Its performance is optimized, and it is much
faster than executing several SQL insert statements. However, you still need
to read data in from files in different formats: what better than
your own program that you can alter to suit any occasion?

To start developing such a utility, here is a short program to
populate the database. It reads the file and knows the table
it's reading, the field names, and the data for each
row:

#!/usr/bin/perl
use strict;
use warnings;
my $flag = 0;
my $table;
my @table;
my @fieldnames;
my @fields;
while(<>) {
if(/^\s*$/) {
# skip blank lines
;
}elsif(/^TABLE\t(\w+)/) {
# output previous table
print(@table) if $flag;
$flag = 1;
# begin new table
@table = (  );
$table = $1;
push(@table, "\nTable is $table\n");
} elsif($flag =  = 1) {
@fieldnames = split;
$flag = 2;
push(@table, "Fields are ", join("|", @fieldnames), "\n");
} elsif($flag =  = 2) {
@fields = split;
push(@table, join("|", @fields) . "\n");
}
}
# output last table
print @table;

This program understands the file format I gave previously, reads it
in, and then reformats it and prints it out. It's
just an example of how you might read in data. In the following,
I'll modify this program to read in the file, but
instead of printing out the (reformatted) tables, it sends SQL
commands to the MySQL database to insert the data into the
appropriate tables.

As you see, this first version of the program uses the
$flag
variable to keep track of what it's reading. Every
time the input line begins with TABLE\t (that \t
is a tab that actually shows up as whitespace), the program outputs
the previously read table (if $flag indicates
there was one). It then saves the next word as the
table's name, sets the $flag to
1, and prepares some output in the array @table.

Otherwise, if the $flag variable is set to 1, the
program knows it's on the second line of a table
(remember, this program is specially written for the input file
format I gave previously). In this case, it saves the names of the
fields in an array, and then reformats them and adds them to the
@table output array.

Finally, if the $flag variable is set to 2, the
program knows it's reading rows of the table; it
reformats them and adds them to the @table output
array.

When all the input is done, and the while loop
finishes, there will be the last table's reformatted
output ready to be printed from the @table output
array.

If I call this program homologs.getdata and give it my data file
homologs.tabs like so:

% perl homologs.getdata homologs.tabs

I get the following output:

Table is ORGANISM
Fields are OrgId|Organism
1|human
2|worm
3|mouse
Table is GENES
Fields are GeneId|Gene|Date
118|aging|1984-07-13
9223|wrinkle|1987-08-15
273|hairy|1990-09-30
Table is VARIANTS
Fields are VarId|OrgId|GeneId
1|1|118
2|2|118
3|1|9223
4|3|9223
5|3|273

Notice that all I've really done here is read in the
data and print it out in a slightly different format; among other
things, I've changed the delimiter between fields
from a tab to a vertical bar, a common type of task with these
database dumps. But now, let's see how to interact
with an actual database.

6.7.3 DBI Examples

Let's
take the homologs.getdata program from the
previous section and add the DBI calls to the MySQL database that
will populate the MySQL database with the read-in data.

6.7.3.1 homologs.tabs

For starters, let's
just see a Perl program that connects to the database, asks a simple
question ("What tables are in this
database?"), displays the results, and disconnects:

#!/usr/bin/perl
use strict;
use warnings;
# Make connection with MySQL database
use DBI;
my $database = 'homologs';
my $server   = 'localhost';
my $user     = 'tisdall';
my $passwd   = 'NOTmyPASSWORD';
my $homologs = DBI->connect("dbi:mysql:$database:$server", $user, $passwd);
# prepare an SQL statement
my $query    = "show tables";
my $sql      = $homologs->prepare($query);
# execute an SQL statement
$sql->execute(  );
# retrieve and print results
while (my $row = $sql->fetchrow_arrayref) {
print join("\t", @$row), "\n";
}
# Break connection with MySQL database
$homologs->disconnect;
exit;

Here's the result of running this program:

GENES
ORGANISM
VARIANTS

This program does the basic tasks that all DBI programs have to do,
so let's examine them in useful detail.

After the obligatory use DBI; that loads the DBI module, I declare some variables to
hold the string that specifies to DBI who is connecting to what and
where. The actual connection happens here:

my $homologs = DBI->connect("dbi:mysql:$database:$server", $user, $passwd);

The connect method is asked to connect to the
particular database (in this case the homologs
database) on the local computer (localhost; this
can be replaced by a URL of another computer) as the user with the
given password. mysql is specified; if you are
using another DBMS, you should change this to reflect your database
system. Other optional arguments, not used here, are possible (see
the documentation). The connect method is the DBI
new method that creates (and initializes) a DBI
object.

The output of the connect call is saved as a new
DBI object in the reference variable $homologs.

Next, the DBI object prepares an SQL statement, and the result is
saved as a new statement object, which I call $sql
here. This statement object $sql then calls its
own execute method which actually does the job of
sending the SQL to the database.

The actual SQL here is a very simple one. You've
already seen that DBI->connect specifies the
particular database on your MySQL (homologs in
this case). This SQL statement show tables simply
asks for a list of the names of the tables that are defined in that
database.

After execute, the program retrieves the results
of executing the SQL statement. There are several ways to retrieve
results. Here, the statement object method
fetchrow_arrayref is called in a loop to fetch all
the rows of the result; at each pass through the loop, the return
value in $row points to the array of fields in
that row. Here, I simply separate the fields with tab characters with
the help of the Perl join function on the
dereferenced array @$row and print the row with a
newline.

The last DBI call is to
disconnect from the database. This is actually an
important call to make; depending on your implementation and how
you're running your program, it is sometimes
possible to open a number of connections and eventually tax the MySQL
DBMS to the point where it has to refuse any more
connect requests. Especially if you have an active
database with regular queries coming in, you want to
disconnect as soon as possible from each
connect.

The program you've just seen is the kind of sample
program you should run when you first try to install and use the DBI
module. If it works, you're in business. If not, you
have to closely examine the error messages you get to identify where
the problem is. One thing that can help is to add an additional
argument to the connect call that asks for more
error reporting, like so:

my $homologs = DBI->connect(
"dbi:mysql:$database:$server", $user, $passwd, {RaiseError=>1}
);

This terminates the program with extra error messages if the
connect call fails; this is usually where things
go wrong when you first try to use this software. (See the
documentation for other such options to connect.)
Remember that you need a username and password for MySQL itself, and
that these aren't related in any way to the username
and password on your computer outside of MySQL. You also need to have
the database defined (e.g., the SQL statement
create database
life creates a database called
life), and you have to have your MySQL permissions
set properly to allow you to do the things you want to do, such as
create or modify databases, or see other databases on the system. If
there's a problem, ask your database administrator,
consult your MySQL documentation, or visit the MySQL web
site.

6.7.3.2 homologs.load

This next program does a little more than
the previous; it reads in the tab-delimited file
homologs.tabs, extracts the data, and uses it to
load the tables in a MySQL database. Read it over: most of it will be
familiar from previous programs in this chapter.

#!/usr/bin/perl
use strict;
use warnings;
# Make connection with MySQL database
use DBI;
my $database = 'homologs';
my $server   = 'localhost';
my $user     = 'tisdall';
my $passwd   = 'NOTmyPASSWORD';
my $homologs = DBI->connect("dbi:mysql:$database:$server", $user, $passwd);
my $sqlinit  = $homologs->prepare("show tables");
$sqlinit->execute(  );
while (my $row = $sqlinit->fetchrow_arrayref) {
print join("\t", @$row), "\n";
}
my $flag = 0;
my $table;
my @tables;
my $sql;
while(<>) {
# skip blank lines
if(/^\s*$/) {
next;
# begin new table
}elsif(/^TABLE\t(\w+)/) {
$flag = 1;
$table = $1;
push(@tables, $table);
# Delete all rows in database table
my $droprows = $homologs->prepare("delete from $table");
$droprows->execute(  );
# get fieldnames, prepare SQL statement
} elsif($flag =  = 1) {
$flag = 2;
my @fieldnames = split;
my $query = "insert into  $table ("
. join(",", @fieldnames)
. ") values ("
. "?, " x (@fieldnames-1)
. "?)";
$sql = $homologs->prepare($query);
# get row, execute SQL statement
} elsif($flag =  = 2) {
my @fields = split;
$sql->execute( @fields);
}
}
# Check if tables were updated
foreach my $table (@tables) {
my $query = "select * from $table";
my $sql = $homologs->prepare($query);
$sql->execute(  );
while (my $row = $sql->fetchrow_arrayref) {
print join("\t", @$row), "\n";
}
}
# Break connection with MySQL database
$homologs->disconnect;
exit;

This program is called by giving it the name of the tab-delimited
file on the command line (the same file used previously with the
homologs.getdata program):

% perl homologs.load homologs.tabs

This is the output of the program:

GENES
ORGANISM
VARIANTS
Table: ORGANISM
1        human
2        worm
3        mouse
Table: GENES
118        aging        1984-07-13
9223       wrinkle     1987-08-15
273        hairy        1990-09-30
Table: VARIANTS
1        1        118
2        2        118
3        1        9223
4        3        9223
5        3        273

This homologs.load program is very much like the
previous homologs.getdata program. However,
instead of building an output array @tables and
printing the text to the screen, homologs.load
puts the data into the MySQL database using SQL statements. When the
database is loaded, it retrieves the data from the tables and prints
it to the screen.

Notice that each time homologs.load finds a new
table, it first empties all rows of that table in the database and
then proceeds to read in the lines of data and insert new rows into
the table.

Notice also that when the program reads the line of the input file
that names the fields (when $flag equals 1), it
prepares the SQL statement, saving the object in the
$sql variable. Then, when the actual lines of data
are read (when $flag equals 2), the values of the
fields are passed to the execute command, which
sends the SQL command by the $sql object to the
database system.

6.7.3.3 An SQL query

The
SQL statement that is the argument to
the prepare method is built up from information
that the program knows at that point. Here it is again:

my $query = "insert into  $table ("
. join(",", @fieldnames)
. ") values ("
. "?, " x (@fieldnames-1)
. "?)";

This is a bit hard to read at first sight. However, it is typical of
what happens when you use one language (Perl) to make a statement in
another language (SQL). So I'll explain this one
carefully as a good example of the breed.

The SQL query is formed by five strings that are joined by the
dot (.) string operatorrecall
that "r" .
"DNA" has the value "rDNA".
Here are the five strings being joined:

"insert into  $table ("
join(",", @fieldnames)
") values ("
"?, " x (@fieldnames-1)
"?)"

The question marks in the SQL statement are bind
variables that are passed the values from the
@fields array when the execute
statement is called with:

$sql->execute( @fields );

If the query is called with:

 $table = EXONS

and:

@fieldnames = (Exon, Position)

the resulting SQL statement is:

"insert into  EXONS (Exon,Position) values (?, ?)"

How does this statement get constructed? Let's look
at it in detail:

The first string just interpolates the table name EXONS into the
string.

The second string joins the field names with commas.

The third string appears as is.

The fourth string uses the Perl
x string operator to make a new string
that has a certain number of copies of the original string
"?, ". The desired number of
copies is specified on the right side of the x
operator. The string itself is on the left side of the
x operator, "?,
".

@fieldnames in a scalar context returns the number
of elements of the @fieldnames array; I need one
less "?, " than that plus an
additional question mark without a comma (because no commas are
allowed after the last item in the list in SQL). So,
@fieldnames-1 is the desired number of copies of
the string "?, ".

The fifth string "?)" is just the last question
mark and the closing parenthesis.

The final result is:

insert into EXONS (Exon,Position) values (?, ?)

As the Perl program reads in the data rows from the input file (when
$flag equals 2), the values are placed in the
array @fields and then passed as variables (to
take the place of the question marks in the SQL statement
that's been prepared) to the
execute method, which sends the SQL statement to
the database system. These question marks are the bind variables;
here, I need one for each field, because I'll pass
in the field values when execute is eventually
called on this statement.

To check what actually happened to the database after the reading in
and processing of the file is complete, the program sends an SQL
query for each table to see what's in it. This is
done with the SQL select commanda general-purpose command
to get information out of a database that has a great many options.

Here, I'm asking to see all fields (*) from the
database table, and because no restrictions are added, SQL shows us
all the fields of all the rows. As before, the result of this SQL
query is read using the fetchrow_arrayref DBI
method in a while loop, and each resulting row is
printed with tab-separated fields.

This last program homologs.load is a typical DBI
program, interacting with the world through Perl (e.g., reading in
files and displaying the results to the user) and also interacting
with the database through the Perl DBI module and SQL
statements.

Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی