Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

Mastering Perl for Bioinformatics [Electronic resources] - نسخه متنی

| نمايش فراداده ، افزودن یک نقد و بررسی
افزودن به کتابخانه شخصی
ارسال به دوستان
جستجو در متن کتاب
بیشتر
تنظیمات قلم

فونت

اندازه قلم

+ - پیش فرض

حالت نمایش

روز نیمروز شب
جستجو در لغت نامه
بیشتر
لیست موضوعات
توضیحات
افزودن یادداشت جدید












9.3 Testing Bioperl



To check that things
were working okay with my new Bioperl installation, I first wrote a
little test to see if a Perl program could find the
Bio::Perl module:


use Bio::Perl;


I ran it by putting it in a file bp0.pl and giving
it to the Perl interpreter:


[tisdall]# perl bp0.pl
[tisdall]#


As you see, it didn't complain, which means Perl
found the Bio::Perl module. If it
can't find it, it will complain. When I copied the
bp0.pl program to a file called
bp0.pl.broken and changed the module call of
Bio::Perl to a call for the nonexistent module
Bio::Perrl, I got the following (slightly
truncated) error output:


[tisdall@coltrane development]$ perl bp0.pl.broken
Can't locate Bio/Perrl.pm in @INC
BEGIN failed--compilation aborted at bp0.pl.broken line 1.


9.3.1 Second Test



Now I knew that Bio::Perl
could be found and loaded. I next tried a couple of test programs
that are given in the bptutorial.pl document. I
found a link to that tutorial document on the Web from the
http://bioperl.org page.
Alternately, I could have opened a window and typed at the command
prompt:


perldoc bptutorial.pl


I went to the section near the beginning of the document called
"I.2 Quick getting started scripts"
and created a file tut1.pl on my computer by
pasting in the text of the first tutorial script:


  use Bio::Perl;
# this script will only work with an internet connection
# on the computer it is run on
$seq_object = get_sequence('swissprot',"ROA1_HUMAN");
write_sequence(">roa1.fasta",'fasta',$seq_object);


I then ran the program and looked at the output file:


[tisdall]$ perl tut1.pl
[tisdall]$ cat roa1.fasta
>ROA1_HUMAN Heterogeneous nuclear ribonucleoprotein A1 (Helix-destabilizing protein)
(Single-strand binding protein) (hnRNP core protein A1).
SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVT
YATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHL
RDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKAL
SKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSG
DGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDSYNNGGGRGFGGG
SGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSS
SSSSYGSGRRF
[tisdall]$


That seemed to work perfectly.



9.3.2 Third Test



I tried the next short script from the same
section of the tutorial, pasting it into a file called
tut2.pl:


[tisdall]$ cat tut2.pl
use Bio::Perl;
# this script will only work with an internet connection
# on the computer it is run on
$seq_object = get_sequence('swissprot',"ROA1_HUMAN");
# uses the default database - nr in this case
$blast_result = blast_sequence($seq);
write_blast(">roa1.blast",$blast_report);
[tisdall]$ perl tut2.pl
-------------------- WARNING ---------------------
MSG: req was POST http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
User-Agent: libwww-perl/5.69
Content-Length: 178
Content-Type: application/x-www-form-urlencoded
...
---------------------------------------------------
Submitted Blast for [blast-sequence-temp-id]
[tisdall]$ cat roa1.blast
[tisdall]$ ls -l roa1.blast
-rw-rw-r-- 1 tisdall tisdall 0 Apr 30 11:28 roa1.blast
[tisdall]$


Here, I experienced a problem. Running the program created a page of
error output, mostly formatted in HTML, which I've
truncated in the output. When I tried to cat the
output file, there was nothing in it, which I verified (on Linux) by
examining the file with ls -l,
which showed that it had 0 bytes in it.


This wasn't very encouraging; the second of the two
short example scripts was failing.


Looking at the two programs tut1.pl and
tut2.pl, you can see that the second is an
extension of the first; after getting the sequence, the second
program also tries to blast it, and this is where it is failing.
It's running (printing those ... dots took a while
because the program was waiting for a reply from NCBI over the
Internet), but it's not printing out the BLAST
report to the file roa1.blast.
What's going wrong?


I started my investigation by looking at the documentation for the
Bio::Perl module by typing:


perldoc Bio::Perl


and searching for the function that is failing,
blast_sequence. Here's what I
found:


blast_sequence
Title : blast_sequence
Usage : $blast_result = blast_sequence($seq)
$blast_result = blast_sequence('MFVEGGTFASEDDDSASAEDE');
Function: If the computer has Internet accessibility, blasts
the sequence using the NCBI BLAST server against nrdb.
It choose the flavour of BLAST on the basis of the sequence.
This function uses Bio::Tools::Run::RemoteBlast, which itself
use Bio::SearchIO - as soon as you want to more, check out
these modules
Returns : Bio::Search::Result::GenericResult.pm
Args : Either a string of protein letters or nucleotides, or a
Bio::Seq object


That last sentence about the Args gave me a clue.
I went back and looked at the failing program, which I placed in a
file called tut2.pl, and, sure enough, the
argument for the blast_sequence function,
$seq, is neither a string of sequence nor a
Bio::Seq object. In fact, it's
not even defined! Clearly, what the tutorial writer meant instead of
$seq, was $seq_object, which is
defined and populated by the earlier get_sequence
method call.


I was visited by a brainwave, and I decided to put a
use strict and a
use warnings into the program
and to declare variables with my to see what
ensued:


[tisdall]$ cat tut2.pl
use Bio::Perl;
use strict;
use warnings;
# this script will only work with an internet connection
# on the computer it is run on
my $seq_object = get_sequence('swissprot',"ROA1_HUMAN");
# uses the default database - nr in this case
my $blast_result = blast_sequence($seq);
write_blast(">roa1.blast",$blast_report);
[tisdall]$ perl tut2.pl
Global symbol "$seq" requires explicit package name at tut2.pl line 11.
Global symbol "$blast_report" requires explicit package name at tut2.pl line 13.
Execution of tut2.pl aborted due to compilation errors.
[tisdall]$


Well, that's pretty clear. The variables
$seq and $blast_report are
wrong; apparently, the author intended to reuse the variables
$seq_object and $blast_result
instead. So, I edited the file and ran it again:


[tisdall]$ cat tut2.pl
use Bio::Perl;
use strict;
use warnings;
# this script will only work with an internet connection
# on the computer it is run on
my $seq_object = get_sequence('swissprot',"ROA1_HUMAN");
# uses the default database - nr in this case
my $blast_result = blast_sequence($seq_object);
write_blast(">roa1.blast",$blast_result);
[tisdall]$ perl tut2.pl
[tisdall]$ perl tut2.pl
Submitted Blast for [ROA1_HUMAN] ...................
[tisdall]$ ls -l roa1.blast
-rw-rw-r-- 1 tisdall tisdall 56888 May 5 15:05 roa1.blast
[tisdall]$


Examining the roa1.blast file convinced me that
the program had successfully called blast. I
decided that was a success, albeit a qualified one: this was a spot
in the documentation that could use a little attention, clearly. By
the time you're reading this, it may well have been
fixed.



9.3.3 Fourth Test



Now, let me show you one more test of my new
Bioperl installation.


Looking at the Bio::Perl documentation, I found
the following example and discussion at the very beginning.


Bio::Perl(3)   User Contributed Perl Documentation   Bio::Perl(3)
NAME
Bio::Perl - Functional access to BioPerl for people who
don't know objects
SYNOPSIS
use Bio::Perl;
# will guess file format from extension
$seq_object = read_sequence($filename);
# forces genbank format
$seq_object = read_sequence($filename,'genbank');
# reads an array of sequences
@seq_object_array = read_all_sequences($filename,'fasta');
# sequences are Bio::Seq objects, so the following methods work
# for more info see L<Bio::Seq>, or do 'perldoc Bio/Seq.pm'
print "Sequence name is ",$seq_object->display_id,"\n";
print "Sequence acc is ",$seq_object->accession_number,"\n";
print "First 5 bases is ",$seq_object->subseq(1,5),"\n";
# get the whole sequence as a single string
$sequence_as_a_string = $seq_object->seq( );
# writing sequences
write_sequence(">$filename",'genbank',$seq_object);
write_sequence(">$filename",'genbank',@seq_object_array);
# making a new sequence from just strings you have
# from something else
$seq_object = new_sequence("ATTGGTTTGGGGACCCAATTTGTGTGTTATATGTA",
"myname","AL12232");
# getting a sequence from a database (assumes internet connection)
$seq_object = get_sequence('swissprot',"ROA1_HUMAN");
$seq_object = get_sequence('embl',"AI129902");
$seq_object = get_sequence('genbank',"AI129902");
# BLAST a sequence (assummes an internet connection)
$blast_report = blast_sequence($seq_object);
write_blast(">blast.out",$blast_report);
DESCRIPTION
Easy first time access to BioPerl via functions
FEEDBACK
Mailing Lists
User feedback is an integral part of the evolution of this
and other Bioperl modules. Send your comments and sugges-
tions preferably to one of the Bioperl mailing lists.
Your participation is much appreciated.
bioperl-l@bio.perl.org
Reporting Bugs
Report bugs to the Bioperl bug tracking system to help us
keep track the bugs and their resolution. Bug reports can
be submitted via email or the web:
bioperl-bugs@bio.perl.org
http://bugzilla.bioperl.org/
AUTHOR - Ewan Birney
Email bioperl-l@bio.perl.org
Describe contact details here
APPENDIX
The rest of the documentation details each of the object
methods. Internal methods are usually preceded with a _
...


I took the example code from the SYNOPSIS section
and put it in a file called bp1.pl. I noticed that
the code was not meant to be a running program. The variable
$filename refers to some sequence file in the
first line of code without being initialized, and then the same
variable name is used in several other lines for different purposes.
This is not an uncommon situation in such SYNOPSIS
sections; the point is to show how individual calls can be made to
methods in the class, not to demonstrate a complete working program
(although sometimes the code can be run exactly as is).


I had to make some changes to make this code a working, runnable
program. I declared the strict and
warnings pragmas and declared each variable with
my. I created variables for the different sequence
filenames I needed as both input and output, and I made sure that the
input sequence files were on disk and of the correct variety (for
example, I created the array.fasta file with three
FASTA headers and sequences one after the other). In the end I had
this code:


use Bio::Perl;
use strict;
use warnings;
my $gbfilename = 'AI129902.genbank';
# will guess file format from extension
my $seq_object0 = read_sequence($gbfilename);
# forces genbank format
my $seq_object1 = read_sequence($gbfilename,'genbank');
my $fastafilename = 'array.fasta';
# reads an array of sequences
my @seq_object_array = read_all_sequences($fastafilename,'fasta');
# sequences are Bio::Seq objects, so the following methods work
# for more info see L<Bio::Seq>, or do 'perldoc Bio/Seq.pm'
print "Sequence name is ",$seq_object1->display_id,"\n";
print "Sequence acc is ",$seq_object1->accession_number,"\n";
print "First 5 bases is ",$seq_object1->subseq(1,5),"\n";
# get the whole sequence as a single string
my $sequence_as_a_string = $seq_object1->seq( );
# writing sequences
my $gbfilenameout = 'bpout.genbank';
write_sequence(">$gbfilenameout",'genbank',$seq_object1);
write_sequence(">$gbfilenameout",'genbank',@seq_object_array);
# making a new sequence from just strings you have
# from something else
my $seq_object2 = new_sequence("ATTGGTTTGGGGACCCAATTTGTGTGTTATATGTA",
"myname","AL12232");
# getting a sequence from a database (assumes internet connection)
my $seq_object3 = get_sequence('swissprot',"ROA1_HUMAN");
my $seq_object4 = get_sequence('embl',"AI129902");
my $seq_object5 = get_sequence('genbank',"AI129902");
# BLAST a sequence (assummes an internet connection)
my $blast_report = blast_sequence($seq_object3);
write_blast(">blast.out",$blast_report);


As you see, I didn't check on the output of all the
method calls, but the real purpose of this test of my newly installed
Bioperl modules was just to see if all the modules and methods could
be found and would run when called.


As mentioned previously, I also added use
strict; and use
warnings; and declared all the variables with
my. Often, you don't see that
usage in this kind of documentation; it's not
required in Perl, and it may distract some readers of the
documentation from the main point of showing how the methods are
called. This is not an unusual omission to find in Perl class
documentation, but of course, it's recommended and
often very helpful, as you saw in the last section.


So, I ran my slightly edited version of the example code from the
beginning of the Bio::Perl manpage, with the
following results:


[tisdall]$ perl bp1.pl
Sequence name is AI129902
Sequence acc is AI129902
First 5 bases is CTCCG
Submitted Blast for [ROA1_HUMAN] .........
[tisdall]$


I looked at each of the input and output files and verified the
expected contents. The following output from a listing of the files
will give you an idea of what to expect (although you may get
different results by running the program with different input files
or database lookups):


-rw-rw-r--    1 tisdall  tisdall      2391 May  5 10:37 AI129902.genbank
-rw-rw-r-- 1 tisdall tisdall 2485 May 5 13:00 array.fasta
-rw-rw-r-- 1 tisdall tisdall 1513 May 5 13:02 bp1.pl
-rw-rw-r-- 1 tisdall tisdall 3653 May 5 13:02 bpout.genbank
-rw-rw-r-- 1 tisdall tisdall 56888 May 5 13:04 blast.out


Notice that at the end of the program I call the
blast_sequence method on the protein sequence
object $seq_object3. I discovered that trying to
run the blast_sequence method on a nucleotide
sequence object (such as $seq_object5) failed.
Although the documentation for the method said that the sequence type
would be examined and the appropriate BLAST program called (for
example, blastp for protein sequence and
blastx for nucleotide sequence, against the nr
nonredundant protein database), it always seemed to call
blastp no matter what the input sequence, and
therefore it failed unless called with protein sequence as I had
stored in $seq_object3. Perhaps this bug, a
disconnect between the code and the documentation, has been fixed by
the time you read this.



/ 156