Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

9.7 bptutorial.pl: sequence_manipulation Demo

In this section, I'll go through the code for the
demo subroutine sequence_manipulation that was
shown in the last section.

The subroutine is actually an anonymous subroutine; a reference to
the subroutine is saved in the scalar reference variable
$sequence_manipulation:

$sequence_manipulations = sub {
...
}

The first few lines of code declare some variables with
my. Notice that these are not being passed in as
arguments; this method uses no arguments but does occasionally use
global variables such as $dna_seq_file, which, as
you've just seen, contain the pathname of the input
sequence file the demo will use:

my ($infile, $in, $out, $seqobj);
$infile = $dna_seq_file;
print "\nBeginning sequence_manipulations and SeqIO example... \n";

The code is cross-referenced to the tutorial sections of the file.
The next comment line refers to the part of the document:

# III.3.1 Transforming sequence files (SeqIO)

which can be looked up in the table of contents to the document for
further reading:

III.3 Manipulating sequences
III.3.1 Manipulating sequence data with Seq methods (Seq)

Now, I'll take a look at the first section of
example code in the sequence_manipulations method:

# III.3.1 Transforming sequence files (SeqIO)
$in  = Bio::SeqIO->new('-file' => $infile ,'-format' => 'Fasta');
$seqobj = $in->next_seq(  );
# perl "tied filehandle" syntax is available to SeqIO,
# allowing you to use the standard <> and print operations
# to read and write sequence objects, eg:
#$out = Bio::SeqIO->newFh('-format' => 'EMBL');
$out = Bio::SeqIO->newFh('-format' => 'fasta');
print "First sequence in fasta format... \n";
print $out $seqobj;

The code starts with a call to the new object constructor of the
Bio::SeqIO class. The new method is being passed
the pathname to a FASTA file in $infile, and told
that the format is FASTA.

A quick look at the Bio::SeqIO documentation
explains that the call to Bio::SeqIO->new
returns a stream object for the specified format. So,
$out is a stream object (a stream is input or
output of data) for FASTA-formatted data, and $in
is a stream object for FASTA-formatted input from the file named in
the $infile variable. These $in
and $out objects are also filehandles.

After the $in object is initialized on the FASTA
file named in $infile, it calls the
next_seq method, which gets the next (in this
case, the first and perhaps only) FASTA record from the file, and it
creates a sequence object $seqobj. The output
$out object is created. The Perl
print statement is then called, using
$out as a filehandle, and printing
$seqobj. This prints the first FASTA record from
that file, as shown in the demo output in the last section and
repeated here:

First sequence in fasta format... 
>Test1
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC
TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC
ACAACATCCATGAAACGCATTAGCACCACC

The next part of the sequence_manipulation demo
code shows many of the methods for extracting information from a
sequence object:

# III.4 Manipulating individual sequences
# The following methods return strings
print "Seq object display id is ",
$seqobj->display_id(  ), "\n"; # the human read-able id of the sequence
print "Sequence is ",
$seqobj->seq(  )," \n";        # string of sequence
print "Sequence from 5 to 10 is ",
$seqobj->subseq(5,10)," \n"; # part of the sequence as a string
print "Acc num is ",
$seqobj->accession_number(  ), " \n"; # when there, the accession number
print "Moltype is ",
$seqobj->alphabet(  ), " \n";    # one of 'dna','rna','protein'
print "Primary id is ", $seqobj->primary_seq->primary_id(  )," \n";
# a unique id for this sequence irregardless
#print "Primary id is ", $seqobj->primary_id(  ), " \n";
# a unique id for this sequence irregardless
# of its display_id or accession number
# The following methods return an array of  Bio::SeqFeature objects
$seqobj->top_SeqFeatures; # The 'top level' sequence features
$seqobj->all_SeqFeatures; # All sequence features, including sub
# seq features
# The following methods returns new sequence objects,
# but do not transfer features across
# truncation from 5 to 10 as new object
print "Truncated Seq object sequence is ",
$seqobj->trunc(5,10)->seq(  ), " \n";
# reverse complements sequence
print "Reverse complemented sequence 5 to 10  is ",
$seqobj->trunc(5,10)->revcom->seq, "  \n";
# translation of the sequence
print "Translated sequence 6 to 15 is ",
$seqobj->translate->subseq(6,15), " \n";

This section of the tutorial produced the following output (slightly
reformatted to fit the pages of this book):

Seq object display id is Test1
Sequence is AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGA
TAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACC
AATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC 
Sequence from 5 to 10 is TTTCAT 
Acc num is unknown 
Moltype is dna 
Primary id is Test1 
Truncated Seq object sequence is TTTCAT 
Reverse complemented sequence 5 to 10  is ATGAAA  
Translated sequence 6 to 15 is LQRAICLCVD

This part of the demo code is self-explanatory. As you can see, the
methods return:

The FASTA ID (the part immediately following the > sign in the
first line)

The sequence alone as a string (reformatted with line breaks to fit
this book)

The accession number if it is known (FASTA format
won't have this, but Genbank format will, for
instance)

The type of molecule (dna, rna, or protein)

A unique ID compared to other IDs in the running program but drawn
from the file itself if possible

The sequence of a new truncated sequence object defined from the
original sequence object; a reverse complement

The peptide resulting from a translation of a specified part of a
sequence

The final section of the sequence_manipulation
demo demonstrates the ability to translate nucleotides into all six
reading frames using alternate codon translation tables if desired:

my $c = shift;
$c ||= 'ctgagaaaataa';
print "\nBeginning 3-frame and alternate codon translation example... \n";
my $seq = new Bio::PrimarySeq('-SEQ' => $c, '-ID' => 'no.One');
print "$c translated using method defaults   : ",
$seq->translate->seq, "\n";
# Bio::Seq uses same sequence methods as PrimarySeq
my $seq2 = new Bio::Seq('-SEQ' => $c, '-ID' => 'no.Two');
print "$c translated as a coding region (CDS): ",
$seq2->translate(undef, undef, undef, undef, 1)->seq, "\n";
print "\nTranslating in all six frames:\n";
my @frames = (0, 1, 2);
foreach my $frame (@frames) {
print  " frame: ", $frame, " forward: ",
$seq->translate(undef, undef, $frame)->seq, "\n";
print  " frame: ", $frame, " reverse-complement: ",
$seq->revcom->translate(undef, undef, $frame)->seq, "\n";
}
print "Translating with all codon tables using method defaults:\n";
my @codontables = qw( 1 2 3 4 5 6 9 10 11 12 13 14 15 16 21 );
foreach my $ct (@codontables) {
print $ct, " : ",
$seq->translate(undef, undef, undef, $ct)->seq, "\n";
}

This produces the output:

Beginning 3-frame and alternate codon translation example... 
ctgagaaaataa translated using method defaults   : LRK*
ctgagaaaataa translated as a coding region (CDS): MRK
Translating in all six frames:
frame: 0 forward: LRK*
frame: 0 reverse-complement: LFSQ
frame: 1 forward: *EN
frame: 1 reverse-complement: YFL
frame: 2 forward: EKI
frame: 2 reverse-complement: IFS
Translating with all codon tables using method defaults:
1 : LRK*
2 : L*K*
3 : TRK*
4 : LRK*
5 : LSK*
6 : LRKQ
9 : LSN*
10 : LRK*
11 : LRK*
12 : SRK*
13 : LGK*
14 : LSNY
15 : LRK*
16 : LRK*
21 : LSN*

Again, this code is fairly easy to follow, although you may find
yourself turning to the documentation for some of the finer points
when you try to use these objects and methods in your own code.

This part of the code starts by getting an argument of sequence data
if one is available to be shifted off of the @_
argument array and into the variable $c;
otherwise, it sets the variable $c to a preset
sequence:

my $c = shift;
$c ||= 'ctgagaaaataa';

Briefly, the methods that follow in this part of the
sequence_manipulation demo do the following:

Call the PrimarySeq::new constructor to create a
lightweight sequence object from $c that
doesn't have the extra annotation often present in a
standard sequence file (see perldoc
Bio::PrimarySeq for the whole story)

Translate the sequence (from $c)

Call the Bio::Seq constructor to create the
$seq2 sequence object (from the same sequence data
in $seq)

Translate the CDS in the sequence object $seq2

Translate the sequence in all six reading frames

Translate the sequence using all 21 defined codon tables

The translate method seems to have lots of
interesting options. However, if you try to look up the documentation
for this method, you may have difficulty finding it. The problem is
that Bioperl classes often make considerable use of inheritance. Say
that the code is calling a method on a
Bio::PrimarySeq object, as do the following lines
from the demo (somewhat separated in the original):

my $seq = new Bio::PrimarySeq('-SEQ' => $c, '-ID' => 'no.One');
$seq->translate(undef, undef, $frame)->seq, "\n";

You may try perldoc
Bio::PrimarySeq and not find a discussion of the
translate method because it is being inherited
from some other class. And since multiple inheritance is possible, it
can take considerable effort to track down this method documentation
by figuring out the parent classes and searching the documentation.

There's a very convenient way to find the
documentation for a method, even if it's only
inheritedit's built into the
bptutorial.pl script. In the example just
mentioned, you ask for the list of methods used by the
Bio::PrimarySeq method.

[tisdall]$ perl bptutorial.pl 100 Bio::PrimarySeq
***Methods for Object Bio::PrimarySeq ********  
Methods taken from package Bio::DescribableI  
description   display_name   
Methods taken from package Bio::IdentifiableI  
authority   lsid_string   namespace   namespace_string   object_id   version   
Methods taken from package Bio::PrimarySeq  
accession   direct_seq_set   validate_seq   
Methods taken from package Bio::PrimarySeqI  
accession_number   alphabet   can_call_new   desc   display_id   id   
is_circular   length   moltype   primary_id   revcom   seq   
subseq   translate   trunc   
Methods taken from package Bio::Root::Root  
DESTROY   debug   verbose   
Methods taken from package Bio::Root::RootI  
carp   confess   deprecated   new   stack_trace   stack_trace_dump   
throw   throw_not_implemented   warn   warn_not_implemented  
[tisdall]$

Sure enough, there under the heading "Methods taken
from package Bio::PrimarySeqI" appears the method
translate. You should then look at the
Bio::PrimarySeqI documentation and find the
following method description:

         translate
Title   : translate
Usage   : $protein_seq_obj = $dna_seq_obj->translate
#if full CDS expected:
$protein_seq_obj = $cds_seq_obj->translate(undef,undef,undef,undef,1);
Function:
Provides the translation of the DNA sequence using full
IUPAC ambiguities in DNA/RNA and amino acid codes.
The full CDS translation is identical to EMBL/TREMBL
database translation. Note that the trailing terminator
character is removed before returning the translation
object.
Note: if you set $dna_seq_obj->verbose(1) you will get a
warning if the first codon is not a valid initiator.
Returns : A Bio::PrimarySeqI implementing object
Args    : character for terminator (optional) defaults to '*'
character for unknown amino acid (optional) defaults to 'X'
frame (optional) valid values 0, 1, 2, defaults to 0
codon table id (optional) defaults to 1
complete coding sequence expected, defaults to 0 (false)
boolean, throw exception if not complete CDS (true) or
defaults to warning (false)

You will want to explore at least a few more of the tutorials
embedded in bptutorial.pl, as
we've done here with the first of the
tutorials.

Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

Mastering Perl for Bioinformatics [Electronic resources] - نسخه متنی

آدرس پست الکترونیک گیرنده :

آدرس پست الکترونیک فرستنده :

نام و نام خانوارگی فرستنده :

پیغام برای گیرنده ( حداکثر 250 حرف ) :

کد امنیتی را وارد نمایید

فونت

اندازه قلم

حالت نمایش

9.7 bptutorial.pl: sequence_manipulation Demo