Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

4.3 SeqFileIO.pm: Sequence File Formats

Our
primary interest is bioinformatics.Can we extend the
FileIO class to handle biological sequence
datafiles? For example, can a class be written that takes a GenBank
file and writes the sequence out in FASTA format?

Using the technique of inheritance, in this section I present a
module for a new class SeqFileIO that performs
several basic functions on sequence files of various formats. When
you call this module''s read
method, in addition to reading the file''s contents
and setting the name, date, and write mode of the file, it
automatically determines the format of the sequence file, extracts
the sequence, and when available, extracts the annotation, ID, and
accession number. In addition, a set of put
methods makes it easy to present the sequence and annotation in other
formats.[1]

[1] Don Gilbert''s
readseq package (see http://iobio.bio.indiana.edu/soft/molbio/readseq
and )
is the classic program (written in C) for reading and writing
multiple sequence file formats.

4.3.1 Analysis of SeqFileIO.pm

The first part of the module SeqFileIO.pm contains
the block with definitions of the new, or revised, class data and
methods.

The first thing you should notice is the use
command:

use base ( "FileIO" );

This Perl command tells the current package
SeqFileIO it''s inheriting from
the base class FileIO. Here''s
another statement that''s often used for this
purpose:

@ISA = ( "FileIO" );

The @ISA predefined variable tells a package that
it "is a" version of some base
class; it then can inherit methods from that base class. The
use base directive sets the
@ISA array to the base class(es), plus a little
else besides. (Check perldoc
base for the whole story.) Without getting bogged
down in details use base works a little more
robustly than just setting the @ISA array, so
that''s what I''ll use here:

package SeqFileIO;
use base ( "FileIO" );
use strict;
use warnings;
#use vars ''$AUTOLOAD'';
use Carp;
# Class data and methods
{
# A list of all attributes with defaults and read/write/required/noinit properties
my %_attribute_properties = (
_filename    => [ '''',   ''read.write.required''],
_filedata    => [ [ ],  ''read.write.noinit''],
_date        => [ '''',   ''read.write.noinit''],
_writemode   => [ ''>'',  ''read.write.noinit''],
_format      => [ '''',   ''read.write''],
_sequence    => [ '''',   ''read.write''],
_header      => [ '''',   ''read.write''],
_id          => [ '''',   ''read.write''],
_accession   => [ '''',   ''read.write''],
);
# Return a list of all attributes
sub _all_attributes {
keys %_attribute_properties;
}
# Check if a given property is set for a given attribute
sub _permissions {
my($self, $attribute, $permissions) = @_;
$_attribute_properties{$attribute}[1] =~ /$permissions/;
}
# Return the default value for a given attribute
sub _attribute_default {
my($self, $attribute) = @_;
$_attribute_properties{$attribute}[0];
}
my @_seqfileformats = qw(
_raw
_embl
_fasta
_gcg
_genbank
_pir
_staden
);
sub isformat {
my($self) = @_;
for my $format (@_seqfileformats) {
my $is_format = "is$format";
if($self->$is_format) {
return $format;
}
}
return ''unknown'';
}
}

4.3.1.1 The power of inheritance

A comparison with the opening block in the base class
FileIO is instructive. You''ll see
that I''m redefining the
%_attribute_properties hash and the methods in the
block that access the hash as closures. This is necessary to extend
the new class to handle new attributes that relate specifically to
sequence datafiles:

_format

The format of the sequence datafile, such as FASTA or GenBank

_sequence

The sequence data extracted from the sequence datafile as a scalar
string

_header

The header part of the annotation; defined somewhat loosely in this
module

_id

The ID of the sequence datafile, such as a gene name or other
identifier

_accession

The accession number of the sequence datafile, when provided

You''ll notice, in comparing this opening block with
the opening block of the base class FileIO, that
the methods and variables relating to counting the number of objects
has been omitted in this new class.

Here''s where you see the power of inheritance for
the first time. When the code in the new
SeqFileIO.pm module tries to call, say, the
get_count method, and Perl sees that
it''s not defined in the module, it will go on a hunt
for the method in the base class and use the definition it finds
there. On the other hand, if it finds the method in
SeqFileIO.pm, it just uses that; if the
get_count method appeared in the base class but is
redefined in SeqFileIO.pm, it uses that
redefinition and ignores the older definition in the base class.

So, you don''t have to rewrite methods in the base
class(es); you can just call them. Perl not only finds them, it also
calls them as if they were defined in the new class. For example,
ref($self) returns SeqFileIO,
not FileIO, without regard to whether the method
is defined in the new class or inherited from the old class.

Finally, there are some new definitions in this new version of the
opening block. An array @_seqfileformats lists the
sequence file formats the module knows about, and a method
isformat calls the methods associated with each
format (such as is_fasta defined later in the
module) until it either finds what format the file is in or returns
unknown.

The @_seqfileformats array uses the
qw operator. This splits the words on whitespace
(including newlines) and returns a list of quoted words.
It''s a convenience for giving lists of words without
having to quote each one or separate them by commas. (Check the
perlop manpage for the section on
"RegExp Quote-Like Operators" for
all the variations and alternative quoting operators.)

Following the opening block, notice that there is no
new method defined in this class. The simple,
bare-bones new method from the
FileIO class serves just as well for this new
class; thanks to the inheritance mechanism, there''s
no need to write a new one. A program that calls
SeqFileIO->new will use the same method from
the FileIO class, but the class name will be
properly set to SeqFileIO for the new object
created.

4.3.1.2 A new read method

The next part of the program has a rewrite of the
read method. As a result, the
read method from the parent
FileIO class is being overridden:

# Called from object, e.g. $obj->read(  );
sub read {
my ($self, %arg) = @_;
# Set attributes
foreach my $attribute ($self->_all_attributes(  )) {
# E.g. attribute = "_filename",  argument = "filename"
my($argument) = ($attribute =~ /^_(.*)/);
# If explicitly given
if (exists $arg{$argument}) {
# If initialization is not allowed
if($self->_permissions($attribute, ''noinit'')) {
croak("Cannot set $argument from read: use set_$argument");
}
$self->{$attribute} = $arg{$argument};
# If not given, but required
}elsif($self->_permissions($attribute, ''required'')) {
croak("No $argument attribute as required");
# Set to the default
}else{
$self->{$attribute} = $self->_attribute_default($attribute);
}
}
# Read file data
unless( open( FileIOFH, $self->{_filename} ) ) {
croak("Cannot open file " .  $self->{_filename} );
}
$self->{''_filedata''} = [ <FileIOFH> ];
$self->{''_date''} = localtime((stat FileIOFH)[9]);
$self->{''_format''} = $self->isformat;
my $parsemethod = ''parse'' . $self->{''_format''};
$self->$parsemethod;
close(FileIOFH);
}

The new read method just shown is almost exactly
the same as the read function in the
FileIO class. There are only three new lines of
code, just before the end of the subroutine, preceding the call to
the close function:

$self->{''_format''} = $self->isformat;
my $parsemethod = ''parse'' . $self->{''_format''};
$self->$parsemethod;

These three new lines initialize the new attributes that were defined
for this class. First, a call to isformat
determines the format of the file and sets the
_format attribute appropriately. The appropriate
parse_ method name is then constructed, and
finally, a call to that method is made. As you''re
about to see, the parse_ methods extract the
sequence, header, ID, and accession number (or as many of these as
possible) from the file data and set the object''s
attributes accordingly.

So, by the end of the read method, the file data
has been read into the object, the format determined, and the
interesting parts of the data (such as the sequence data) parsed out.

4.3.2 New Methods: is, parse, and put

The rest of the module consists of three groups of methods that
handle the different sequence file formats. These methods
didn''t appear in the more simple and generic
FileIO module and must be defined here:

is_

Tests to see if data is in a particular sequence file format

parse_

Parses out the sequence, and when possible the header, ID, and
accession number

put_

Takes the sequence attribute, and when available the header, ID, and
accession number attributes, and writes them in the sequence file
format named in the format attribute

4.3.2.1 is_ methods

The first group of methods tests an
object''s file data to see if it conforms to a
particular sequence file format:

sub is_raw {
my($self) = @_;
my $seq = join('''', @{$self->{_filedata}} );
($seq =~ /^[ACGNT\s]+$/) ? return ''raw'' : 0;
}
sub is_embl {
my($self) = @_;
my($begin,$seq,$end) = (0,0,0);
foreach( @{$self->{_filedata}} ) {
/^ID\s/ && $begin++;
/^SQ\s/ && $seq++;
/^\/\// && $end++;
(($begin =  = 1) && ($seq =  = 1) && ($end =  = 1)) && return ''embl'';
}
return;
}
sub is_fasta {
my($self) = @_;
my($flag) = 0;
for(@{$self->{_filedata}}) {
#This to avoid confusion with Primer, which can have input beginning ">"
/^\*seq.*:/i && ($flag = 0) && last;
if( /^>/ && $flag =  = 1) {
last;
}elsif( /^>/ && $flag =  = 0) {
$flag = 1;
}elsif( (! /^>/) && $flag =  = 0) { #first line must start with ">"
last;
}
}
$flag ? return ''fasta'' : return;
}
sub is_gcg {
my($self) = @_;
my($i,$j) = (0,0);
for(@{$self->{_filedata}}) {
/^\s*$/ && next;
/Length:.*Check:/ && ($i += 1);
/^\s*\d+\s*[a-zA-Z\s]+/ && ($j += 1);
($i =  = 1) && ($j =  = 1) && return(''gcg'');
}
return;
}
sub is_genbank {
my($self) = @_;
my $Features = 0;
for(@{$self->{_filedata}}) {
/^LOCUS/ && ($Features += 1);
/^DEFINITION/ && ($Features += 2);
/^ACCESSION/ && ($Features += 4);
/^ORIGIN/ &&  ($Features += 8);
/^\/\// && ($Features += 16);
($Features =  = 31) && return ''genbank'';
}
return;
}
sub is_pir {
my($self) = @_;
my($ent,$ti,$date,$org,$ref,$sum,$seq,$end) = (0,0,0,0,0,0,0,0);
for(@{$self->{_filedata}}) {
/ENTRY/ && $ent++;
/TITLE/ && $ti++;
/DATE/ && $date++;
/ORGANISM/ && $org++;
/REFERENCE/ && $ref++;
/SUMMARY/ && $sum++;
/SEQUENCE/ && $seq++;
/\/\/\// && $end++;
$ent =  = 1 && $ti =  = 1 && $date >= 1 && $org >= 1
&& $ref >= 1 && $sum =  = 1 && $seq =  = 1 && $end =  = 1
&& return ''pir'';
}
return;
}
sub is_staden {
my($self) = @_;
for(@{$self->{_filedata}}) {
/<-+([^-]*)-+>/ && return ''staden'';
}
0;
}

is_ methods are designed to be fast. They
don''t check to see if the file conforms to the
official file format in every detail. Instead, they look to see if,
given that the file is supposed to be a sequence file, there is a
good indication that it is a particular file format. In other words,
these methods perform heuristic, not rigorous, tests. If they are
well conceived, these methods correctly identify the different
formats without confusion and with a minimum of code and computation.
However, they don''t ensure that a format is
conforming in every respect to the official format definition. (See
the exercises for other approaches to file format recognition.)

Also, note that these are fairly simple tests; for example, the
is_raw method doesn''t check for
the IUB ambiguity codes but only the four bases A, C, G, T, plus N
for an undetermined base. Furthermore, some software systems have
more than one format, for which only one format is shown here, such
as PIR and GCG. The bottom line is that this code does what it is
intended to do, but not more. It is, however, fairly short and easy
to modify and extend, and I encourage you to try your hand at doing
so in the exercises.

An interesting Perl note: the "leaning
toothpick" regular expression notation for three
forward slashes /// is a bit forbidding because each forward slash
needs a backslash escape in front of it:

    /\/\/\// && $end++;

An alternative would be to use m and a separator
other than /, which leads to more readable code:

    m!///! && $end++;

One more interesting Perl note: in this code, I return a false value
from a subroutine like so:

return;

This is a bit confusing because unless you already know or check in
the documentation for return,
it''s not clear what''s happening. In
a scalar context, return; returns the undefined
value undef; in a list context,
return; returns an empty list. So, no matter what
context the subroutine is called from, a valid value is
returned.

4.3.2.2 put_ methods

The next group of put_ methods returns an
object''s sequence and annotation data in a
particular sequence file format:

sub put_raw {
my($self) = @_;
my($out);
($out = $self->{_sequence}) =~ tr/a-z/A-Z/;
return($out);
}
sub put_embl {
my($self) = @_;
my(@out,$tmp,$len,$i,$j,$a,$c,$g,$t,$o);
$len = length($self->{_sequence});
$a=($self->{_sequence} =~ tr/Aa//);
$c=($self->{_sequence} =~ tr/Cc//);
$g=($self->{_sequence} =~ tr/Gg//);
$t=($self->{_sequence} =~ tr/Tt//);
$o=($len - $a - $c - $g - $t);
$i=0;
$out[$i++] = sprintf("ID   %s %s\n",$self->{_header}, $self->{_id} );
$out[$i++] = "XX\n";
$out[$i++] = sprintf("SQ   sequence %d BP; %d A; %d C; %d G; %d T; %d other;\n",
$len, $a, $c, $g, $t, $o);
for($j = 0 ; $j < $len ; ) {
$out[$i] .= sprintf("%s",substr($self->{_sequence},$j,10));
$j += 10;
if( $j < $len && $j % 60 != 0 ) {
$out[$i] .= " ";
}elsif ($j % 60 =  = 0 ) {
$out[$i++] .= "\n";
}
}
if($j % 60 != 0 ) {
$out[$i++] .= "\n";
}
$out[$i] = "//\n";
return @out;
}
sub put_fasta {
my($self) = @_;
my(@out,$len,$i,$j);
$len = length($self->{_sequence});
$i = 0;
$out[$i++] = "> " . $self->{_header} . "\n";
for($j=0; $j<$len ; $j += 50) {
$out[$i++]=sprintf("%.50s\n",substr($self->{_sequence},$j,50));
}
return @out;
}
sub put_gcg {
my($self) = @_;
my(@out,$len,$i,$j,$cnt,$sum);
$len = length($self->{_sequence});
#calculate Checksum
for($i=0; $i<$len ;$i++) {
$cnt++;
$sum += $cnt * ord(substr($self->{_sequence},$i,1));
($cnt =  = 57)&& ($cnt=0);
}
$sum %= 10000;
$i = 0;             
$out[$i++] = sprintf("%s\n",$self->{_header});
$out[$i++] = sprintf("    %s Length: %d (today)  Check: %d  ..\n", $self->{_id},
$len, $sum);
for($j = 0 ; $j < $len ; ) {
if( $j % 50 =  = 0) {
$out[$i] = sprintf("%8d  ",$j+1);
}
$out[$i] .= sprintf("%s",substr($self->{_sequence},$j,10));
$j += 10;
if( $j < $len && $j % 50 != 0 ) {
$out[$i] .= " ";
}elsif ($j % 50 =  = 0 ) {
$out[$i++] .= "\n";
}
}
if($j % 50 != 0 ) {
$out[$i] .= "\n";
}
return @out;
}
sub put_genbank {
my($self) = @_;
my(@out,$len,$i,$j,$cnt,$sum);
my($seq) = $self->{_sequence};
$seq =~ tr/A-Z/a-z/;
$len = length($seq);
for($i=0; $i<$len ;$i++) {
$cnt++;
$sum += $cnt * ord(substr($seq,$i,1));
($cnt =  = 57) && ($cnt=0);
}
$sum %= 10000;
$i = 0;
$out[$i++] = sprintf("LOCUS       %s       %d bp\n",$self->{_id}, $len);
$out[$i++] = sprintf("DEFINITION  %s , %d bases, %d sum.\n", $self->{_header},
$len, $sum);
$out[$i++] = sprintf("ACCESSION  %s\n", $self->{_accession}, );
$out[$i++] = sprintf("ORIGIN\n");
for($j = 0 ; $j < $len ; ) {
if( $j % 60 =  = 0) {
$out[$i] = sprintf("%8d  ",$j+1);
}
$out[$i] .= sprintf("%s",substr($seq,$j,10));
$j += 10;
if( $j < $len && $j % 60 != 0 ) {
$out[$i] .= " ";
}elsif($j % 60 =  = 0 ) {
$out[$i++] .= "\n";
}
}
if($j % 60 != 0 ) {
$out[$i] .= "\n";
++i;
}
$out[$i] = "//\n";
return @out;
}
sub put_pir {
my($self) = @_;
my($seq) = $self->{_sequence};
my(@out,$len,$i,$j,$cnt,$sum);
$len = length($seq);
for($i=0; $i<$len ;$i++) {
$cnt++;
$sum += $cnt * ord(substr($seq,$i,1));
($cnt=  =57) && ($cnt=0);
}
$sum %= 10000;
$i = 0;
$out[$i++] = sprintf("ENTRY           %s\n",$self->{_id});
$out[$i++] = sprintf("TITLE           %s\n",$self->{_header});
#JDT ACCESSION out if defined
$out[$i++] = sprintf("DATE            %s\n",'''');
$out[$i++] = sprintf("REFERENCE       %s\n",'''');
$out[$i++] = sprintf("SUMMARY #Molecular-weight %d  
#Length %d  #Checksum %dn",0,$len,$sum);
$out[$i++] = sprintf("SEQUENCE\n");
$out[$i++] = sprintf("              5      10      15      20      25      30\n");
for($j=1; $seq && $j < $len ; $j += 30) {
$out[$i++] = sprintf("%7d ",$j);
$out[$i++] = sprintf("%s\n", join('' '',split(//,substr($seq, $j - 1,length($seq) 
< 30 ? length($seq) : 30))) );
}
$out[$i++] = sprintf("///\n");
return @out;
}
sub put_staden {
my($self) = @_;
my($seq) = $self->{_sequence};
my($i,$j,$len,@out);
$i = 0;
$len = length($self->{_sequence});
$out[$i] = ";\<------------------\>\n";
substr($out[$i],int((20-length($self->{_id}))/2),length($self->{_id})) = $self->
{_id};
$i++;
for($j=0; $j<$len ; $j+=60) {
$out[$i++]=sprintf("%s\n",substr($self->{_sequence},$j,60));
}
return @out;
}

The put_ methods are, by necessity, more cognizant
of the detailed rules of a particular file format.

Note the Perl function ord in
put_gcg. This built-in function gives the numeric
value of a character. Other Perl functions such as
sprintf and substr can be
reviewed as necessary by typing perldoc
-f substr (for example) or by
visiting the http://www.perdoc.com web page.

4.3.2.3 parse_ methods

parse_, the third and final group of methods,
parses the contents of a file in a particular format, extracting the
sequence and, if possible, some additional header information.

sub parse_raw {
my($self) = @_;
## Header and ID should be set in calling program after this 
my($seq);
$seq = join('''',@{$self->{_filedata}});
if( ($seq =~ /^([acgntACGNT\-\s]+)$/)) {
($self->{_sequence} = $seq) =~ s/\s//g;
}else{
carp("parse_raw failed");
}
}
sub parse_embl {
my($self) = @_;
my($begin,$seq,$end,$count) = (0,0,0,0);
my($sequence,$head,$acc,$id);
for(@{$self->{_filedata}}) {
++$count;
if(/^ID/) {
$begin++;
/^ID\s*(.*\S)\s*/ && ($id = ($head = $1)) =~ s/\s.*//;
}elsif(/^SQ\s/){
$seq++;
}elsif(/^\/\//){
$end++;
}elsif($seq =  = 1){
$sequence .= $_;
}elsif(/^AC\s*(.*(;|\S)).*/){ #put this here - AC could be sequence
$acc .= $1;
}
if($begin =  = 1 && $seq =  = 1 && $end =  = 1) {
$sequence =~ tr/a-zA-Z//cd;
$sequence =~ tr/a-z/A-Z/;
$self->{_sequence} = $sequence;
$self->{_header} = $head; 
$self->{_id} = $id;
$self->{_accession} = $acc;
return 1;
}
}
return;
}
sub parse_fasta {
my($self) = @_;
my($flag,$count) = (0,0);
my($seq,$head,$id);
for(@{$self->{_filedata}}) {
#avoid confusion with Primer, which can have input beginning ">"
/^\*seq.*:/i && ($flag =  = 0) && last;
if(/^>/ && $flag =  = 1) { 
last;
}elsif(/^>/ && $flag =  = 0){
/^>\s*(.*\S)\s*/ && ($id=($head=$1)) =~ s/\s.*//;
$flag=1;
}elsif( (! /^>/) && $flag =  = 1) { 
$seq .= $_;
}elsif( (! /^>/) && $flag =  = 0) {
last;
}
++$count;
}
if($flag) {
$seq =~ tr/a-zA-Z-//cd;
$seq =~ tr/a-z/A-Z/;
$self->{_sequence} = $seq;
$self->{_header} = $head;
$self->{_id} = $id;
}
}
sub parse_gcg {
my($self) = @_;
my($seq,$head,$id);
my($count,$flag) = (0,0);
for(@{$self->{_filedata}}) {
if(/^\s*$/) { 
;
}elsif($flag =  = 0 && /Length:.*Check:/){
/^\s*(\S+).*Length:.*Check:/;
$flag=1;
($id=$1) =~ s/\s.*//;
}elsif($flag =  = 0 && /^\S/) {
($head = $_) =~ s/\n//; 
}elsif($flag =  = 1 && /^\s*[^\d\s]/) {
last;
}elsif($flag =  = 1 && /^\s*\d+\s*[a-zA-Z \t]+$/) {
$seq .= $_;
}
$count++;
}
$seq =~ tr/a-zA-Z//cd;
$seq =~ tr/a-z/A-Z/;
$head = $id unless $head;
$self->{_sequence} = $seq;
$self->{_header} = $head;
$self->{_id} = $id;
return 1;
}
#
#
sub parse_genbank {
my($self) = @_;
my($count,$features,$flag,$seqflag) = (0,0,0,0);
my($seq,$head,$id,$acc);
for(@{$self->{_filedata}}) {
if( /^LOCUS/ && $flag =  = 0 ) {
/^LOCUS\s*(.*\S)\s*$/;
($id=($head=$1)) =~ s/\s.*//;
$features += 1;
$flag = 1;
}elsif( /^DEFINITION\s*(.*)/ && $flag =  = 1) {
$head .= " $1";
$features += 2;
}elsif( /^ACCESSION/ && $flag =  = 1 ) {
/^ACCESSION\s*(.*\S)\s*$/;
$head .= " ".($acc=$1);
$features += 4;
}elsif( /^ORIGIN/ ) {
$seqflag = 1;
$features += 8;
}elsif( /^\/\// ) {
$features += 16;
}elsif( $seqflag =  = 0 ) { 
; 
}elsif($seqflag =  = 1) { 
$seq .= $_;
}
++$count;
if($features =  = 31) {
$seq =~ tr/a-zA-Z//cd;
$seq =~ tr/a-z/A-Z/;
$self->{_sequence} = $seq;
$self->{_header} = $head;
$self->{_id} = $id;
$self->{_accession} = $acc;
return 1;
}
}
return;
}
sub parse_pir {
my($self) = @_;
my($begin,$tit,$date,$organism,$ref,$summary,$seq,$end,$count) = 
(0,0,0,0,0,0,0,0,0);
my($flag,$seqflag) = (0,0);
my($sequence,$header,$id,$acc);
for(@{$self->{_filedata}}) {
++$count;
if( /^ENTRY\s*(.*\S)\s*$/ && $flag =  = 0 ) {
$header=$1;
$flag=1;
$begin++;
}elsif( /^TITLE\s*(.*\S)\s*$/ && $flag =  = 1 ) {
$header .= $1;
$tit++;
}elsif( /ORGANISM/ ) {
$organism++;
}elsif( /^ACCESSIONS\s*(.*\S)\s*$/ && $flag =  = 1 ) {
($id=($acc=$1)) =~ s/\s*//;
}elsif( /DATE/ ) {
$date++;
}elsif( /REFERENCE/ ) {
$ref++;
}elsif( /SUMMARY/ ) {
$summary++;
}elsif( /^SEQUENCE/ ) {
$seqflag = 1;
$seq++;
}elsif( /^\/\/\// && $flag =  = 1 ) {
$end++;
}elsif( $seqflag =  = 0) {
next;
}elsif( $seqflag =  = 1 && $flag =  = 1) {
$sequence .= $_;
}
if( $begin =  = 1 && $tit =  = 1 && $date >= 1 && $organism >= 1
&& $ref >= 1 && $summary =  = 1 && $seq =  = 1 && $end =  = 1 
) {
$sequence =~ tr/a-zA-Z//cd;
$sequence =~ tr/a-z/A-Z/;
$self->{_sequence} = $seq;
$self->{_header} = $header;
$self->{_id} = $id;
$self->{_accession} = $acc;
return 1;
}
}
return;
}
sub parse_staden {
my($self) = @_;
my($flag,$count) = (0,0);
my($seq,$head,$id);
for(@{$self->{_filedata}}) {
if( /<---*\s*(.*[^-\s])\s*-*--->(.*)/ && $flag =  = 0 ) {
$id = $head = $1;
$seq .= $2;
$flag = 1;
}elsif( /<---*(.*)-*--->/ && $flag =  = 1 ) {
$count--;
last;
}elsif( $flag =  = 1 ) {
$seq .= $_;
}
++$count;
}
if( $flag ) {
$seq =~ s/-/N/g;
$seq =~ tr/a-zA-Z-//cd;
$seq =~ tr/a-z/A-Z/;
$self->{_sequence} = $seq;
$self->{_header} = $head;
$self->{_id} = $id;
return 1;
}
return;
}
1;

That''s the end of the
SeqFileIO.pm module that defines the
SeqFileIO class.

You can see that to add the capability to handle a new sequence file
format, you simply have to write new is_,
put_, and parse_ methods, and
add the name of the new format to the
@_seqfiletypes array. So, extending this software
to handle more sequence file formats is relatively easy. (See the
exercises.)

To end this chapter, here is a small test program
testSeqFileIO that exercises the main parts of the
SeqFileIO.pm module, followed by its output. See
the exercises for a discussion of ways to improve this
test.

4.3.3 Testing SeqFileIO.pm

Here is the program
testSeqFileIO to test the
SeqFileIO module:

#!/usr/bin/perl
use strict;
use warnings;
use lib "/home/tisdall/MasteringPerlBio/development/lib";
use SeqFileIO;
#
# First test basic FileIO operations
#  (plus filetype attribute)
#
my $obj = SeqFileIO->new(  );
$obj->read(
filename => ''file1.txt''
);
print "The file name is ", $obj->get_filename, "\n";
print "The contents of the file are:\n", $obj->get_filedata;
print "\nThe date of the file is ", $obj->get_date, "\n";
print "The filetype of the file is ", $obj->get_filetype, "\n";
$obj->set_date(''today'');
print "The reset date of the file is ", $obj->get_date, "\n";
print "The write mode of the file is ", $obj->get_writemode, "\n";
print "\nResetting the data and filename\n";
my @newdata = ("line1\n", "line2\n");
$obj->set_filedata( \@newdata );
print "Writing a new file \"file2.txt\"\n";
$obj->write(filename => ''file2.txt'');
print "Appending to the new file \"file2.txt\"\n";
$obj->write(filename => ''file2.txt'', writemode => ''>>'');
print "Reading and printing the data from \"file2.txt\":\n";
my $file2 = SeqFileIO->new(  );
$file2->read(
filename => ''file2.txt''
);
print "The file name is ", $file2->get_filename, "\n";
print "The contents of the file are:\n", $file2->get_filedata;
print "The filetype of the file is ", $file2->get_filetype, "\n";
print <<''HEADER'';
##########################################
#
# Test file format recognizing and reading
#
##########################################
HEADER
my $genbank = SeqFileIO->new(  );
$genbank->read(
filename => ''record.gb''
);
print "The file name is ", $genbank->get_filename, "\n";
print "\nThe date of the file is ", $genbank->get_date, "\n";
print "The filetype of the file is ", $genbank->get_filetype, "\n";
print "The contents of the file are:\n", $genbank->get_filedata;
print "\n####################\n####################\n####################\n";
my $raw = SeqFileIO->new(  );
$raw->read(
filename => ''record.raw''
);
print "The file name is ", $raw->get_filename, "\n";
print "\nThe date of the file is ", $raw->get_date, "\n";
print "The filetype of the file is ", $raw->get_filetype, "\n";
print "The contents of the file are:\n", $raw->get_filedata;
print "\n####################\n####################\n####################\n";
my $embl = SeqFileIO->new(  );
$embl->read(
filename => ''record.embl''
);
print "The file name is ", $embl->get_filename, "\n";
print "\nThe date of the file is ", $embl->get_date, "\n";
print "The filetype of the file is ", $embl->get_filetype, "\n";
print "The contents of the file are:\n", $embl->get_filedata;
print "\n####################\n####################\n####################\n";
my $fasta = SeqFileIO->new(  );
$fasta->read(
filename => ''record.fasta''
);
print "The file name is ", $fasta->get_filename, "\n";
print "\nThe date of the file is ", $fasta->get_date, "\n";
print "The filetype of the file is ", $fasta->get_filetype, "\n";
print "The contents of the file are:\n", $fasta->get_filedata;
print "\n####################\n####################\n####################\n";
my $gcg = SeqFileIO->new(  );
$gcg->read(
filename => ''record.gcg''
);
print "The file name is ", $gcg->get_filename, "\n";
print "\nThe date of the file is ", $gcg->get_date, "\n";
print "The filetype of the file is ", $gcg->get_filetype, "\n";
print "The contents of the file are:\n", $gcg->get_filedata;
print "\n####################\n####################\n####################\n";
my $staden = SeqFileIO->new(  );
$staden->read(
filename => ''record.staden''
);
print "The file name is ", $staden->get_filename, "\n";
print "\nThe date of the file is ", $staden->get_date, "\n";
print "The filetype of the file is ", $staden->get_filetype, "\n";
print "The contents of the file are:\n", $staden->get_filedata;
print "\n####################\n####################\n####################\n";
print <<''REFORMAT'';
##########################################
#
# Test file format reformatting and writing
#
##########################################
REFORMAT
print "At this point there are ", $staden->get_count, " objects.\n\n";
print "######\n###### Testing put methods\n######\n\n";
print "\nPrinting staden data in raw format:\n";
print $staden->put_raw;
print "\nPrinting staden data in embl format:\n";
print $staden->put_embl;
print "\nPrinting staden data in fasta format:\n";
print $staden->put_fasta;
print "\nPrinting staden data in gcg format:\n";
print $staden->put_gcg;
print "\nPrinting staden data in genbank format:\n";
print $staden->put_genbank;
print "\nPrinting staden data in PIR format:\n";
print $staden->put_pir;

The test program depends on certain files being present on the
system; the contents of the files are clear from the program output,
and the files can be downloaded from this book''s web
site, along with the rest of the programs for this book.

4.3.4 Results

Here is the output from testSeqFileIO:

The file name is file1.txt
The contents of the file are:
> sample dna  (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
acacctgagccactctcagatgaggaccta
The date of the file is Thu Dec  5 11:22:56 2002
The filetype of the file is _fasta
The reset date of the file is today
The write mode of the file is >
Resetting the data and filename
Writing a new file "file2.txt"
Appending to the new file "file2.txt"
Reading and printing the data from "file2.txt":
The file name is file2.txt
The contents of the file are:
line1
line2
line1
line2
The filetype of the file is _unknown
##########################################
#
# Test file format recognizing and reading
#
##########################################
The file name is record.gb
The date of the file is Sun Mar 30 14:30:09 2003
The filetype of the file is _genbank
The contents of the file are:
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Sequence severely truncated for demonstration.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
KEYWORDS    .
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
mRNA.
ORGANISM  Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
Takano,T.
TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
is regulated by proteolysis
JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
MEDLINE   20261256
REFERENCE   2  (bases 1 to 2487)
AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
Takano,T.
TITLE     Direct Submission
JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
Tadahiro Fujino, Keio University School of Medicine, Department of
Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
(E-mail:fujino@microb.med.keio.ac.jp,
Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
FEATURES             Location/Qualifiers
source          1..2487
/organism="Homo sapiens"
/db_xref="taxon:9606"
/sex="male"
/cell_line="HuS-L12"
/cell_type="lung fibroblast"
/dev_stage="embryo"
gene            229..2199
/gene="PCCX1"
CDS             229..2199
/gene="PCCX1"
/note="a nuclear protein carrying a PHD finger and a CXXC
domain"
/codon_start=1
/product="protein containing CXXC domain 1"
/protein_id="BAA96307.1"
/db_xref="GI:8100075"
/translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"
BASE COUNT      564 a    715 c    768 g    440 t
ORIGIN      
1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
241 aaaaaaaaaa aaaaaaaaaa aaaaaaa
//
####################
####################
####################
The file name is record.raw
The date of the file is Sun Mar 30 14:30:39 2003
The filetype of the file is _raw
The contents of the file are:
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGGAGACGACGCATGGGGCCTGCGCAAT
AGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTC
TGTGCAGGTTCGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAAAAAAA
AAAAAAAAAAAA
####################
####################
####################
The file name is record.embl
The date of the file is Sun Mar 30 14:31:23 2003
The filetype of the file is _embl
The contents of the file are:
ID   AB031069     2487 bp    mRNA            PRI       27-MAY-2000 Sequence severely 
truncated for demonstration. AB031069 AB031069
XX
SQ   sequence 267 BP; 65 A; 54 C; 106 G; 42 T; 0 other;
AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG TTTGCAGCGG
AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT GGGAGGCGTG ACTAGAAGCG
GAAGTAGTTG TGGGCGCCTT TGCAACCGCC TGGGACGCCG CCGAGTGGTC TGTGCAGGTT
CGCGGGTCGC TGGCGGGGGT CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT
AAAAAAAAAA AAAAAAAAAA AAAAAAA
//
####################
####################
####################
The file name is record.fasta
The date of the file is Sun Mar 30 14:31:40 2003
The filetype of the file is _fasta
The contents of the file are:
> AB031069     2487 bp    mRNA            PRI       27-MAY-2000 Sequence severely 
truncated for demonstration. AB031069
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGG
TTTGCAGCGGAGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCT
GGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCC
TGGGACGCCGCCGAGTGGTCTGTGCAGGTTCGCGGGTCGCTGGCGGGGGT
CGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAA
AAAAAAAAAAAAAAAAA
####################
####################
####################
The file name is record.gcg
The date of the file is Sun Mar 30 14:31:47 2003
The filetype of the file is _gcg
The contents of the file are:
AB031069     2487 bp    mRNA            PRI       27-MAY-2000 Sequence severely
truncated for demonstration. AB031069
AB031069 Length: 267 (today)  Check: 4285  ..
1  AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG
51  TTTGCAGCGG AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT
101  GGGAGGCGTG ACTAGAAGCG GAAGTAGTTG TGGGCGCCTT TGCAACCGCC
151  TGGGACGCCG CCGAGTGGTC TGTGCAGGTT CGCGGGTCGC TGGCGGGGGT
201  CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT AAAAAAAAAA
251  AAAAAAAAAA AAAAAAA
####################
####################
####################
The file name is record.staden
The date of the file is Sun Mar 30 14:32:01 2003
The filetype of the file is _staden
The contents of the file are:
;<----AB031069------>
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGG
AGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCG
GAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTCTGTGCAGGTT
CGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGAT
AAAAAAAAAAAAAAAAAAAAAAAAAAA
####################
####################
####################
##########################################
#
# Test file format reformatting and writing
#
##########################################
At this point there are 8 objects.
######
###### Testing put methods
######
Printing staden data in raw format:
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGGAGACGACGCATGGGGCCTGCGCAAT
AGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTC
TGTGCAGGTTCGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAAAAAAAA
AAAAAAAAAAA
Printing staden data in embl format:
ID   AB031069 AB031069
XX
SQ   sequence 267 BP; 65 A; 54 C; 106 G; 42 T; 0 other;
AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG TTTGCAGCGG
AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT GGGAGGCGTG ACTAGAAGCG
GAAGTAGTTG TGGGCGCCTT TGCAACCGCC TGGGACGCCG CCGAGTGGTC TGTGCAGGTT
CGCGGGTCGC TGGCGGGGGT CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT
AAAAAAAAAA AAAAAAAAAA AAAAAAA
//
Printing staden data in fasta format:
> AB031069
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGG
TTTGCAGCGGAGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCT
GGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCC
TGGGACGCCGCCGAGTGGTCTGTGCAGGTTCGCGGGTCGCTGGCGGGGGT
CGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAA
AAAAAAAAAAAAAAAAA
Printing staden data in gcg format:
AB031069
AB031069 Length: 267 (today)  Check: 4285  ..
1  AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG
51  TTTGCAGCGG AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT
101  GGGAGGCGTG ACTAGAAGCG GAAGTAGTTG TGGGCGCCTT TGCAACCGCC
151  TGGGACGCCG CCGAGTGGTC TGTGCAGGTT CGCGGGTCGC TGGCGGGGGT
201  CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT AAAAAAAAAA
251  AAAAAAAAAA AAAAAAA
Printing staden data in genbank format:
LOCUS       AB031069       267 bp
DEFINITION  AB031069 , 267 bases, 829 sum.
ACCESSION  
ORIGIN
1  agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
61  agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
121  gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
181  cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
241  aaaaaaaaaa aaaaaaaaaa aaaaaaa
//
Printing staden data in PIR format:
ENTRY           AB031069
TITLE           AB031069
DATE            
REFERENCE       
SUMMARY         #Molecular-weight 0  #Length 267  #Checksum 4285
SEQUENCE
5        10        15        20        25        30
1 A G A T G G C G G C G C T G A G G G G T C T T G G G G G C T
31 C T A G G C C G G C C A C C T A C T G G T T T G C A G C G G
61 A G A C G A C G C A T G G G G C C T G C G C A A T A G G A G
91 T A C G C T G C C T G G G A G G C G T G A C T A G A A G C G
121 G A A G T A G T T G T G G G C G C C T T T G C A A C C G C C
151 T G G G A C G C C G C C G A G T G G T C T G T G C A G G T T
181 C G C G G G T C G C T G G C G G G G G T C G T G A G G G A G
211 T G C G C C G G G A G C G G A G A T A T G G A G G G A G A T
241 A A A A A A A A A A A A A A A A A A A A A A A A A A A
///

Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی