Mastering Perl for Bioinformatics [Electronic resources] (text version)


4.2 FileIO.pm: A Class to Read and Write Files



Even
though you can easily obtain excellent modules for reading and
writing files, this chapter shows you how to build a simple one from
scratch. One reason for doing this is to better understand the issues
every bioinformatics programmer needs to face, such as how to
organize files and keep track of their contents. Another reason is so
you can see how to extend the class to deal with the multiple file
format problem that is peculiar to bioinformatics.


It's not uncommon for a biologist to use several different file
formats for DNA or protein sequence data and to translate from one
format to another. Doing these translations by hand is very tedious.
It's also tedious to save alternate forms of the same sequence data in
differently formatted files. You'll see how to alleviate some of this
pain by automating some of these tasks in a new class called
SeqFileIO.pm.


Class inheritance is one of the main reasons why object-oriented
software is so reusable. In order to see clearly how it works,
let's start with the simple class
FileIO.pm and later use it to define a more
complex class, SeqFileIO.pm.


FileIO is a simple class that reads and writes
files, and stores simple information such as the file contents, date,
and write permissions.


You know that it's often possible to modify existing
code to create your own program. When I wrote
FileIO.pm, I simply made a copy of the
Gene.pm module from Chapter 3
and modified it.


On my Linux system, I started by copying Gene.pm
to a new file named FileIO.pm:


cp Gene.pm FileIO.pm


I then edited the new file FileIO.pm changing the
line near the top that says:


package Gene;


to:


package FileIO;


The filename must be the same as the class name, with an additional
.pm.
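
As a small illustration of that naming rule, the sketch below shows how
a package name corresponds to the relative file path that Perl searches
for in the directories listed in @INC. The helper name module_path and
the nested package name are made up for this example; they are not part
of FileIO.pm.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of FileIO.pm): convert a package name
# into the relative path that "use" or "require" looks for in @INC.
sub module_path {
    my ($package) = @_;
    (my $path = $package) =~ s{::}{/}g;   # nested packages map to subdirectories
    return "$path.pm";
}

my $simple = module_path('FileIO');
my $nested = module_path('Bio::SeqFileIO');   # made-up nested name

print "$simple\n";   # FileIO.pm
print "$nested\n";   # Bio/SeqFileIO.pm
```

So "use FileIO;" can succeed only if a file named FileIO.pm sits in one
of the @INC directories.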


Though I now needed to modify the module to do what I want, a
surprising amount of the overall framework of the code (its
constructor, accessor and mutator methods, and its basic data
structures) remains the same. Gene.pm already
contained such useful parts as a new constructor,
a hash-based object data structure, accessor methods to retrieve
values of the attributes of the object, and mutator methods to alter
attribute values. These are likely to be needed by most classes that
you'll write in your own software projects.



4.2.1 Analysis of FileIO



Following is the code for FileIO, with commentary
interspersed:


package FileIO;
#
# A simple IO class for sequence data files
#
use strict;
use warnings;
our $AUTOLOAD; # before Perl 5.6.0 say "use vars '$AUTOLOAD';"
use Carp;
# Class data and methods
{
    # A list of all attributes with defaults and read/write/required/noinit properties
    my %_attribute_properties = (
        _filename  => [ '',  'read.write.required' ],
        _filedata  => [ [ ], 'read.write.noinit'   ],
        _date      => [ '',  'read.write.noinit'   ],
        _writemode => [ '>', 'read.write.noinit'   ],
    );
    # Global variable to keep count of existing objects
    my $_count = 0;
    # Return a list of all attributes
    sub _all_attributes {
        keys %_attribute_properties;
    }
    # Check if a given property is set for a given attribute
    sub _permissions {
        my($self, $attribute, $permissions) = @_;
        $_attribute_properties{$attribute}[1] =~ /$permissions/;
    }
    # Return the default value for a given attribute
    sub _attribute_default {
        my($self, $attribute) = @_;
        $_attribute_properties{$attribute}[0];
    }
    # Manage the count of existing objects
    sub get_count {
        $_count;
    }
    sub _incr_count {
        ++$_count;
    }
    sub _decr_count {
        --$_count;
    }
}


In this first part of the FileIO.pm file, the headers
are exactly the same as in Gene.pm.


The opening block, which contains the class data and methods, also
remains the same except for the hash
%_attribute_properties. This new version of the
hash has different attributes (the filename, the file data, the last
modification date of the file, and the mode to use in writing a file)
tailored to the needs of reading and writing files.


In addition to the read, write,
and required properties, there is also a new
"no initialization" (or
noinit) property. An attribute with the
noinit property may not be given an initial value
when an object is created with a call to the new
constructor. In this module, attributes such as data from the file,
or the date on the file, are set only when the file is read or
written. You may also have noticed that the default value for the
filedata attribute is an anonymous array.
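
To make the property lookup concrete, here is a minimal stand-alone
sketch of the same idea outside the class. The function name
has_property is hypothetical, and unlike the _permissions method shown
above it adds word-boundary anchors to the pattern; that is a choice
made for this sketch, not something the module does.

```perl
use strict;
use warnings;

# A stand-alone copy of the property table idea: each attribute maps to
# [ default value, dot-separated property string ].
my %properties = (
    _filename  => [ '',  'read.write.required' ],
    _filedata  => [ [],  'read.write.noinit'   ],
    _writemode => [ '>', 'read.write.noinit'   ],
);

# Hypothetical stand-in for the class's _permissions method:
# returns 1 if the named property appears in the attribute's
# property string, 0 otherwise.
sub has_property {
    my ($attribute, $property) = @_;
    return $properties{$attribute}[1] =~ /\b\Q$property\E\b/ ? 1 : 0;
}

my $filename_required = has_property('_filename', 'required');
my $filedata_noinit   = has_property('_filedata', 'noinit');
my $filename_noinit   = has_property('_filename', 'noinit');

print "_filename required: $filename_required\n";   # 1
print "_filedata noinit:   $filedata_noinit\n";     # 1
print "_filename noinit:   $filename_noinit\n";     # 0
```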


Note
that each attribute has both read and
write properties. This being the case, the listing of those two
properties could simply be omitted. However, in the interest
of future modification, when I may want to add some attribute that
won't have both properties, I've
left in the specification of the read and
write properties. (Note that one method name,
get_count, doesn't start with an
underscore; this encourages you to call this method to get a count of
how many objects currently exist.)



4.2.1.1 The constructor method



You'll notice in the following code that I have cut
the new constructor down to the bare bones.


# The constructor method
# Called from class, e.g. $obj = FileIO->new( );
sub new {
    my ($class, %arg) = @_;
    # Create a new object
    my $self = bless { }, $class;
    $class->_incr_count( );
    return $self;
}


Why did I do so? Read on.


The read method


The code continues with the
read method:


# Called from object, e.g. $obj->read(  );
sub read {
    my ($self, %arg) = @_;
    # Set attributes
    foreach my $attribute ($self->_all_attributes( )) {
        # E.g. attribute = "_filename", argument = "filename"
        my($argument) = ($attribute =~ /^_(.*)/);
        # If explicitly given
        if (exists $arg{$argument}) {
            # If initialization is not allowed
            if($self->_permissions($attribute, 'noinit')) {
                croak("Cannot set $argument from read: use set_$argument");
            }
            $self->{$attribute} = $arg{$argument};
        # If not given, but required
        }elsif($self->_permissions($attribute, 'required')) {
            croak("No $argument attribute as required");
        # Set to the default
        }else{
            $self->{$attribute} = $self->_attribute_default($attribute);
        }
    }
    # Read file data
    unless( open( FileIOFH, $self->{_filename} ) ) {
        croak("Cannot open file " . $self->{_filename} );
    }
    $self->{'_filedata'} = [ <FileIOFH> ];
    $self->{'_date'} = localtime((stat FileIOFH)[9]);
    close(FileIOFH);
}


This new read method has two parts. The first
includes the initialization of the object's
attributes from the arguments and the defaults as specified in the
%_attribute_properties hash. The second includes
the reading of the file and the setting of the
_filedata and _date attributes
from the file's contents and its last modification
time.


The first loop in the program initializes the attributes. If an
attribute is specified as an argument, the first test is to see if
the noinit property is set. This forbids
initializing the attribute, in which case the program
croaks. Otherwise, the attribute is set.


If the attribute isn't passed as an argument but has
a required property (only the
_filename attribute has the
required property), the program
croaks.


Finally, if the argument isn't given and isn't
required, the attribute is set to the default value.


After performing those initializations, the read
method reads in the specified file. If it can't open
the file, the program croaks. (See the exercises
for a discussion of this use of croak.)
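
The three-way decision made for each attribute can be tried out in
isolation. The following is a simplified sketch, not the module's own
code: the function init_attributes and its plain-hash return value are
inventions for this example, and eval is used only to trap the croak
calls so the messages can be printed.

```perl
use strict;
use warnings;
use Carp;

my %properties = (
    _filename  => [ '',  'read.write.required' ],
    _date      => [ '',  'read.write.noinit'   ],
    _writemode => [ '>', 'read.write.noinit'   ],
);

# Hypothetical, simplified version of the attribute loop in read:
# returns a hash reference of attribute values, or croaks.
sub init_attributes {
    my (%arg) = @_;
    my %object;
    foreach my $attribute (sort keys %properties) {
        my ($argument) = ($attribute =~ /^_(.*)/);   # "_filename" -> "filename"
        my ($default, $props) = @{ $properties{$attribute} };
        if (exists $arg{$argument}) {
            # Given explicitly: refuse if the attribute is noinit
            croak("Cannot set $argument from read: use set_$argument")
                if $props =~ /noinit/;
            $object{$attribute} = $arg{$argument};
        }
        elsif ($props =~ /required/) {
            # Not given, but required
            croak("No $argument attribute as required");
        }
        else {
            # Not given and not required: use the default
            $object{$attribute} = $default;
        }
    }
    return \%object;
}

my $ok = init_attributes(filename => 'file1.txt');
print "filename=$ok->{_filename} writemode=$ok->{_writemode}\n";

# Both failure cases croak; eval traps the exception for inspection.
my $missing = eval { init_attributes(); 1 } ? '' : $@;
my $noinit  = eval { init_attributes(filename => 'f', date => 'today'); 1 } ? '' : $@;
print "missing: $missing";
print "noinit:  $noinit";
```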


The file is read by the line:


$self->{'_filedata'} = [ <FileIOFH> ];


In list context, the input operator on the opened filehandle,
written <FileIOFH>, reads in the entire
file. The square brackets around the input operator supply that list
context and build an anonymous array from the lines. A reference
to this anonymous array containing the file's
contents is then assigned to the _filedata
attribute.
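
Here is a short sketch of the same idiom that runs without any file on
disk. It departs from the module in two ways, both chosen for this
example only: it reads from an in-memory string, and it uses a lexical
filehandle ($fh) rather than the bareword FileIOFH.

```perl
use strict;
use warnings;

# Use an in-memory file so the example needs no file on disk.
my $text = "> sample dna\nagatggcggc\ntttgcagcgg\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";

# The square brackets impose list context on <$fh>, so every line is
# read at once; $lines is a reference to the resulting anonymous array.
my $lines = [ <$fh> ];
close($fh);

my $count = scalar @$lines;
print "Read $count lines\n";
print "First line: $lines->[0]";
```

Each array element keeps its trailing newline, which is why the module
can later print the data back out unchanged.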



4.2.1.2 stat and localtime functions



Finally, the Perl
stat and localtime functions
are called to generate a string with the file's last
modification time, which is assigned to the object attribute
_date.


This method of reading a file makes many choices. For instance, the
stat function returns a list with many more items
of interest about a file, such as its size, owner, and access
permission modes; the last modification time is the tenth item
(index 9), which is why the code uses (stat FileIOFH)[9].
As you develop your programs, you should pay attention to
details such as whether you need to save some of these additional
attributes of a file, the last modification date, or notes about the
kind of data in the file.
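
The following sketch shows stat and localtime at work on a scratch
file. The use of File::Temp and the variable names are assumptions made
for this example; the module itself calls stat on its open filehandle.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file so there is something to stat.
my ($fh, $filename) = tempfile(UNLINK => 1);
print $fh "acgt\n";
close($fh);

# stat returns a 13-element list; indexing picks out single fields.
my @info  = stat($filename);
my $size  = $info[7];    # size in bytes
my $mtime = $info[9];    # last modification time, seconds since the epoch

# In scalar context localtime returns a readable string;
# in list context it returns the separate time fields.
my $date_string = scalar localtime($mtime);
my @date_fields = localtime($mtime);

print "Size: $size bytes\n";
print "Modified: $date_string\n";
print "Year: ", $date_fields[5] + 1900, "\n";
```

If you decide to keep more than the date, the other stat fields could
be stored as additional attributes in the same way.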


The next line of code in the program:


#
# N.B. no "clone" method is necessary
#


is yet another choice to think about. Are there occasions when
cloning a file object makes sense? Maybe I'd like to
clone a file object, make some small change to the data, give it a
new filename, and write it out. Why have I left this out?



4.2.1.3 The write method



The code
continues:


# Write files
# Called from object, e.g. $obj->write( );
sub write {
    my ($self, %arg) = @_;
    foreach my $attribute ($self->_all_attributes( )) {
        # E.g. attribute = "_filename", argument = "filename"
        my($argument) = ($attribute =~ /^_(.*)/);
        # If explicitly given
        if (exists $arg{$argument}) {
            $self->{$attribute} = $arg{$argument};
        }
    }
    unless( open( FileIOFH, $self->get_writemode . $self->get_filename ) ) {
        croak("Cannot write to file " . $self->get_filename);
    }
    unless( print FileIOFH $self->get_filedata ) {
        croak("Cannot write to file " . $self->get_filename);
    }
    $self->set_date(scalar localtime((stat FileIOFH)[9]));
    close(FileIOFH);
    return 1;
}


The write method handles writing a file object out
to an actual file. First, all arguments corresponding to attributes
are set as requested. The file is then opened for writing, using the
_writemode attribute to specify the mode: for example, >
for truncating the file before writing or >> for appending to
the file. The print FileIOFH statement actually
does the writing to the opened FileIOFH
filehandle, retrieving the file data from the object with the
get_filedata method defined by means of
AUTOLOAD. Finally, the object's
_date attribute is reset to the new modification
time.
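
The effect of the two write modes can be seen in a small sketch. It is
not the module's code: the helper write_lines is hypothetical, it uses
the three-argument form of open with a lexical filehandle instead of
concatenating the mode and filename, and it writes into a temporary
directory created with File::Temp.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/file2";
my @data = ("line1\n", "line2\n");

# Hypothetical helper: write lines with an explicit mode,
# '>' to truncate first or '>>' to append.
sub write_lines {
    my ($mode, $filename, @lines) = @_;
    open(my $fh, $mode, $filename) or die "Cannot write to file $filename: $!";
    print {$fh} @lines            or die "Cannot write to file $filename: $!";
    close($fh)                    or die "Cannot close file $filename: $!";
    return 1;
}

write_lines('>',  $file, @data);   # new file: 2 lines
write_lines('>>', $file, @data);   # appended: 4 lines
open(my $in, '<', $file) or die "Cannot open $file: $!";
my @after_append = <$in>;
close($in);

write_lines('>',  $file, @data);   # truncated and rewritten: 2 lines again
open($in, '<', $file) or die "Cannot open $file: $!";
my @after_truncate = <$in>;
close($in);

print scalar(@after_append), " lines after appending\n";     # 4
print scalar(@after_truncate), " lines after truncating\n";  # 2
```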



4.2.1.4 AUTOLOAD



The
next section of code is the AUTOLOAD method
itself:


# This takes the place of such accessor definitions as:
# sub get_attribute { ... }
# and of such mutator definitions as:
# sub set_attribute { ... }
sub AUTOLOAD {
    my ($self, $newvalue) = @_;
    my ($operation, $attribute) = ($AUTOLOAD =~ /(get|set)(_\w+)$/);
    # Is this a legal method name?
    unless($operation && $attribute) {
        croak "Method name '$AUTOLOAD' is not in the recognized form\n";
    }
    unless(exists $self->{$attribute}) {
        croak "No such attribute '$attribute' exists in the class ", ref($self);
    }
    # AUTOLOAD accessors
    if($operation eq 'get') {
        unless($self->_permissions($attribute, 'read')) {
            croak "$attribute does not have read permission";
        }
        # Turn off strict references to enable symbol table manipulation
        no strict "refs";
        # Install this accessor definition in the symbol table
        *{$AUTOLOAD} = sub {
            my ($self) = @_;
            unless($self->_permissions($attribute, 'read')) {
                croak "$attribute does not have read permission";
            }
            if(ref($self->{$attribute}) eq 'ARRAY') {
                return @{$self->{$attribute}};
            }else{
                return $self->{$attribute};
            }
        };
        # Turn strict references back on
        use strict "refs";
        # Return the attribute value
        # The attribute could be a scalar or a reference to an array
        if(ref($self->{$attribute}) eq 'ARRAY') {
            return @{$self->{$attribute}};
        }else{
            return $self->{$attribute};
        }
    # AUTOLOAD mutators
    }elsif($operation eq 'set') {
        unless($self->_permissions($attribute, 'write')) {
            croak "$attribute does not have write permission";
        }
        # Turn off strict references to enable symbol table manipulation
        no strict "refs";
        # Install this mutator definition in the symbol table
        *{$AUTOLOAD} = sub {
            my ($self, $newvalue) = @_;
            unless($self->_permissions($attribute, 'write')) {
                croak "$attribute does not have write permission";
            }
            $self->{$attribute} = $newvalue;
        };
        # Turn strict references back on
        use strict "refs";
        # Set and return the attribute value
        $self->{$attribute} = $newvalue;
        return $self->{$attribute};
    }
}


This AUTOLOAD method has grown!
There's only one difference, however, between this
code and the AUTOLOAD code for the
Gene.pm class. The attributes of
FileIO.pm don't all take simple
scalar values, as was the case with Gene.pm.
One attribute, _filedata, is a reference to an
anonymous array. In order for the accessors to return the correct
data, they must check to see if an attribute is a scalar or a
reference to an array; the accessors can then dereference and return
the data from the method call.


So the accessors, and the definitions of them installed into the
symbol table, test for an array reference and dereference it
accordingly. Other than that, this AUTOLOAD method
is exactly the same as that defined for Gene.pm.
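
The array test on its own looks like the following sketch. The plain
hash standing in for an object and the function name get_value are
assumptions for this example only.

```perl
use strict;
use warnings;

# A bare hash standing in for a FileIO object.
my $object = {
    _filename => 'file1.txt',
    _filedata => [ "line1\n", "line2\n" ],
};

# Hypothetical generic getter: dereference array attributes,
# return everything else unchanged.
sub get_value {
    my ($self, $attribute) = @_;
    if (ref($self->{$attribute}) eq 'ARRAY') {
        return @{ $self->{$attribute} };
    }
    return $self->{$attribute};
}

my $name  = get_value($object, '_filename');
my @lines = get_value($object, '_filedata');
my $count = () = get_value($object, '_filedata');   # list context, then count

print "Name: $name\n";
print "Data has $count lines:\n", @lines;
```

Because the getter returns a list for array attributes, a call such as
print $obj->get_filedata prints every line of the file.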


You may also have noticed that sections of code in the
AUTOLOAD method are almost identical to each
other. Recall that AUTOLOAD is invoked when a
method with no subroutine defining it is called.
AUTOLOAD must do two things. First, it performs
whatever method is requested; for example, if an accessor method is
requested, it returns the appropriate value. Second, it defines the
subroutine that implements the requested method and installs it in
the symbol table so the next time the method is called,
AUTOLOAD and its considerable overhead
won't be necessary. Because of these two requirements, the
code AUTOLOAD executes to handle the requested
method is nearly identical to the method that
AUTOLOAD also defines.
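
To see the install-once behavior directly, here is a minimal sketch
with a made-up class named Counter. It handles only get_ methods and
counts how often AUTOLOAD runs. As one possible variation (not what
FileIO.pm does), it builds the accessor first and then calls it, which
avoids writing the same logic twice.

```perl
use strict;
use warnings;

package Counter;   # made-up class for this sketch
our $AUTOLOAD;
my $autoload_calls = 0;

sub new {
    my ($class) = @_;
    return bless { _name => 'demo' }, $class;
}

sub autoload_calls { return $autoload_calls }

sub AUTOLOAD {
    my ($self) = @_;
    my ($attribute) = ($AUTOLOAD =~ /::get(_\w+)$/)
        or die "Method name '$AUTOLOAD' is not in the recognized form\n";
    $autoload_calls++;

    # Define the accessor once, install it, then run it for this call.
    my $accessor = sub { my ($self) = @_; return $self->{$attribute} };
    {
        no strict 'refs';          # needed to assign to a symbolic glob
        *{$AUTOLOAD} = $accessor;
    }
    return $accessor->($self);
}

sub DESTROY { }   # defined so object destruction never reaches AUTOLOAD

package main;

my $obj    = Counter->new;
my $first  = $obj->get_name;   # handled by AUTOLOAD
my $second = $obj->get_name;   # handled by the installed subroutine
my $calls  = Counter->autoload_calls;

print "$first $second, AUTOLOAD ran $calls time(s)\n";
```

The second call never reaches AUTOLOAD, because by then Counter::get_name
exists in the symbol table.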


Finally, here are the last sections of the
FileIO.pm program:


# When an object is no longer being used, this will be automatically called
# and will adjust the count of existing objects
sub DESTROY {
    my($self) = @_;
    $self->_decr_count( );
}
# Other methods. They do not fall into the same form as the
# majority handled by AUTOLOAD
#
1;


The only change here is that there are no other methods
(Gene.pm had a citation method).
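
The interplay of the constructor, the class-level counter, and DESTROY
can be watched in a few lines. The class name Tracked is invented for
this sketch; the counting logic mirrors _incr_count and _decr_count.

```perl
use strict;
use warnings;

package Tracked;   # made-up class name for this sketch

my $count = 0;     # class data: number of live objects

sub new {
    my ($class) = @_;
    $count++;
    return bless {}, $class;
}
sub get_count { return $count }
sub DESTROY   { $count-- }      # called automatically when an object is freed

package main;

my $first = Tracked->new;
my ($inside, $after_scope, $after_undef);
{
    my $second = Tracked->new;
    $inside = Tracked->get_count;       # both objects exist here
}                                       # $second goes out of scope
$after_scope = Tracked->get_count;
undef $first;                           # drop the last reference
$after_undef = Tracked->get_count;

print "inside block: $inside, after block: $after_scope, after undef: $after_undef\n";
```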



4.2.2 Finishing FileIO



To finish FileIO.pm, here's some
very terse (too terse for anything but a textbook) POD documentation:


=head1 FileIO

FileIO: read and write file data

=head1 Synopsis

    use FileIO;
    my $obj = FileIO->new( );
    $obj->read(
        filename => 'jkl'
    );
    print $obj->get_filename, "\n";
    print $obj->get_filedata;
    $obj->set_date('today');
    print $obj->get_date, "\n";
    print $obj->get_writemode, "\n";
    my @newdata = ("line1\n", "line2\n");
    $obj->set_filedata( \@newdata );
    $obj->write(filename => 'lkj');
    $obj->write(filename => 'lkj', writemode => '>>');
    my $o = FileIO->new( );
    $o->read(filename => 'lkj');
    print $o->get_filename, "\n";
    print $o->get_filedata;
    my $gene1 = Gene->new(
        name => 'biggene',
        organism => 'Mus musculus',
        chromosome => '2p',
        pdbref => 'pdb5775.ent',
        author => 'L.G.Jeho',
        date => 'August 23, 1989',
    );
    print "Gene name is ", $gene1->get_name( );
    print "Gene organism is ", $gene1->_get_organism( );
    print "Gene chromosome is ", $gene1->_get_chromosome( );
    print "Gene pdbref is ", $gene1->_get_pdbref( );
    print "Gene author is ", $gene1->_get_author( );
    print "Gene date is ", $gene1->_get_date( );
    $clone = $gene1->clone(name => 'biggeneclone');
    $gene1-> set_chromosome('2q');
    $gene1-> set_pdbref('pdb7557.ent');
    $gene1-> set_author('G.Mendel');
    $gene1-> set_date('May 25, 1865');
    $clone->citation('T. Morgan', 'October 3, 1912');
    print "Clone citation is ", $clone->citation;

=head1 AUTHOR

James Tisdall

=head1 COPYRIGHT

Copyright (c) 2003, James Tisdall

=cut


4.2.3 Testing the FileIO Class Module



Now that we've got a
class module, complete with examples of its use,
let's write a small test program and see how it
works. Since the examples in the documentation are, in effect, a
small test program, let's try running it.
We'll use the file file1.txt I
created with my text editor that contains:


> sample dna  (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
acacctgagccactctcagatgaggaccta


I'll take the code from the documentation pretty
much as is, just adding strict and
warnings. I'll also include a
use lib directive that adds my
development library directory to the list of directories in
@INC, which tells my computer's
Perl where to look for modules. (Recall that you can either edit this
line, override it with the PERL5LIB
environment variable, or give your own directory on the command
line.) I also add a few print statements to make the output easier to
read:


#!/usr/bin/perl
use strict;
use warnings;
use lib "/home/tisdall/MasteringPerlBio/development/lib";
use FileIO;
my $obj = FileIO->new( );
$obj->read(
    filename => 'file1.txt'
);
print "The file name is ", $obj->get_filename, "\n";
print "The contents of the file are:\n", $obj->get_filedata;
print "\nThe date of the file is ", $obj->get_date, "\n";
$obj->set_date('today');
print "The reset date of the file is ", $obj->get_date, "\n";
print "The write mode of the file is ", $obj->get_writemode, "\n";
print "\nResetting the data and filename\n";
my @newdata = ("line1\n", "line2\n");
$obj->set_filedata( \@newdata );
print "Writing a new file \"file2\"\n";
$obj->write(filename => 'file2');
print "Appending to the new file \"file2\"\n";
$obj->write(filename => 'file2', writemode => '>>');
print "Reading and printing the data from \"file2\":\n";
my $file2 = FileIO->new( );
$file2->read(
    filename => 'file2'
);
print "The file name is ", $file2->get_filename, "\n";
print "The contents of the file are:\n", $file2->get_filedata;
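
If a program like this fails with a message that FileIO.pm can't be
located, it helps to check what use lib actually did to @INC. The
following sketch uses a placeholder directory, which is an assumption
for illustration; substitute the path to your own modules.

```perl
use strict;
use warnings;

# Hypothetical directory; substitute the path to your own modules.
use lib '/home/yourname/perl/lib';

# use lib puts its directories at the front of @INC,
# so they are searched before the standard locations.
my $first_dir = $INC[0];

print "Perl searches these directories for modules, in order:\n";
print "  $_\n" for @INC;
```

The same directory could instead be supplied through the PERL5LIB
environment variable or with the -I command-line switch.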


I finally run the test program to get the following output:


The file name is file1.txt
The contents of the file are:
> sample dna (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
acacctgagccactctcagatgaggaccta
The date of the file is Thu Dec 5 11:22:56 2002
The reset date of the file is today
The write mode of the file is >
Resetting the data and filename
Writing a new file "file2"
Appending to the new file "file2"
Reading and printing the data from "file2":
The file name is file2
The contents of the file are:
line1
line2
line1
line2


The module seems to be performing as hoped. So, now we have a simple
module that reads and writes files and provides a few options for the
write mode.


But, frankly, this isn't too impressive.
You've already been reading and writing files in
Perl without the overhead of this FileIO module.
The interface to the code is nice, and it's good to
have objects that contain the file data, but what has really been
accomplished?


The real power of this approach is coming up next. Using class
inheritance, this simple module can be extended relatively easily in
a very useful direction.


It's another case of the basic software engineering
approach of making small, simple, generally useful tools, and then
combining them into more powerful and specific applications. So,
next, I'll take my simple FileIO
class and use it as a base class for a
bioinformatics-specific class.


