Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

1.7 Writing Your First Perl Module

Now that you've been introduced to the basic ideas
of modules, it's time to actually examine a working
example of a module.

In this section, we'll write a module called
Geneticcode.pm,
which implements the genetic code that maps DNA codons to amino acids
and then translates a string of DNA sequence data to a protein
fragment.

1.7.1 An Example: Geneticcode.pm

Let's start by creating a file called
Geneticcode.pm and using it to define the mapping
of codons to amino acids in a hash variable called
%genetic_code. We'll also discuss
a subroutine called codon2aa that uses the hash to
translate its codon arguments into amino acid return values.

Here are the contents of the first module file
Geneticcode.pm:

package Geneticcode;
use strict;
use warnings;
my(%genetic_code) = (
'TCA' => 'S',    # Serine
'TCC' => 'S',    # Serine
'TCG' => 'S',    # Serine
'TCT' => 'S',    # Serine
'TTC' => 'F',    # Phenylalanine
'TTT' => 'F',    # Phenylalanine
'TTA' => 'L',    # Leucine
'TTG' => 'L',    # Leucine
'TAC' => 'Y',    # Tyrosine
'TAT' => 'Y',    # Tyrosine
'TAA' => '_',    # Stop
'TAG' => '_',    # Stop
'TGC' => 'C',    # Cysteine
'TGT' => 'C',    # Cysteine
'TGA' => '_',    # Stop
'TGG' => 'W',    # Tryptophan
'CTA' => 'L',    # Leucine
'CTC' => 'L',    # Leucine
'CTG' => 'L',    # Leucine
'CTT' => 'L',    # Leucine
'CCA' => 'P',    # Proline
'CCC' => 'P',    # Proline
'CCG' => 'P',    # Proline
'CCT' => 'P',    # Proline
'CAC' => 'H',    # Histidine
'CAT' => 'H',    # Histidine
'CAA' => 'Q',    # Glutamine
'CAG' => 'Q',    # Glutamine
'CGA' => 'R',    # Arginine
'CGC' => 'R',    # Arginine
'CGG' => 'R',    # Arginine
'CGT' => 'R',    # Arginine
'ATA' => 'I',    # Isoleucine
'ATC' => 'I',    # Isoleucine
'ATT' => 'I',    # Isoleucine
'ATG' => 'M',    # Methionine
'ACA' => 'T',    # Threonine
'ACC' => 'T',    # Threonine
'ACG' => 'T',    # Threonine
'ACT' => 'T',    # Threonine
'AAC' => 'N',    # Asparagine
'AAT' => 'N',    # Asparagine
'AAA' => 'K',    # Lysine
'AAG' => 'K',    # Lysine
'AGC' => 'S',    # Serine
'AGT' => 'S',    # Serine
'AGA' => 'R',    # Arginine
'AGG' => 'R',    # Arginine
'GTA' => 'V',    # Valine
'GTC' => 'V',    # Valine
'GTG' => 'V',    # Valine
'GTT' => 'V',    # Valine
'GCA' => 'A',    # Alanine
'GCC' => 'A',    # Alanine
'GCG' => 'A',    # Alanine
'GCT' => 'A',    # Alanine
'GAC' => 'D',    # Aspartic Acid
'GAT' => 'D',    # Aspartic Acid
'GAA' => 'E',    # Glutamic Acid
'GAG' => 'E',    # Glutamic Acid
'GGA' => 'G',    # Glycine
'GGC' => 'G',    # Glycine
'GGG' => 'G',    # Glycine
'GGT' => 'G',    # Glycine
);
#
# codon2aa
#
# A subroutine to translate a DNA 3-character codon to an amino acid
#   Version 3, using hash lookup
sub codon2aa {
my($codon) = @_;
$codon = uc $codon;
if(exists $genetic_code{$codon}) {
return $genetic_code{$codon};
}else{
die "Bad codon '$codon'!!\n";
}
}
1;

Now, let's examine the code. First, the module
declares its package with a name (Geneticcode)
that is the same as the file it is in
(Geneticcode.pm), but minus the file extension
.pm.

The directives:

use strict;
use warnings;

will appear in all the code. The use strict
directive enforces the use of the my directive for
all variables. The use warnings directive produces
useful messages about potential problems in your code. (It is
possible to turn both directives off when requiredto avoid
annoying warnings in your program output, for instance. See the
perldiag, perllexwarn, and
perlmodlib sections of the Perl manual.)

Finally, there is a subroutine definition for
codon2aa. As an argument, this subroutine
takes a codon represented as a string of three DNA bases and returns
the amino acid code corresponding to the codon. It accomplishes this
by a simple lookup in the hash %genetic_code and
returns the result from the subroutine using the
return built-in function.

The codon2aa subroutine calls
die and exits the program when it encounters an
undefined codon. See the exercises at the end of this chapter for a
discussion of the pros and cons of this behavior.

In my earlier book, I defined the hash
%genetic_code within the subroutine
codon2aa. That meant that every time the
subroutine was called, the hash would have to be initialized, which
took a bit of time. In this version, the hash only has to be
initialized once, when the program is first called, which results in
a significant speedup. The definition of the hash is outside of the
subroutine definition, but in the namespace of the
Geneticcode package. The hash is initialized when
the Geneticcode.pm module is loaded by this
statement:

use Geneticcode;

Every subsequent call to the codon2aa subroutine
simply accesses the hash without having to initialize it each time.

Here's an example that uses the new
Geneticcode module, which is saved in a file
called testGeneticcode and run by typing
perl testGeneticcode:

use strict;
use warnings;
use lib "/home/tisdall/MasteringPerlBio/development/lib";
use Geneticcode;
my $dna = 'AACCTTCCTTCCGGAAGAGAG';
# Initialize variables
my $protein = '';
# Translate each three-base codon to an amino acid, and append to a protein 
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
$protein .= Geneticcode::codon2aa( substr($dna,$i,3) );
}
print $protein, "\n";

Recall that the Perl built-in function substr can
extract a portion of a string. In this case,
substr extracts from $dna the
three characters beginning at the position given in the counter
variable $i; this three-character codon is then
passed as the argument to the subroutine codon2aa.
This program produces the output:

NLPSGRE

1.7.2 Expanding Geneticcode.pm

Now, let's expand our Geneticcode
module example. This new version of the module includes a few short
helper subroutines. The interest here lies in how the subroutines
interact with each other in the module's namespace,
and how to access the code within the module from a Perl program that
uses the module.

Modules are a great way to organize code into logical collections of
interacting parts. When you create modules, you need to decide how to
organize your code into the appropriate collection of modules. Here,
we have some subroutines that translate codons into amino acids;
others read sequence data from files and print it to the screen. This
is a fairly obvious division of functionality, so
let's create two modules for this code.
We'll expand the Geneticcode
module; let's also create a
SequenceIO module. Of course, the new module will
be created in a file called
SequenceIO.pm,
and that file will be placed in a directory that Perl can
findin this case, the same directory in which
we've placed the Geneticcode
module.

Here's the code for
Geneticcode.pm:

package Geneticcode;
use strict;
use warnings;
my(%genetic_code) = (
'TCA' => 'S',    # Serine
'TCC' => 'S',    # Serine
'TCG' => 'S',    # Serine
'TCT' => 'S',    # Serine
'TTC' => 'F',    # Phenylalanine
... as before ...
'GAG' => 'E',    # Glutamic Acid
'GGA' => 'G',    # Glycine
'GGC' => 'G',    # Glycine
'GGG' => 'G',    # Glycine
'GGT' => 'G',    # Glycine
);
#
# codon2aa
#
# A subroutine to translate a DNA 3-character codon to an amino acid
#   Version 3, using hash lookup
sub codon2aa {
my($codon) = @_;
$codon = uc $codon;
if(exists $genetic_code{$codon}) {
return $genetic_code{$codon};
}else{
die "Bad codon '$codon'!!\n";
}
}
#
# dna2peptide 
#
# A subroutine to translate DNA sequence into a peptide
sub dna2peptide {
my($dna) = @_;
# Initialize variables
my $protein = '';
# Translate each three-base codon to an amino acid, and append to a protein 
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
$protein .= codon2aa( substr($dna,$i,3) );
}
return $protein;
}
# translate_frame
#
# A subroutine to translate a frame of DNA
sub translate_frame {
my($seq, $start, $end) = @_;
my $protein;
# To make the subroutine easier to use, you won't need to specify
#  the end point-it will just go to the end of the sequence
#  by default.
unless($end) {
$end = length($seq);
}
# Finally, calculate and return the translation
return dna2peptide ( substr ( $seq, $start - 1, $end -$start + 1) );
}
1;

Now, we have in one module the code that accomplishes a translation
from the genetic code. However, we also need to read sequence in from
FASTA sequence files, and print out sequence (the translated protein)
to the screen. Because these needs are likely to recur in many
programs, it makes sense to make a separate module for just the file
reading, sequence extraction, and sequence printing operations. (This
may even be too much in one module; maybe there should be separate
modules for each need? See the exercises at the end of the chapter.)

Here's the code for the second module
SequenceIO.pm,
which handles reading from a file, extracting FASTA sequence data,
and printing sequence data:

package SequenceIO;
use strict;
use warnings;
# get_file_data
#
# A subroutine to get data from a file given its filename
sub get_file_data {
my($filename) = @_;
# Initialize variables
my @filedata = (  );
open(GET_FILE_DATA, $filename) or die "Cannot open file '$filename':$!\n\n";
@filedata = <GET_FILE_DATA>;
close GET_FILE_DATA;
return @filedata;
}
# extract_sequence_from_fasta_data
#
# A subroutine to extract FASTA sequence data from an array
sub extract_sequence_from_fasta_data {
my(@fasta_file_data) = @_;
# Declare and initialize variables
my $sequence = '';
foreach my $line (@fasta_file_data) {
# discard blank line
if ($line =~ /^\s*$/) {
next;
# discard comment line
} elsif($line =~ /^\s*#/) {
next;
# discard fasta header line
} elsif($line =~ /^>/) {
next;
# keep line, add to sequence string
} else {
$sequence .= $line;
}
}
# remove non-sequence data (in this case, whitespace) from $sequence string
$sequence =~ s/\s//g;
return $sequence;
}
# print_sequence
#
# A subroutine to format and print sequence data 
sub print_sequence {
my($sequence, $length) = @_;
# Print sequence in lines of $length
for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) {
print substr($sequence, $pos, $length), "\n";
}
}
1;

Before we discuss the code, let's see a small
program that uses it:

# Translate a DNA sequence into one of the six reading frames
use strict;
use warnings;
use lib "/home/tisdall/MasteringPerlBio/development/lib";
use Geneticcode;
use SequenceIO;
# Initialize variables
my @file_data = (  );
my $dna = '';
my $revcom = '';
my $protein = '';
# Read in the contents of the file "sample.dna"
@file_data = SequenceIO::get_file_data("sample.dna");
# Extract the sequence data from the contents of the file "sample.dna"
$dna = SequenceIO::extract_sequence_from_fasta_data(@file_data);
# Translate the DNA to protein in one of the six reading frames
#   and print the protein in lines 70 characters long
print "\n -------Reading Frame 1--------\n\n";
$protein = Geneticcode::translate_frame($dna, 1);
SequenceIO::print_sequence($protein, 70);
exit;

Here's the input file:

> sample dna  (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga
gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga
gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag
ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca
gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa
atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc
agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca
tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa
gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac
ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc
acacctgagccactctcagatgaggaccta

Here's the output of the program:

 -------Reading Frame 1--------
RWRR_GVLGALGRPPTGLQRRRRMGPAQ_EYAAWEA_LEAEVVVGAFATAWDAAEWSVQVRGSLAGVVRE
CAGSGDMEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCDNCNEWFHGDCIRITEKMAKA
IREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEPRDEGGGRKRPVPDPDLQRRAGSGTGVGAMLAR
GSASPHKSSPQPLVATPSQHHQQQQQQIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCR
LRQCQLRARESYKYFPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATA
TPEPLSDEDL

A few comments are in order. First, the subroutines for translating
codons are in the Geneticcode module. They include
the hash %genetic_code and the subroutines
codon2aa, dna2peptide, and
translate_frame, which are involved with
translating DNA data to peptides. The subroutines for reading
sequence data in from files, and for formatting and printing it to
the screen, are in the SequenceIO module. They are
the subroutines get_file_data,
extract_sequence_from_fasta_data, and
print_sequence.

Now, we have two modules and code that exercises them;
let's look at some more facets of using modules.

Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی