5.2 Rebase.pm: A Class Module

Here is a very simple interface to the Rebase data contained in the bionet file that is part of its distribution:

    package Rebase;

Notice that the opening block is considerably pared down compared to earlier classes. For instance, I've tossed the code that keeps count of all objects. Why? Because it's unlikely that more than one of these objects will be necessary in a program, so why bother?

5.2.1 Attributes: Short and Sweet

Notice that the list of attributes is short:

_rebase
    A hash that will be populated to provide the lookup, with enzyme names for keys, and recognition sites (and their translation to regular expressions) for values. (Make sure you see how, in the hash %_attributes, the value of the key _rebase is itself an anonymous hash.)

_bionetfile
    The name of the datafile from the Rebase distribution. In my examples, I use the version numbered bionet.212, but by the time you read this book, more recent versions will be available (you can get bionet.212 from this book's web site).

_dbmfile
    The name of the DBM file that resides on disk and stores the data in the hash _rebase.[1]

_mode
    Contains the permissions with which you will create, or attempt to read, the DBM file. This is important for security purposes.

With so few attributes, the class methods can easily handle each attribute individually, without resorting to AUTOLOAD to define the various accessors and mutators, as seen in previous chapters.

[1] Recall that DBM files are tied to hashes in Perl and provide a simple, easy-to-program database for key/value pairs. They serve as a way to keep a hash on disk between invocations of a program, and so can help you avoid the cost of recalculating a hash each time a program is run. For more information, see O'Reilly's Programming Perl, the documentation for the DB_File module, and the documentation for the dbmopen and tie functions.
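To make the idea of a DBM-tied hash concrete, here is a minimal sketch of a hash that survives between two "runs" of a program. It is an illustration only, not the module's own code: it uses SDBM_File simply because that module ships with Perl (the text above points to DB_File), and the scratch filename and the stored value are assumptions made for the demo.

```perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;                  # a DBM implementation bundled with Perl
use File::Temp qw(tempdir);

# Hypothetical location for the DBM file (a scratch directory for this demo)
my $dir     = tempdir( CLEANUP => 1 );
my $dbmfile = "$dir/rebase_demo";

# First "run": tie a hash to the DBM file and store one key/value pair
my %rebase;
tie( %rebase, 'SDBM_File', $dbmfile, O_RDWR | O_CREAT, 0644 )
    or die "Cannot tie $dbmfile: $!";
$rebase{EcoRI} = 'GAATTC GAATTC';    # assumed format: site, then its regex
untie %rebase;

# Second "run": tie a fresh hash to the same file, read-only
my %again;
tie( %again, 'SDBM_File', $dbmfile, O_RDONLY, 0444 )
    or die "Cannot reopen $dbmfile: $!";
my $stored = $again{EcoRI};
untie %again;

print "EcoRI => $stored\n";
```

The last argument to tie plays the role that the _mode attribute plays in the class: it is the permission mask used when the DBM file is created or opened.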
5.2.2 Creating a Rebase Object

Here's how a Rebase object is created and initialized:

    # The constructor method

The new constructor method is short. It requires a DBM database filename (existing or new), to which the hash data structure _rebase is tied. If the DBM file doesn't exist, the pathname of the bionet Rebase file is also required (that's where the data that populates the DBM file comes from). If a bionet datafile is given, the constructor calls the parse_rebase method, which parses the bionet file to create the _rebase hash.

As the comments indicate, my class is so simple I've even decided to do away with AUTOLOAD and DESTROY, and I've dispensed with the _set mutators as well.

5.2.3 Methods for the Rebase Class

Now, let's continue by looking at the methods for the Rebase class. Given an enzyme, the following two methods, get_recognition_sites and get_regular_expressions, retrieve the enzyme's recognition sites and the translations of the recognition sites into regular expressions.

How do these two methods work? One method returns all recognition sites for an enzyme as given in the Rebase database; the other returns all the translations of the recognition sites into regular expressions. They work in much the same way. First, the enzyme is looked up in the _rebase hash. The value for each enzyme in the _rebase hash is a space-separated string that alternates recognition sites with their regular-expression translations. In both methods, the space-separated string is split to get a list of alternating recognition sites and regular expressions. This list is then assigned to the hash %sites, which populates it with recognition sites as keys (the data that's actually in the Rebase bionet file) and regular expressions as values. The Perl operators keys and values are then used to generate the list of recognition sites (keys) or regular expressions (values).
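The split-into-a-hash idiom just described can be sketched in isolation as follows. This is not the module's code: the helper names are hypothetical, the stored string is made up for illustration rather than taken from Rebase, and the results are sorted only to make the output order predictable.

```perl
use strict;
use warnings;

# Made-up stored value for a hypothetical enzyme:
# site, regex, site, regex, ...
my $stored = 'GAATTC GAATTC GGWCC GG[AT]CC';

# Split on whitespace and load the alternating list into a hash:
# keys are recognition sites, values are regular expressions.
sub sites_and_regexes {
    my ($value) = @_;
    my %sites = split ' ', $value;
    return %sites;
}

# Recognition sites are the keys of that hash
sub recognition_sites {
    my %sites = sites_and_regexes(shift);
    return sort keys %sites;
}

# Regular expressions are the values of that hash
sub regular_expressions {
    my %sites = sites_and_regexes(shift);
    return sort values %sites;
}

print "Sites:   @{[ recognition_sites($stored) ]}\n";
print "Regexes: @{[ regular_expressions($stored) ]}\n";
```

Because the list alternates site and regex, assigning it to a hash pairs each site with its own translation, and a site that appears twice in the string collapses to a single key.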
The get_bionetfile, get_dbmfile, and get_mode methods just report the filenames or mode string that were set as arguments when the object was created:

    sub get_regular_expressions {

5.2.4 parse_rebase

The workhorse method of the class is parse_rebase, which reads the bionet Rebase datafile (with a suffix that indicates the release version, such as bionet.212).

The bionet input datafile begins like this:

    REBASE version 212 bionet.212

The header information ends with a line containing Rich Roberts. Apart from a blank line or two, the rest of the file contains records, one per line. Each record begins with a restriction enzyme name, optionally followed by another enzyme name in parentheses. The last field of each line is the recognition site. Recognition sites are given using IUB codes for nucleotides. (For the IUB codes, see the comments in the program.) They also contain cut sites, indicated by a caret symbol ^.

Cut sites contribute very important information about a restriction enzyme; they show where the enzyme makes the break when it cuts the DNA. Among other things, they are needed to correctly perform restriction digests in the computer when determining if there are overhangs that will be useful when inserting vectors or otherwise reassembling the fragments. However, to simplify the code, we'll ignore the cut sites here and take as our goal the virtual fingerprinting of DNA by just locating the recognition sites. (See the exercises for more on handling cut sites.)

    sub parse_rebase {

This subroutine is a bit complex, reflecting the nature of the data that it's processing. For instance, because enzymes can appear on more than one line, it has to check if an enzyme was already entered as a key in the hash.

Let me remind you of the range operator that is used here to skip header lines:

    # Discard header lines

The expression ( 1 .. /Rich Roberts/ ) returns true (and leads to the line being skipped) only when the line being read falls within the range that runs from line 1 of the input through the first line matching the regular expression /Rich Roberts/. (See the perlop section of the Perl manual for all the details on the range operator.)

The parse_rebase subroutine, after skipping the header and any blank lines, then processes each data line in a while loop. Each line is split into either two fields (name and recognition site) or three fields (name, parenthesized alternate name, and recognition site). The name or names are placed in the @names array and looped through.

In the last foreach loop, if the enzyme name hasn't yet been defined in the DBM-tied hash, it is added as a key. The value assigned to the key is a string with the recognition site followed by a translation of the recognition site to a regular expression. The program then passes to the next name.

If the enzyme name has previously been entered as a key, the previously entered recognition sites are examined, and if the new site is there, the program passes to the next name. Similarly, if the reverse complement of the site has been entered, the program passes to the next name. Otherwise (if the enzyme name was entered, but neither the site nor its reverse complement was in the list of sites for that enzyme), the recognition site is added with its translation to a regular expression.

This method has to handle reverse complements of recognition sites. Many restriction enzymes have palindromic recognition sites, in the sense that the site is the same as its reverse complement. (For instance, the reverse complement of GAATTC is GAATTC.) These biological palindromes indicate that the enzyme can cut the site on both strands.

5.2.5 Methods to Translate Nucleotides to Regular Expressions

Finally, the remaining methods translate IUB-coded nucleotide sequence data to Perl regular expressions. They also perform reverse complementation on IUB-coded sequence data.
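The two operations just named can be sketched as standalone functions. This is a minimal illustration under stated assumptions, not the module's methods: the function names iub_to_regexp and revcom_iub and the table name %iub2class are hypothetical, and stripping the caret reflects the text's decision to ignore cut sites. The mapping itself is the standard IUB ambiguity code table.

```perl
use strict;
use warnings;

# IUB ambiguity codes mapped to regular-expression character classes
my %iub2class = (
    A => 'A', C => 'C', G => 'G', T => 'T',
    R => '[GA]',  Y => '[CT]',  M => '[AC]',  K => '[GT]',
    S => '[GC]',  W => '[AT]',
    B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
    N => '[ACGT]',
);

# Translate an IUB-coded site into a regular-expression string.
# Cut-site carets are dropped, since cut sites are ignored here.
sub iub_to_regexp {
    my ($iub) = @_;
    ( my $site = uc $iub ) =~ s/\^//g;
    my $regexp = '';
    for my $code ( split //, $site ) {
        die "Unknown IUB code '$code'\n" unless exists $iub2class{$code};
        $regexp .= $iub2class{$code};
    }
    return $regexp;
}

# Reverse complement of an IUB-coded sequence: reverse the string, then
# swap each code for the code that stands for the complementary bases.
sub revcom_iub {
    my ($seq) = @_;
    my $revcom = reverse $seq;
    $revcom =~ tr/ACGTRYMKSWBDHVNacgtrymkswbdhvn/TGCAYRKMSWVHDBNtgcayrkmswvhdbn/;
    return $revcom;
}

print iub_to_regexp('GGWCC'), "\n";
print revcom_iub('GAATTC'), "\n";
print revcom_iub('GGTCTC'), "\n";
```

A site such as GAATTC comes back unchanged from revcom_iub, which is exactly the palindrome check that parse_rebase relies on when deciding whether a site or its reverse complement has already been stored.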
    sub revcomIUB {

5.2.6 Testing the Module

Ending the module, as usual, is some POD documentation for the module. Recall that you can view the output of this documentation in various ways: as HTML on a web page, as PostScript, and so on. However, the simplest way is to say the following at the command line:

    perldoc Rebase.pm

Let's try running the sample code given in the documentation. Note that the sample code assumes the Rebase.pm module is available, as is the bionet.212 file from the Rebase distribution. Also, notice there are two alternate calls to Rebase->new, so in your tests you should comment out first one, then the other. Save the sample code from the documentation in a file called testRebase, and when you run it with the command:

    perl testRebase

you get the following output:

    Looking up restriction enzyme EcoRI