4.2 FileIO.pm: A Class to Read and Write Files

Even though you can easily obtain excellent modules for reading and writing files, this chapter shows you how to build a simple one from scratch. One reason for doing this is to better understand the issues every bioinformatics programmer needs to face, such as how to organize files and keep track of their contents. Another reason is so you can see how to extend the class to deal with the multiple file format problem that is peculiar to bioinformatics. It's not uncommon for a biologist to use several different file formats for DNA or protein sequence data and to translate from one format to another. Doing these translations by hand is very tedious. It's also tedious to save alternate forms of the same sequence data in differently formatted files. You'll see how to alleviate some of this pain by automating some of these tasks in a new class called SeqFileIO.pm.

Class inheritance is one of the main reasons why object-oriented software is so reusable. In order to see clearly how it works, let's start with the simple class FileIO.pm and later use it to define a more complex class, SeqFileIO.pm. FileIO is a simple class that reads and writes files and stores simple information such as the file contents, date, and write permissions.

You know that it's often possible to modify existing code to create your own program. When I wrote FileIO.pm, I simply made a copy of the Gene.pm module from Chapter 3 and modified it. On my Linux system, I started by copying Gene.pm to a new file named FileIO.pm:

    cp Gene.pm FileIO.pm

I then edited the new file FileIO.pm, changing the line near the top that says:

    package Gene;

to:

    package FileIO;

The filename must be the same as the class name, with an additional .pm.
Though I now needed to modify the module to do what I wanted, a surprising amount of the overall framework of the code (its constructor, accessor and mutator methods, and its basic data structures) remains the same. Gene.pm already contained such useful parts as a new constructor, a hash-based object data structure, accessor methods to retrieve the values of the object's attributes, and mutator methods to alter attribute values. These are likely to be needed by most classes that you'll write in your own software projects.

4.2.1 Analysis of FileIO

Following is the code for FileIO, with commentary interspersed:

    package FileIO;

In this first part of the FileIO.pm file, the headers are exactly the same as in Gene.pm. The opening block, which contains the class data and methods, also remains the same except for the hash %_attribute_properties. This new version of the hash has different attributes (the filename, the file data, the last modification date of the file, and the mode to use in writing a file) tailored to the needs of reading and writing files.

In addition to the read, write, and required properties, there is also a new "no initialization" (or noinit) property. An attribute with the noinit property may not be given an initial value when an object is created with a call to the new constructor. In this module, attributes such as the data from the file, or the date on the file, are set only when the file is read or written.

You may also have noticed that the default value for the filedata attribute is an anonymous array.

Note that each attribute has both the read and write properties. This being the case, you could simply omit the listing of these properties. However, in the interest of future modification, when I may want to add some attribute that won't have both properties, I've left in the specification of the read and write properties.
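To make the role of the properties concrete, here is a minimal sketch of how such an attribute table might be laid out and queried. The attribute and property names come from the description above; the exact layout (a default value followed by a dot-separated property string), the properties given to _writemode, and the helper names has_property and default_for are assumptions made for illustration, not the module's actual code.

```perl
use strict;
use warnings;

# Assumed layout: attribute name => [ default value, dot-separated properties ].
# Which attributes carry noinit beyond the file data and date is an assumption.
my %attribute_properties = (
    _filename  => [ '',  'read.write.required' ],
    _filedata  => [ [],  'read.write.noinit'   ],
    _date      => [ '',  'read.write.noinit'   ],
    _writemode => [ '>', 'read.write'          ],
);

# Hypothetical helper: returns 1 if the attribute carries the named property.
sub has_property {
    my ($attribute, $property) = @_;
    return 0 unless exists $attribute_properties{$attribute};
    my %set = map { $_ => 1 } split /\./, $attribute_properties{$attribute}[1];
    return $set{$property} ? 1 : 0;
}

# Hypothetical helper: returns the default value of an attribute.
sub default_for {
    my ($attribute) = @_;
    return $attribute_properties{$attribute}[0];
}

# Report the properties of each attribute.
foreach my $attribute (sort keys %attribute_properties) {
    my @flags = grep { has_property($attribute, $_) } qw(read write required noinit);
    print "$attribute: @flags\n";
}
```

Keeping the properties in one table means that a check such as "may this attribute be initialized?" becomes a lookup rather than code scattered through the methods.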
(Note that one method name, get_count, doesn't start with an underscore; this encourages you to call this method to get a count of how many objects currently exist.)

4.2.1.1 The constructor method

You'll notice in the following code that I have cut the new constructor down to the bare bones.

    # The constructor method

Why did I do so? Read on.

The read method

The code continues with the read method:

    # Called from object, e.g. $obj->read( );

This new read method has two parts. The first initializes the object's attributes from the arguments and from the defaults specified in the %_attribute_properties hash. The second reads the file and sets the _filedata and _date attributes from the file's contents and its last modification time.

The first loop in the method initializes the attributes. If an attribute is specified as an argument, the first test checks whether its noinit property is set. If it is, initializing the attribute is forbidden and the program croaks. Otherwise, the attribute is set. If the attribute isn't passed as an argument but has the required property (only the _filename attribute has the required property), the program croaks. Finally, if the argument isn't given and isn't required, the attribute is set to its default value.

After performing those initializations, the read method reads in the specified file. If it can't open the file, the program croaks. (See the exercises for a discussion of this use of croak.) The file is read by the line:

    $self->{'_filedata'} = [ <FileIOFH> ];

In list context, the input operator on the opened filehandle, written <FileIOFH>, reads in the entire file, one line per list element. Here the read happens inside an anonymous array constructor, the square brackets around the angle brackets of the input operator. A reference to this anonymous array containing the file's contents is then assigned to the _filedata attribute.
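The two-part logic just described can be sketched as a standalone function. This is not the module's read method: the function name read_file_object, the trimmed-down property table, the error messages, and the convention of passing arguments without the leading underscore are all assumptions for illustration. The sketch also uses a lexical filehandle with three-argument open rather than the bareword FileIOFH filehandle.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Assumed property table: attribute => [ default, dot-separated properties ].
my %properties = (
    _filename => [ '', 'read.write.required' ],
    _filedata => [ [], 'read.write.noinit'   ],
    _date     => [ '', 'read.write.noinit'   ],
);

# Hypothetical stand-in for the read method. Arguments are given
# without the leading underscore, e.g. filename => 'file1.txt'.
sub read_file_object {
    my (%arg) = @_;
    my $self = {};

    # Part 1: initialize attributes from arguments and defaults.
    foreach my $attribute (keys %properties) {
        my ($default, $props) = @{ $properties{$attribute} };
        (my $argument = $attribute) =~ s/^_//;

        if (exists $arg{$argument}) {
            croak "Cannot initialize $attribute" if $props =~ /\bnoinit\b/;
            $self->{$attribute} = $arg{$argument};
        }
        elsif ($props =~ /\brequired\b/) {
            croak "No $argument attribute given";
        }
        else {
            # Copy array defaults so separate objects never share one array.
            $self->{$attribute} = ref $default eq 'ARRAY' ? [@$default] : $default;
        }
    }

    # Part 2: read the file and record its last modification time.
    open my $fh, '<', $self->{_filename}
        or croak "Cannot open file $self->{_filename}: $!";
    $self->{_filedata} = [ <$fh> ];    # list context: one element per line
    close $fh;

    # Element 9 of the list returned by stat is the modification time;
    # localtime in scalar context turns it into a readable string.
    $self->{_date} = localtime( (stat($self->{_filename}))[9] );

    return $self;
}
```

A call such as read_file_object(filename => 'file1.txt') would return a hash reference holding the file's lines and its date string, while omitting filename, or passing date, would croak.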
4.2.1.2 stat and localtime functions

Finally, the Perl stat and localtime functions are called to generate a string with the file's last modification time, which is assigned to the object attribute _date.

This method of reading a file makes many choices. For instance, the stat function returns a list with many more items of interest about a file, such as its size, owner, and access permission modes (the modification time is the tenth item in that list). As you develop your programs, you should pay attention to details such as whether you need to save some of these additional attributes of a file, the last modification date, or notes about the kind of data in the file.

The next line of code in the program:

    #

is yet another choice to think about. Are there occasions when cloning a file object makes sense? Maybe I'd like to clone a file object, make some small change to the data, give it a new filename, and write it out. Why have I left this out?

4.2.1.3 The write method

The code continues:

    # Write files

The write method handles writing a file object out to an actual file. First, all arguments corresponding to attributes are set as requested. The file is then opened for writing, using the _writemode attribute to specify the mode: for example, > for truncating the file before writing, or >> for appending to the file. The print FileIOFH statement actually does the writing to the opened FileIOFH filehandle, retrieving the file data from the object with the get_filedata method defined by means of AUTOLOAD. Finally, the object's _date attribute is reset to the new modification time.

4.2.1.4 AUTOLOAD

The next section of code is the AUTOLOAD method itself:

    # This takes the place of such accessor definitions as:

This AUTOLOAD method has grown! There's only one difference, however, between this code and the AUTOLOAD code for the Gene.pm class. The attributes of FileIO.pm don't all take simple scalar values, as was the case with Gene.pm.
One attribute, _filedata, is a reference to an anonymous array. In order for the accessors to return the correct data, they must check whether an attribute is a scalar or a reference to an array; if it is an array reference, they dereference it and return the data it holds. So the accessors, and the definitions of them installed into the symbol table, test for an array reference and dereference it accordingly. Other than that, this AUTOLOAD method is exactly the same as that defined for Gene.pm.

You may also have noticed that sections of code in the AUTOLOAD method are almost identical to each other. Recall that AUTOLOAD is invoked when a method with no subroutine defining it is called. AUTOLOAD must do two things. First, it performs whatever method is requested; for example, if an accessor method is requested, it returns the appropriate value. Second, it defines the subroutine that implements the requested method and installs it in the symbol table, so that the next time the method is called, AUTOLOAD and its considerable overhead won't be necessary. Because of these two requirements, the code AUTOLOAD executes to handle the requested method is nearly identical to the method that AUTOLOAD also defines.

Finally, here are the last sections of the FileIO.pm program:

    # When an object is no longer being used, this will be automatically called

The only change here is that there are no other methods (Gene.pm had a citation method).

4.2.2 Finishing FileIO

To finish FileIO.pm, here's some very terse (too terse for anything but a textbook) POD documentation:

    =head1 FileIO

4.2.3 Testing the FileIO Class Module

Now that we've got a class module, complete with examples of its use, let's write a small test program and see how it works. Since the examples in the documentation are, in effect, a small test program, let's try running them. We'll use the file file1.txt, which I created with my text editor and which contains:

    > sample dna (This is a typical fasta header.)
I'll take the code from the documentation pretty much as is, just adding strict and warnings. I'll also include a use lib directive that adds my development library directory to @INC, the list of directories that tells Perl where to look for modules. (Recall that you can either edit this line, override it with the PERL5LIB environment variable, or give your own directory on the command line.) I also add a few print statements to make the output easier to read:

    #!/usr/bin/perl

Finally, I run the test program to get the following output:

    The file name is file1.txt

The module seems to be performing as hoped.

So, now we have a simple module that reads and writes files and provides a few options for the write mode. But, frankly, this isn't too impressive. You've already been reading and writing files in Perl without the overhead of this FileIO module. The interface to the code is nice, and it's good to have objects that contain the file data, but what has really been accomplished?

The real power of this approach is coming up next. Using class inheritance, this simple module can be extended relatively easily in a very useful direction. It's another case of the basic software engineering approach of making small, simple, generally useful tools, and then combining them into more powerful and specific applications. So, next, I'll take my simple FileIO class and use it as a base class for a bioinformatics-specific class.