Mastering Perl for Bioinformatics

2.2 References

Many
computer languages provide variables that allow you to refer to, or
point at, other values. So, instead of a variable containing data
such as a string or number of interest, the variable contains the
location of the data; it tells you where to go
to get the value you want. In Perl, the use of a scalar variable to
refer to another value is called a reference,
and the value being pointed at is called a
referent.

References allow you to do many useful things in Perl; you can define
multidimensional arrays and other more complex data structures and
avoid copying large amounts of data (for instance, when passing
arguments into subroutines). Using references can make your programs
faster, more efficient, and shorter. References have a number of
uses, as you'll see in the next sections.

2.2.1 References to Scalars

Here's an example of a reference:

$peptide = 'EIQADEVRL';
$peptideref = \$peptide;
print "Here is what's in the reference:\n";
print $peptideref, "\n";
print "Here is what the reference is pointing to:\n";
print ${$peptideref}, "\n";
print $$peptideref, "\n";

This Perl code produces the following output:

Here is what's in the reference:
SCALAR(0x80fe4ac)
Here is what the reference is pointing to:
EIQADEVRL
EIQADEVRL

What's going on here?

First, a string value of EIQADEVRL is assigned to
the scalar variable $peptide. Next, a backslash
operator is used before the $peptide variable to
return a reference to the variable. This reference is saved in the
scalar variable $peptideref.

The next lines of code show what this example really does. When you
print out the (actual) value of the reference variable
$peptideref, you get the value:

SCALAR(0x80fe4ac)

This says that the reference variable $peptideref
is pointing to a scalar value (which is the value of the scalar
variable $peptide). It also gives a hexadecimal
number that specifies where in the computer memory the value for that
variable resides.

The 0x at the beginning of the number says that it
is a hexadecimal number.[2] Hexadecimal (base 16) numbers are a common
way to write locations in computer memory. The exact location in the
computer memory where this $peptide value resides
is almost never of practical importance to you. However, it can help
when debugging code that uses references, and so it is displayed when
you print the value of a reference as we've just
done. (The Perl ref function, which we'll use later, reports only the
type of the referent, such as SCALAR, without the address.)

[2] Recall that hexadecimal
numbers use 16 digits, 0 through 9 followed by a through f, so that
the decimal (base 10) numbers 10, 15, and 16 are written a, f, and 10
in hexadecimal.

2.2.1.1 Dereferencing

Finally, our code fragment performs the essential task of
dereferencing
a reference. In Perl a reference to a scalar variable can be
dereferenced by surrounding it with curly braces {} and prepending
another dollar sign to it. ${$peptideref} returns
the value the reference variable is pointing at. The value being
pointed at is the same as the value of the
$peptide variable, which has the value
'EIQADEVRL', so ${$peptideref}
also has the value 'EIQADEVRL'.

Surrounding a reference with curly
braces before prepending the appropriate symbol ($
for scalar, @ for array, % for
hash) is generally the best way to dereference reference variables.
As you start using more intricate references, you'll
find that it's often the only way to dereference
properly. However, for simple reference variables, it is possible to
omit the additional curly braces. So, our example shows both ways of
dereferencing our scalar reference:

${$peptideref}
$$peptideref

In Perl, every reference must be dereferenced properly in the program
(in other words, by the programmer) to be useful. Perl
doesn't automatically dereference for you, nor can
it figure out when you want a reference or when you want the value
that the reference is pointing to. So, it's up to
you to specify that you want the value of a reference by prepending a
%, @, or $
to hash, array, or scalar references, respectively. (And, as just
pointed out, you often need to surround the reference with curly
braces, although for simple references, they can be omitted.)
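
As a small sketch of why dereferencing matters, the following fragment reads the referent with both forms and then assigns through the reference. The variable names follow the example above; the appended residue K and the names $with_braces and $without_braces are only illustrations.

```perl
use strict;
use warnings;

my $peptide    = 'EIQADEVRL';
my $peptideref = \$peptide;

# Both forms read the value the reference points at.
my $with_braces    = ${$peptideref};
my $without_braces = $$peptideref;

# Assigning through the reference changes $peptide itself,
# because the reference holds its location, not a copy.
${$peptideref} .= 'K';

print "$with_braces $without_braces $peptide\n";
```

The two copies taken before the assignment keep the old value, while $peptide itself now ends in K.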

2.2.1.2 Anonymous data

A scalar
constant can also be referenced, as in the following code:

$peptideref = \'EIQADEVRL';
print "Here is what's in the reference:\n";
print $peptideref, "\n";
print "Here is what the reference is pointing to:\n";
print ${$peptideref}, "\n";

This produces the output:

Here is what's in the reference:
SCALAR(0x80fe4a0)
Here is what the reference is pointing to:
EIQADEVRL

In this case the reference points directly to a location in memory in
which the string value EIQADEVRL is being stored.

Compare this code with the previous example. There, the reference was
to an existing variable that held a scalar value. Think of it as a
scalar value with a "name" that is the
already existing variable. Here, the reference is to a scalar value
alone. This scalar value isn't contained in any
variable; it has no name. Thus, it's called an
anonymous referent, which can only be used via
the reference to it.

You may well ask, "Why bother?"
Anonymous scalars are, for most practical purposes, not any more
desirable than simple scalar variables. However, anonymous data
structures, and references to them, are frequently useful, as you
shall see.

2.2.2 References of References

It is
sometimes useful to have references of references. Since a reference
is just a variable containing a scalar value, it's
possible to make a reference to a reference:

$value = 'ACGAAGCT';
$refvalue = \$value;
$refrefvalue = \$refvalue;
print $value, "\n";
print $$refvalue, "\n";
print $$$refrefvalue, "\n";

This prints out:

ACGAAGCT
ACGAAGCT
ACGAAGCT

(Notice that here I've omitted the surrounding curly
braces from around the references.) You can also apply several levels
of reference at one go:

$value = 'ACGAAGCT';
$refrefrefvalue = \\\$value;
print $value, "\n";
print $$$$refrefrefvalue, "\n";

This prints out:

ACGAAGCT
ACGAAGCT
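
One way to keep track of the levels is to ask Perl what each reference points to. The sketch below uses the built-in ref function for that; the names $kind1, $kind2, and $through_two are hypothetical and chosen only for this illustration.

```perl
use strict;
use warnings;

my $value       = 'ACGAAGCT';
my $refvalue    = \$value;
my $refrefvalue = \$refvalue;

# ref reports the type of thing a reference points to.
my $kind1 = ref $refvalue;       # points to a scalar
my $kind2 = ref $refrefvalue;    # points to another reference

# One pair of braces (or one dollar sign) per level of reference.
my $through_two = ${ ${$refrefvalue} };

print "$kind1 $kind2 $through_two\n";
```

Here ref returns SCALAR for the first reference and REF for the second, which is a quick check that the number of dollar signs matches the number of levels.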

2.2.3 References to Arrays

References to
arrays obey pretty much the same syntax as references to scalars. You
make a reference to an array by prepending a backslash to the
@ sign; you dereference the array by surrounding
the reference variable with curly braces and prepending an
@ sign, as in the following example:

@pentamers = ('cggca', 'tgatc', 'ttggc');
$arrayref = \@pentamers;
print "Here is what's in the reference:\n";
print $arrayref, "\n";
print "Here is what the reference is pointing to:\n";
print "@{$arrayref}\n";
print "Here is the second value in the array:\n";
print ${$arrayref}[1], "\n";

This Perl code produces the following output:

Here is what's in the reference:
ARRAY(0x80fe4c4)
Here is what the reference is pointing to:
cggca tgatc ttggc
Here is the second value in the array:
tgatc

An important point to remember here is that Perl
doesn't automatically know if you want the data a
reference is pointing to or the reference variable itself. If
it's pointing to a scalar value,
it's up to you to prepend the $
sign to the reference in order to dereference the value. Similarly,
as in this example, if a reference is pointing to an array value,
it's up to you to prepend the @
sign to the reference to dereference the value;
@{$arrayref} is correct;
@$arrayref is also okay.

On the other hand, if you want the value of one element of the
referenced array, you prepend a dollar sign because the value will be
a scalar value. Recall that to get the scalar value of one element of
an array you use a dollar sign; for example,
$array[0]. Similarly, to get the scalar value of
one element from an array with a reference, you prepend a dollar
sign; for example, ${$arrayref}[0] or
$$arrayref[0].
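
The same prefixes work wherever an array name would normally appear. As a minimal sketch, assuming the @pentamers array from the example above, the following counts, indexes, and extends the array entirely through the reference; the added pentamer 'aatcg' is made up for the illustration.

```perl
use strict;
use warnings;

my @pentamers = ('cggca', 'tgatc', 'ttggc');
my $arrayref  = \@pentamers;

# Number of elements and last index, via the reference.
my $count      = scalar @{$arrayref};
my $last_index = $#{$arrayref};

# push needs an array, so dereference with @.
push @{$arrayref}, 'aatcg';

# One element is a scalar, so dereference with $.
my $newest = ${$arrayref}[-1];

print "$count $last_index $newest ", scalar(@pentamers), "\n";
```

Because $arrayref points at @pentamers rather than at a copy, the push is visible through the original array name as well.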

2.2.3.1 The arrow operator

References to arrays (and
references to hashes and subroutines) can be
dereferenced using another syntax
that's popular and important to learn. If
$arrayref is a reference to an array, then to
dereference the second element (for instance) of that array, you can
say either:

$$arrayref[1]

or, equivalently:

$arrayref->[1]

The following code fragment shows this:

@pentamers = ('cggca', 'tgatc', 'ttggc');
$arrayref = \@pentamers;
print "Here is the second element of the pentamers array:\n";
print $$arrayref[1], "\n";
print "And here it is again:\n";
print $arrayref->[1], "\n";

This code prints out:

Here is the second element of the pentamers array:
tgatc
And here it is again:
tgatc

The arrow operator appears between the name of
the reference to an array and the square brackets and subscript. It
works similarly with hashes and with subroutines, as
you'll see later.

As a convenient shortcut, the arrow operator is optional between
adjacent subscripts. Thus, if:

$array = [ [ 'Dennis', 'Drayna' ], [ 'Callum', 'Bell' ] ];

the following are synonymous:

print $$array[1][1];
print $array->[1][1];
print $array->[1]->[1];

Here's the output:

BellBellBell

I'll show more examples of this shortcut later in
this chapter.
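
For a slightly larger sketch of the same idea, the fragment below walks a two-level array of names. It reuses the $array structure from above; @full_names and $surname are hypothetical names introduced only here.

```perl
use strict;
use warnings;

my $array = [ [ 'Dennis', 'Drayna' ], [ 'Callum', 'Bell' ] ];

# Each element of the outer array is itself an array reference.
my @full_names;
foreach my $row (@{$array}) {
    push @full_names, "$row->[0] $row->[1]";
}

# The arrow between the two subscripts may be dropped ...
my $surname = $array->[1][1];

# ... but the first arrow may not: $array[1][1] would refer to a
# different variable, the array @array, not the reference $array.

print join(', ', @full_names), "; $surname\n";
```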

2.2.3.2 Anonymous arrays

You can create an
anonymous array by surrounding a list
with square brackets. (A mnemonic device to
remember this bit of syntax is that square brackets are also used
with arrays to refer to a particular element, as in
$arr[4].) You can then create a reference to the
anonymous array like so:

$pentamers = ['cggca', 'tgatc', 'ttggc'];
print "The third and last element of the array is ", $pentamers->[2], "\n";

This gives the output:

The third and last element of the array is ttggc

In this case, $pentamers is a reference to an
(anonymous) array. The third element can equally well be printed
using $$pentamers[2]. The entire array is named by
prepending an @ sign:

$pentamers = ['cggca', 'tgatc', 'ttggc'];
print "The third and last element of the array is $$pentamers[2]\n";
print "The entire array is: @$pentamers\n";

This produces the output:

The third and last element of the array is ttggc
The entire array is: cggca tgatc ttggc

2.2.4 References to Hashes

References to
hashes also follow the same rules as
references to scalars and arrays. You make a reference to a hash by
prepending a backslash to the % sign; you
dereference by prepending the percent sign to the dollar sign on the
reference variable:

%geneticmarkers = ('curly' => 'yes', 'hairy' => 'no', 'topiary' => 'yes');
$hashref = \%geneticmarkers;
print "Here is what's in the reference:\n";
print $hashref, "\n";
print "Here is what the reference is pointing to:\n";
foreach $k (keys %$hashref) {
    print "key\t$k\t\tvalue\t$$hashref{$k}\n";
}
print "Dereferencing using the arrow operator:\n";
foreach $k (keys %$hashref) {
    print "key\t$k\t\tvalue\t$hashref->{$k}\n";
}

This Perl code produces the following output:

Here is what's in the reference:
HASH(0x80fe4c4)
Here is what the reference is pointing to:
key        topiary              value        yes
key        curly                value        yes
key        hairy                value        no
Dereferencing using the arrow operator:
key        topiary              value        yes
key        curly                value        yes
key        hairy                value        no

Notice that the keys are printed in a different order than they were
specified: hashes do not preserve the order of their keys. (Also
recall that, in a double quoted string, \t prints
a tab space.)

If you want one value of the referenced hash, you prepend a dollar
sign because the value will be a scalar value. To get one value of a
hash, you use a dollar sign, e.g.,
$geneticmarkers{'curly'}; to get one value from a
reference to a hash, you also use a dollar sign, e.g.,
$$hashref{'curly'}.

The arrow operator -> works with hashes the way
it works with arrays. With hashes, the arrow operator is placed
between the name of the hash reference and the curly braces. To
illustrate:

%geneticmarkers = ('curly' => 'yes', 'hairy' => 'no', 'topiary' => 'yes');
$hashref = \%geneticmarkers;
print "For key 'curly' the value is '", $$hashref{'curly'}, "'\n";
print "For key 'curly' the value is '", $hashref->{'curly'}, "'\n";

This prints:

For key 'curly' the value is 'yes'
For key 'curly' the value is 'yes'
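
A hash reference can also be used to test for keys, add entries, and list keys in a predictable order. The following is only a sketch built on the %geneticmarkers example; the key 'spiny' and the names $has_curly and @sorted are invented for the illustration.

```perl
use strict;
use warnings;

my %geneticmarkers = ('curly' => 'yes', 'hairy' => 'no', 'topiary' => 'yes');
my $hashref = \%geneticmarkers;

# Adding a key through the reference adds it to %geneticmarkers.
$hashref->{'spiny'} = 'no';

# exists works on a dereferenced element just as on a plain one.
my $has_curly = exists $hashref->{'curly'} ? 1 : 0;

# Sorting the keys gives a repeatable order, unlike keys alone.
my @sorted = sort keys %{$hashref};

foreach my $k (@sorted) {
    print "key\t$k\tvalue\t$hashref->{$k}\n";
}
```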

2.2.4.1 Anonymous hashes

You can create an anonymous hash by
surrounding a list with curly braces. (The mnemonic device to
remember this bit of syntax is that curly braces are also used with
hashes to refer to a particular key, as in
$hash{'curly'}.) You can then create a reference
to the anonymous hash like so:

$geneticmarkers = {'curly' => 'yes', 'hairy' => 'no', 'topiary' => 'yes'};
print "Here is what is in the anonymous hash:\n";
foreach $k (keys %$geneticmarkers) {
    print "key\t$k\tvalue\t$geneticmarkers->{$k}\n";
}

This gives the output:

Here is what is in the anonymous hash:
key        topiary        value        yes
key        curly          value        yes
key        hairy          value        no

In this case, $geneticmarkers is a reference to an
(anonymous) hash. The values can equally well be printed using
$$geneticmarkers{$k} or
$geneticmarkers->{$k}.

Curly braces can also be used for blocks and
for subroutine definitions. The Perl interpreter can occasionally get
confused as to which of these constructs is meant, although
it's rare. To be clear, you can put a plus sign
+ in front of an anonymous hash to specify that it
is an anonymous hash and not a block:

$anonhash = +{ 'one' => 1, 'two' => 2 };
print "The old $$anonhash{'one'} $anonhash->{'two'}\n";

This prints:

The old 1 2
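
Anonymous hashes and anonymous arrays can be nested, which is where they start to pay off. As a hedged sketch, the fragment below builds a small hash of arrays mapping a few amino acids to their DNA codons; the variable names ($codons, $first_ala, $ala_count) are hypothetical, and only four amino acids are listed.

```perl
use strict;
use warnings;

# A reference to an anonymous hash whose values are
# references to anonymous arrays.
my $codons = {
    'Met' => [ 'ATG' ],
    'Trp' => [ 'TGG' ],
    'Ala' => [ 'GCT', 'GCC', 'GCA', 'GCG' ],
};

# Add another entry through the reference.
$codons->{'Cys'} = [ 'TGT', 'TGC' ];

# A hash subscript followed by an array subscript.
my $first_ala = $codons->{'Ala'}[0];

# Dereference the inner array to count its elements.
my $ala_count = scalar @{ $codons->{'Ala'} };

foreach my $aa (sort keys %{$codons}) {
    print "$aa: @{ $codons->{$aa} }\n";
}
```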

2.2.5 References to Subroutines

References
to subroutines are yet another way to reference in Perl. This may
seem a little odd. References to scalars, arrays, and hashes are
references to data structures. But references to subroutines? A
subroutine isn't a data structure, so how did this
come about?

There are two reasons why references to subroutines make sense the
same way that references to data structures make sense. The first
reason is that just as variables are managed with
Perl's symbol tables, so also are subroutine
definitions managed by the symbol table. In Chapter 3, you'll see the deliberate
manipulation of a symbol table to make subroutine definitions on the
fly. In this sense, subroutines, hashes, arrays, and scalars all
refer to data that has a name.

The second reason is that references to subroutines are sometimes a
great tool to use when writing a program. There are times when you
might apply one of a number of different subroutines depending on the
program logic and the input, and using references to subroutines can
make this kind of code easier to write. That's the
real justification for just about everything you might find in the
toolbox that we call a programming language, right? (I admit it
sometimes seems that sheer orneriness was the motivation.)

References to subroutines follow the same rules as references to
scalars and arrays. Recall that a subroutine name may optionally be
prepended with the ampersand sign & when it is
called.[3] Thus, these two are equivalent:

findmotif('ATTAATTTTCCGATC');
&findmotif('ATTAATTTTCCGATC');

[3] The ampersand was required in older versions
of Perl.

To make a reference to a subroutine, you prepend a backslash to the
ampersand:

$subref = \&findmotif;

You dereference a subroutine one of two ways: by prepending an
ampersand to the subroutine reference, like so:

&$subref();

or by using the arrow operator, like so:

$subref->();

This is demonstrated by the following code fragment (which includes a
subroutine definition):

print "Mark 1:\n";
findmotif('ATTAATTTTCCGATC');
print "Mark 2:\n";
&findmotif('ATTAATTTTCCGATC');
print "Mark 3:\n";
$subref = \&findmotif;
&$subref('ATTAATTTTCCGATC');
print "Mark 4:\n";
$subref = \&findmotif;
$subref->('ATTAATTTTCCGATC');
print "Mark 5:\n";
$subref2 = \findmotif;
&$subref2('ATTAATTTTCCGATC');
sub findmotif {
    my($input) = @_;
    if($input =~ /CCGA/) {
        print "I found CCGA!\n";
    }else{
        print "No motif\n";
    }
}

This produces the output:

Mark 1:
I found CCGA!
Mark 2:
I found CCGA!
Mark 3:
I found CCGA!
Mark 4:
I found CCGA!
Mark 5:
Not a CODE reference at - line 17.

This code defines a little subroutine findmotif
that looks for a short motif in DNA sequence data. The first two
calls to the subroutine simply demonstrate that you can call
subroutines with or without a leading ampersand
&. The third calls the subroutine by means of
a reference to the subroutine, as just described. The fourth call is
by means of a reference to the subroutine using the alternative arrow
operator. Finally, the fifth call produces an error; the problem is
just a syntactical one; it tries to take a reference to a subroutine
by prepending the backslash to the name of the subroutine without
including the leading ampersand.

It's useful to remind the gentle reader that the
runtime error produced by that fifth call to findmotif
occurs only if you leave out the use strict directive (which you are
encouraged always to include). Without use strict, the program
fails only when it reaches that bad call. With use strict, Perl
rejects the bare name findmotif when it compiles the program, so the
program complains and fails immediately, before any of it runs. What
if that call isn't made until several hours into the
running program (which is not an uncommon running time in
bioinformatics)? use strict can save a lot of time
and effort.
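
For comparison, here is a minimal sketch of the correct form running under use strict. It is a variant of findmotif that returns 1 or 0 instead of printing, an assumption made only so the result can be stored; $kind and $found are hypothetical names.

```perl
use strict;
use warnings;

# Variant of findmotif that returns a true/false value
# rather than printing a message.
sub findmotif {
    my ($input) = @_;
    return $input =~ /CCGA/ ? 1 : 0;
}

# The ampersand is what makes this a reference to the subroutine.
my $subref = \&findmotif;

# ref confirms that the reference points to code.
my $kind = ref $subref;

my $found = $subref->('ATTAATTTTCCGATC');

print "$kind $found\n";
```

Checking ref before calling through a reference that came from elsewhere is one way to catch a bad reference early instead of hours into a run.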

2.2.5.1 Anonymous subroutines

You can create an anonymous subroutine by giving the keyword
sub followed by a subroutine definition within the
usual curly braces, followed by a semicolon. An anonymous subroutine
definition is just like a normal subroutine definition, except the
name of the subroutine is omitted, and you must follow it with a
semicolon. (Recall that subroutine definitions normally are not
followed by a semicolon, as with the subroutine
findmotif in the previous example.)

You can create a reference to the anonymous subroutine like so:

$findmotif = sub {
    my($input) = @_;
    if($input =~ /CCGA/) {
        print "I found CCGA!\n";
    }else{
        print "No motif\n";
    }
};
$findmotif->('ATTAATTTTCCGATC');
&$findmotif('ATTAATTTTCCGATC');

This gives the output:

I found CCGA!
I found CCGA!

In this case, $findmotif is a reference to an
(anonymous) subroutine. The subroutine reference was dereferenced and
called twice to show the use of the two alternative choices of
syntax: the prepended ampersand and the arrow operator.
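
Anonymous subroutines are handy when the program has to choose one of several operations at runtime, as mentioned earlier. One common arrangement, sketched here under the assumption that operations are selected by name, is a hash whose values are subroutine references. The names %dispatch, 'length', and 'revcom' are hypothetical.

```perl
use strict;
use warnings;

# A hash of anonymous subroutines, keyed by operation name.
my %dispatch = (
    'length' => sub {
        my ($dna) = @_;
        return length $dna;
    },
    'revcom' => sub {
        my ($dna) = @_;
        my $rc = reverse $dna;          # reverse the string
        $rc =~ tr/ACGTacgt/TGCAtgca/;   # complement each base
        return $rc;
    },
);

my $dna    = 'ATTAATTTTCCGATC';
my $length = $dispatch{'length'}->($dna);
my $revcom = $dispatch{'revcom'}->($dna);

print "$length $revcom\n";
```

The operation name could just as well come from user input or a configuration file; checking exists $dispatch{$name} first avoids calling an undefined entry.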

2.2.5.2 Passing references to subroutines

Perl collapses all arguments to
a subroutine into a single flat list of scalars. This makes it
impossible to distinguish between, say, two arrays you might try to
pass to a subroutine, as the following example illustrates:

@aminoacids1 = ('E', 'V', 'L');
@aminoacids2 = ('D', 'T', 'Y');
printacids(@aminoacids1, @aminoacids2);
sub printacids {
    my(@aa1, @aa2) = @_;
    print "Amino acids 1\n";
    print "@aa1\n";
    print "Amino acids 2\n";
    print "@aa2\n";
}

This prints out:

Amino acids 1
E V L D T Y
Amino acids 2

As you can see, the elements of both arrays are passed to the
subroutine by means of the special array @_, and
Perl assigns this entire array to the first local array
@aa1.

In order to pass an arbitrary list of any combination of scalars,
arrays, or hashes to a subroutine, it's necessary to
pass the values as references. Here's how to fix the
previous example:

@aminoacids1 = ('E', 'V', 'L');
@aminoacids2 = ('D', 'T', 'Y');
printacids(\@aminoacids1, \@aminoacids2);
sub printacids {
    my($aa1, $aa2) = @_;
    print "Amino acids 1\n";
    print "@$aa1\n";
    print "Amino acids 2\n";
    print "@$aa2\n";
}

This prints out:

Amino acids 1
E V L
Amino acids 2
D T Y

In this version, the subroutine is passed references to the arrays.
Inside the subroutine, the references are collected in the variables
$aa1 and $aa2 and are
dereferenced to print out their contents using the forms
@$aa1 and @$aa2.
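
Once the two arrays arrive as separate references, the subroutine can compare them, which the flattened version could not do. The following sketch assumes a hypothetical subroutine shared_residues and a second array chosen so that the two lists overlap.

```perl
use strict;
use warnings;

# Return the elements of the second array that also
# occur in the first. Both arguments are array references.
sub shared_residues {
    my ($aa1, $aa2) = @_;
    my %seen = map { $_ => 1 } @{$aa1};
    return grep { $seen{$_} } @{$aa2};
}

my @aminoacids1 = ('E', 'V', 'L');
my @aminoacids2 = ('D', 'V', 'E');

my @shared = shared_residues(\@aminoacids1, \@aminoacids2);

print "@shared\n";
```

The result follows the order of the second array, so this prints V E.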

Even when you're passing just one scalar to a
subroutine, you might want to pass a reference. Say you have the DNA
sequence of human chromosome 1 in a variable
$chrom1. You want to pass this sequence into a
subroutine that searches for restriction enzyme sites. A problem can
arise because the usual way of collecting arguments, copying them from
@_ into the subroutine's own variables, makes a second copy of the
data, and with a sequence of that size the copy uses up a significant
portion of your computer's memory.

By passing a reference to the DNA sequence data, you avoid making a
copy of the data, and your program will use less memory. It will also
run much faster because copying large strings is a fairly
time-consuming process for a program.

Here's a simple example of how to pass a scalar
reference to a subroutine:

my $chrom1 = getchrom('1');  # assume we read in human chromosome 1 here
my @enzyme_sites = findrestrictionenzymes(\$chrom1, 'HindIII');
sub findrestrictionenzymes {
    my($seqref, $re) = @_; # $seqref is a reference to a scalar string
                           # $re contains the name of a restriction enzyme
    # ... program logic follows, where $$seqref is the sequence data ...
}
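
To make the idea concrete, here is a small runnable sketch of searching a sequence through a scalar reference. It is not the findrestrictionenzymes routine above: the subroutine find_sites is hypothetical, it takes the recognition sequence itself rather than an enzyme name, and the short test sequence is made up. AAGCTT is the HindIII recognition site.

```perl
use strict;
use warnings;

# Return the 1-based start positions of $site in the
# sequence that $seqref points to. The sequence is never copied.
sub find_sites {
    my ($seqref, $site) = @_;
    my @positions;
    while (${$seqref} =~ /\Q$site\E/g) {
        push @positions, $-[0] + 1;   # $-[0] is the 0-based match start
    }
    return @positions;
}

my $dna  = 'ACGAAGCTTGGAAGCTTA';
my @hits = find_sites(\$dna, 'AAGCTT');

print "@hits\n";
```

This reports non-overlapping matches only, which is enough for a site like AAGCTT that cannot overlap itself.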

Writing programs is a type of engineering, and engineering always
seems to come back to the idea of tradeoffs. The downside of passing
references to subroutines is that anything the subroutine does to the
referenced data stays in effect after the subroutine has exited. This
"action at a distance" needs to be
treated with care, so as not to modify data unintentionally.
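
The tradeoff can be seen in a few lines. In this sketch, both subroutines are hypothetical: lowercase_in_place changes the caller's data through the reference, while lowercase_copy works on a private copy and leaves the original alone.

```perl
use strict;
use warnings;

# Changes the caller's string: the effect outlives the call.
sub lowercase_in_place {
    my ($seqref) = @_;
    ${$seqref} =~ tr/ACGT/acgt/;
    return;
}

# Copies the string first, so the caller's data is untouched.
sub lowercase_copy {
    my ($seqref) = @_;
    my $copy = ${$seqref};
    $copy =~ tr/ACGT/acgt/;
    return $copy;
}

my $dna = 'ACGAAGCT';

my $lower  = lowercase_copy(\$dna);
my $before = $dna;                 # still uppercase here

lowercase_in_place(\$dna);         # $dna itself is now lowercase

print "$before $lower $dna\n";
```

The copying version gives up the memory saving, so for very large sequences the choice depends on whether the caller still needs the original data.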

2.2.5.3 Returning references from subroutines

You'll see
in Chapter 3 how the subroutine called
new returns a reference to an anonymous data
structure declared within the subroutine. Until then,
I'll defer a detailed discussion of how this works;
the bottom line is that a subroutine can return a reference because a
reference is "really" just a scalar
value.
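
As a brief preview, and only a sketch rather than the new method of Chapter 3, a subroutine can build a hash in a local variable and return a reference to it. The subroutine make_record and its field names are hypothetical.

```perl
use strict;
use warnings;

# Build a record in a lexical hash and return a reference to it.
# The hash survives the call because a reference to it still exists.
sub make_record {
    my ($name, $sequence) = @_;
    my %record = (
        'name'     => $name,
        'sequence' => $sequence,
        'length'   => length($sequence),
    );
    return \%record;
}

my $rec1 = make_record('pep1', 'EIQADEVRL');
my $rec2 = make_record('dna1', 'ACGAAGCT');

print "$rec1->{'name'} $rec1->{'length'} $rec2->{'name'} $rec2->{'length'}\n";
```

Each call creates a fresh %record, so the two references point to separate hashes.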

2.2.6 Symbolic Versus Hard References

There are two kinds of references, hard
and symbolic. Hard
references actually point to locations in
computer memory.

For example, a hard reference to a scalar:

$name = 'Joel';

is defined like so:

$nameref = \$name;

and the values associated with the hard reference
$nameref are:

print '$nameref has the value ', $nameref, ' and points to the referent ', 
$$nameref, "\n";

This prints:

$nameref has the value SCALAR(0x80fe4ac) and points to the referent Joel

Symbolic
references refer to a name, not an address. As
a brief example, let's say we have four array
variables @mark1, @mark2,
@mark3, and @mark4. It is
possible to have another variable that is set to one of these
variable names; let's say the variable is called
$arrayname and it's set to the
value mark3, and that is the array we want to
access.

You can place the $arrayname variable in a block.
Because a block returns the value of its last expression, this block
returns the string mark3. You can then place the
special array symbol @ in front of the block, and
Perl will recognize this as meaning the @mark3
array. Here is a demonstration of how this works:

@mark1 = ( 'a1', 'a2', 'a3', 'a4' );
@mark2 = ( 'b1', 'b2', 'b3', 'b4' );
@mark3 = ( 'c1', 'c2', 'c3', 'c4' );
@mark4 = ( 'd1', 'd2', 'd3', 'd4' );
$arrayname = 'mark3';
print "@{$arrayname}\n";

This program prints out the result:

c1 c2 c3 c4

Symbolic references are avoided by some programmers and used
frequently by others; you may sometimes come across them, or even
find yourself using them. Note that they are disallowed under
use strict 'refs' unless you switch that check off. They are used in
the AUTOLOAD methods that install methods at runtime,
which you'll learn about in later
chapters.
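
The following sketch shows one way the @mark3 lookup might be written in a program that uses use strict. It assumes the array is a package variable declared with our, since symbolic references resolve names in the symbol table, and it switches off the 'refs' check only inside a small block. The hash %marks is a hypothetical alternative that reaches the same array with hard references.

```perl
use strict;
use warnings;

our @mark3 = ( 'c1', 'c2', 'c3', 'c4' );
my $arrayname = 'mark3';

# Symbolic reference: allowed only where strict 'refs' is off.
my @via_symbolic;
{
    no strict 'refs';
    @via_symbolic = @{$arrayname};
}

# Alternative with hard references: look the name up in a hash.
my %marks = ( 'mark3' => \@mark3 );
my @via_hash = @{ $marks{$arrayname} };

print "@via_symbolic\n@via_hash\n";
```

The hash version works with lexical (my) arrays as well and fails in an obvious way if the name is misspelled, which is one reason some programmers prefer it.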