Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

7.1 How the Web Works

The Internet is short for "interconnected networks." It is a set of
conventions (protocols) with which computers and networks can
intercommunicate. Its development from earlier work before 1980
allowed many different networks to join and users on many computers
to communicate. This communication was originally done in several
ways, such as by email (electronic mail) and by FTP (File Transfer
Protocol). These methods remain very popular and widely used.

It wasn't until the early 1990s that the World Wide Web, or Web, was
born as a new Internet service. The Web was based on the new Hypertext
Transfer Protocol, or HTTP, and the first software to use it was in
the form of programs called web browsers and web servers. Web
browsers are programs that handle user requests
and display results to the user; the most widely known web browsers
include Internet Explorer and Netscape. Web
servers are programs that accept requests from
web browsers and send results back to them for display; Apache is the
most widely used web server. With the development of web browsers and
their ability to handle images as well as text, these new protocols
sparked intense popular interest in computers. At the same time,
computer costs were falling steadily, and their capabilities were
growing, which made the new web protocol even more widespread.

The Web has become critical to scientific programming; in fact, it
started there. The Web and its associated protocols such as HTTP were
originally developed at a high-energy physics laboratory in
Switzerland, CERN, and
they have been heavily used in the sciences ever since. In biology,
as elsewhere, the Web has become one of the principal means of
communication.

7.1.1 URLs

The Web is essentially a two-part
system of browsers and servers, in which browsers get results from
servers and display them for the user. This type of architecture is
called a
client-server design, in which the client (web
browser) requests service from the server (web server). Web browsers
and servers are just programs that run on computers. They may both be
on the same computer, or, thanks to the Internet, they may be on
opposite ends of the earth.

In order for this scheme to work, the web browser has to be able to
send its request to the web server. For instance, say you want to see
the New York Times from your Internet Explorer
web browser (or your Netscape, Mozilla, or other web browser). You
have to know the location of the New York Times
on the Web and type it into the space provided in your browser
screen.

So, you type http://www.nytimes.com, hit the
Return or Enter key on your keyboard, and the next thing you know
you're reading the latest articles about human
cloning and double-stranded RNA. How does this work, exactly?

The answer is really very simple. The web browser sends your request
to the Internet; the actual location of the desired computer is
determined, and your request is sent to the web server program on
that computer. The web server handles your request and sends back a
web page your browser then displays. This web page
may include other URLs of specific articles. You can click on one,
and the whole process is repeated, but this time your request is for
a specific article, which is then returned to your computer and
displayed by your web browser.

Behind this simple overall architecture are several steps. A basic
familiarity with some of these steps and the associated terminology
is needed in order to learn the fundamentals of web programming.

The location you typed in, http://www.nytimes.com, is called a Uniform
Resource Locator, or URL. The Internet (to which you must be connected
for this to work, of course) takes the hostname in the URL you typed
in and, with the help of the Domain Name System (DNS), a network of
name servers configured for this task, resolves it into an Internet
address (IP address), which is numeric. The address you typed in has a
vague resemblance to English; www.nytimes.com is translated by DNS
into the numeric IP address.

The details of this are not important; if you know the numeric
Internet address, you can usually type it in instead of the domain
name. The advantage of this design is that the name servers of the
Domain Name System (DNS) take a domain name and translate it to the
correct actual Internet address. Then, if the New York Times changes
its main computer, or decides to move to Paris,[1] all that needs to
be done is for the DNS records to be updated with the new actual
Internet address. You can still type www.nytimes.com and get to the
proper web server without worrying about where it actually lives on
the Internet.

[1] The Paris Review moved
from Paris to New York, after all.
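
The name-to-address translation described above can be tried from a
program. The following is a minimal sketch in Python, assuming the
standard socket module and a working network connection; the helper
name resolve is hypothetical, and the address returned for any real
hostname can change over time.

```python
import socket


def resolve(hostname: str) -> str:
    """Ask the system resolver for an IPv4 address for hostname.

    A string that is already a numeric IPv4 address is returned unchanged.
    """
    return socket.gethostbyname(hostname)


# Example call (needs network access, so it is left commented out):
# print(resolve("www.nytimes.com"))
```

Because the lookup happens every time the name is used, the program
never needs to know where the server actually lives.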

A URL can have several parts:

It begins with a
scheme,
which is
"http"
in this case and specifies the protocol for the request.
"http" is the most common scheme;
others include "https" for
increased security, "ftp" for file
transfer protocol, and so on.

A colon and two forward slashes (://) separate the scheme from the
hostname, which is www.nytimes.com. This is the part that's resolved
into a numeric address by the Domain Name System (DNS); it identifies
the computer with which you actually want to communicate.

Following the hostname, several other bits of information may
optionally appear in the URL, such as a particular location on the
server's computer, a port number, some parameters to
pass to the server, and other details that tell the server exactly
what the browser is requesting.

For instance, say you want to use the web page the reporters for the
New York Times use to organize their list of
helpful web sites. You'd type in
http://www.nytimes.com/navigator. Now, following
the hostname is the additional information
/navigator. This is a
pathname for the particular web page
you're interested in. It's just
like a pathname of a file or directory on your filesystem. Sometimes
it is longer, such as
http://www.nytimes.com/library/tech/reference/cynavil,
and includes several directories and subdirectories, and finally a
filename (in this case, it's
cynavil).

These path names are relative to the way the web server is installed
and configured; they tell the web server exactly what resource is
being requested by the browser.
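
The parts of a URL described so far can be pulled apart in a program.
This is a small sketch using Python's standard urllib.parse module,
applied to the longer example URL from the text; the variable names
are illustrative only.

```python
from urllib.parse import urlsplit

url = "http://www.nytimes.com/library/tech/reference/cynavil"
parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.hostname)  # www.nytimes.com
print(parts.port)      # None (no explicit port, so the scheme's default applies)
print(parts.path)      # /library/tech/reference/cynavil

# The pathname is a list of directories followed by a filename.
segments = parts.path.strip("/").split("/")
directories, filename = segments[:-1], segments[-1]
print(directories)     # ['library', 'tech', 'reference']
print(filename)        # cynavil
```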

Following the pathname may come other information. If the pathname is
the name of a CGI program (discussed later in this chapter), certain
arguments may be sent to that program; of course, these vary
depending on the particular CGI program involved. The arguments, or
query, follow a question mark; each argument gives a desired value to
a parameter, and multiple arguments are separated by ampersands. A
typical example might be
http://www.mycomputer.com/cgi/rebase.cgi?enzyme=EcoRI&enzyme=HinDIII.
This requests a web page from the web server on the computer
www.mycomputer.com. The web page to be returned
is generated on the fly by the CGI script on that computer in the
file cgi/rebase.cgi. The URL also passes that
CGI script the names of two enzymes (which the script will presumably
use to formulate its reply), EcoRI and HinDIII.
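
To see how a program might read such arguments, here is a short
sketch using Python's standard urllib.parse module. It assumes the
conventional query form, in which a question mark introduces the
arguments and ampersands separate them; a real CGI script would
normally receive the query from the web server rather than parse a
full URL itself.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "http://www.mycomputer.com/cgi/rebase.cgi?enzyme=EcoRI&enzyme=HinDIII"
parts = urlsplit(url)

# parse_qs collects every value given for a parameter into a list.
params = parse_qs(parts.query)
print(parts.path)        # /cgi/rebase.cgi
print(params["enzyme"])  # ['EcoRI', 'HinDIII']

# Going the other way: build a query string from name-value pairs.
query = urlencode([("enzyme", "EcoRI"), ("enzyme", "HinDIII")])
print(query)             # enzyme=EcoRI&enzyme=HinDIII
```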

Other information may appear in a URL, and other variations are
possible. As one more example, if you had a web page saved on your
computer in the file
/home/tisdall/arabidopsisl, you can display
it by typing the following into a web browser running on the same
computer: file:/home/tisdall/arabidopsisl.

If you have to manipulate URLs in your program (and you very well may
at some point), there is a collection of modules available on CPAN,
the URI distribution (which includes the older URI::URL interface),
that will make your life a whole lot easier.

7.1.2 HTML

The Hypertext Markup
Language (HTML) is the language that embellishes text so that it can
be displayed in a web browser.

There are two important parts of HTML. The first is formatting: HTML
specifies such things as paragraphs, italics, numbered section
headings, and the like. Although text is the most common type of
information displayed, other types of information such as images and
sound are also commonly incorporated into a document.

The other important part of HTML is that it incorporates
hypertext links, which make a document interactive by
providing the user viewing the document in a web browser the ability
to click on links and go to other web pages.

The basic idea of HTML is to embed within a
document directions for how to display the document. The directions
are rather vague, compared to real typesetting tools such as
FrameMaker or Quark. HTML commands may be interpreted differently by
different web browsers so that your HTML document can look
considerably different when viewed by different people. This
limitation was a deliberate part of the design of HTML and web
browsers. The disadvantage of not being able to exactly specify how a
web page appears is offset by the advantages of the simplicity of
HTML and the possibility to view HTML documents on a variety of
computers and operating systems.

7.1.2.1 HTML web page example

To demonstrate, let's
see a short example of an HTML web page:[2]

[2] The Rebase
web page that I'll develop in this chapter will give
you a more complete example. Most web browsers allow you to see the
HTML for whatever web page you're viewing by
clicking on the Page Source link in the View menu of the web browser.
(Your browser may use slightly different names, but all the major web
browsers enable you to look at the HTML source by selecting a menu
item.)

<html>
<head>
<title>Double stranded RNA can regulate genes</title>
</head>
<body>
<h2>Double stranded RNA can regulate genes</h2>
<p>A recent article in <b>Nature</b> describes the important
discovery of <i>RNA interference</i>, the action of snippets
of double-stranded RNA in suppressing gene expression.
</p>
<p>
The discovery has provided a powerful new tool in investigating
gene function, and has raised many questions about the
nature of gene regulation in a wide variety of organisms.
</p>
</body>
</html>

This HTML, if contained in a file, can be displayed in a web browser.
If the file is on the same computer as the web browser you're using,
you can display it easily. If it's on a different computer, the file
has to be in a place that computer's web server has been configured
to look.

For instance, if the file /home/tisdall/htmlexample1l on your
computer contained the previous HTML content, you could type the URL
into your web browser like so:

file:/home/tisdall/htmlexample1l

and the browser would display something like what is shown in Figure 7-1.

Figure 7-1. HTML example

I say "something like" because
many of the details of exactly how the text and layout appear are
left to the browser program. The browser program may be set to use
different font sizes or font types, break the lines at different
places, display a different colored background, and, in general,
specify locally several of the formatting options for the web page to
be displayed. For example, the browser window may be very small, in
which case the text will be reformatted to fit as well as possible
into the available window size. Still, the basic content of the text
should appear similarly to what is shown in Figure 7-1.

7.1.2.2 HTML directives

Let's
take a look at how the HTML directives are embedded into the
document.

HTML directives, or tags, are mostly specified by enclosing them in
angle brackets. Most directives come in pairs, and the text between
the opening and closing directive is affected. The second member of a
pair has an added forward slash / before the tag name.

So, for example, to make a word italicized, you surround the word
with the <i> and the
</i> pair of tags. In the previous example,
the term "RNA interference" is
surrounded in this fashion, so it appears in italics in the browser.

The pair of tags
<html> and </html>
surrounds the entire document, and serves to delimit the HTML content
for the web browser (or other HTML-reading program).

HTML documents have two major sections: the head and the body. The
pair of tags <head> and </head> surrounds text that is related to the
document as a whole. In Figure 7-1, there is only one item in the head
section: a "title" that is displayed in the titlebar of the web
browser. The title tags <title> and </title> surround the title
"Double stranded RNA can regulate genes".
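
As an illustration of how a program can locate the text between a
pair of tags, the following sketch uses Python's standard html.parser
module to pull the title out of a shortened copy of the example page.
The class name TitleFinder is hypothetical, and this is only one
simple way to do it.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only text seen while inside the title pair is kept.
        if self.in_title:
            self.title += data


page = """<html>
<head>
<title>Double stranded RNA can regulate genes</title>
</head>
<body>
<h2>Double stranded RNA can regulate genes</h2>
</body>
</html>"""

finder = TitleFinder()
finder.feed(page)
finder.close()
print(finder.title)  # Double stranded RNA can regulate genes
```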

The head section can contain many different kinds of directives that
influence the display of a document. It is followed by the
"body" of the document which is
surrounded by the tags <body> and
</body> and comprises the rest of the
document.

The body, in this simple example, has a header, paragraphs, and a few
formatting directives.

The headers can be of different levels, so you can make a document
structure with primary headers and various subsections. This example
specifies just a single header as follows:

<h2>Double stranded RNA can regulate genes</h2>

The first paragraph makes the journal name
"Nature" appear in bold font, and
the new term "RNA interference"
appears in italics:

<p>A recent article in <b>Nature</b> describes the important
discovery of <i>RNA interference</i>, the action of snippets
of double-stranded RNA in suppressing gene expression.
</p>

Notice how the paragraph tags <p>
and </p> surround a paragraph. (Actually,
the closing paragraph tag can be omitted as a time-saving
convenience; some very common tags have this feature, but most do
not.)

The second paragraph contains only text:

<p>
The discovery has provided a powerful new tool in investigating
gene function, and has raised many questions about the
nature of gene regulation in a wide variety of organisms.
</p>

The following summary of the document highlights the major sections
and omits the details within the head and the body:

<html>
<head>
... header information goes here
</head>
<body>
... the body of the document goes here
</body>
</html>
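
The nesting in this summary can be checked mechanically. The sketch
below, again using Python's standard html.parser module, keeps a stack
of open tags and records an outline of the document; the class name
NestingChecker is hypothetical. It is deliberately strict, so a page
that omits a closing </p> tag would be reported even though browsers
accept it.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Track open tags on a stack and note any closing tag that does not match."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # tags opened but not yet closed
        self.outline = []     # tag names indented by nesting depth
        self.errors = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * len(self.open_tags) + tag)
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append("unexpected </" + tag + ">")


skeleton = """<html>
<head>
... header information goes here
</head>
<body>
... the body of the document goes here
</body>
</html>"""

checker = NestingChecker()
checker.feed(skeleton)
checker.close()

print("\n".join(checker.outline))
well_formed = not checker.errors and not checker.open_tags
print(well_formed)  # True
```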

That's all there is to say about this simple
example. Other features of HTML include embedded hyperlinks to web
pages, email, and so forth. HTML has expanded in several ways over
the last few years, and many more types of formatting are possible.

7.1.3 HTTP

The Web is based on a protocol called the Hypertext Transfer Protocol,
or HTTP. HTTP is the protocol that web browsers and servers use to
communicate with each other.

Recall that in this chapter I'm using CGI to handle
the communication between browsers and servers, and CGI can be
thought of as a simplified interface to HTTP. So,
it's necessary and useful to learn a few basic facts
about HTTP before embarking on CGI programming.

HTTP works in a simple fashion. The browser sends a
request
which is made of a header and, often, a body. The server receives the
request and sends a
response,
which is also made of a header and, sometimes, a body.

The first line of a request header is called the request line and
contains the request method, the resource being requested, and the
protocol version. The request method is usually GET. This is the most
common request, and it asks for a specific resource from the web
server, usually specified as a URL or the path portion of one.

The remaining lines of the request header are called header fields
and consist of name-value pairs, which include such items as the
hostname the request is being sent to.
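
To make the shape of a request concrete, here is a sketch that
assembles the text of a simple GET request in Python. The function
name build_get_request is hypothetical, the Host field names the
server being addressed, and the Connection field is an assumption
added for illustration; a real browser sends many more header fields.

```python
def build_get_request(host: str, path: str = "/") -> str:
    """Return the text of a minimal HTTP/1.1 GET request."""
    lines = [
        "GET " + path + " HTTP/1.1",  # request line: method, resource, protocol
        "Host: " + host,              # header field: a name-value pair
        "Connection: close",          # header field (illustrative assumption)
    ]
    # Each line ends with carriage return and line feed; a blank line
    # marks the end of the header. A GET request has no body.
    return "\r\n".join(lines) + "\r\n\r\n"


request = build_get_request("www.nytimes.com", "/navigator")
print(request.replace("\r\n", "\n"))
```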

The reply message also has a special first line called the status
line, which reports the protocol, a numeric code representing the
specific response (such as 200), and a text version of the response
(such as "OK").

The remaining lines of the header contain name-value pairs of various
other parameters. For instance, the name and version of the web
server may be specified.

After the reply header may come the body of the response. This is
always separated from the reply header by a blank line (actually a
carriage return and line feed). If the request were for the simple web
page concerning RNA interference shown earlier, the body of the reply
would be exactly the HTML code for that page.
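
The structure of a reply can be taken apart in the same spirit. In
this sketch the reply text is invented for illustration: the header
field values and the shortened HTML body are assumptions, not output
captured from a real server, and parse_response is a hypothetical
helper name.

```python
# A made-up reply: status line, two header fields, a blank line, then the body.
raw_reply = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Apache\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html>\n<head>\n"
    "<title>Double stranded RNA can regulate genes</title>\n"
    "</head>\n</html>\n"
)


def parse_response(text):
    """Split a reply into protocol, code, reason, header fields, and body."""
    # The first blank line separates the header from the body.
    header, _, body = text.partition("\r\n\r\n")
    status_line, *field_lines = header.split("\r\n")
    protocol, code, reason = status_line.split(" ", 2)
    fields = {}
    for line in field_lines:
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return protocol, int(code), reason, fields, body


protocol, code, reason, fields, body = parse_response(raw_reply)
print(protocol, code, reason)   # HTTP/1.1 200 OK
print(fields["Content-Type"])   # text/html
```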

There are many name-value pairs I have not mentioned and many other
details that can have significance within this basically simple HTTP
protocol scheme. However, this overview gives you the basic idea and
the essential structure of the protocol that is exchanged between the
web browser and the server.
