Mastering Perl for Bioinformatics [Electronic resources] نسخه متنی

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

Mastering Perl for Bioinformatics [Electronic resources] - نسخه متنی

| نمايش فراداده ، افزودن یک نقد و بررسی
افزودن به کتابخانه شخصی
ارسال به دوستان
جستجو در متن کتاب
بیشتر
تنظیمات قلم

فونت

اندازه قلم

+ - پیش فرض

حالت نمایش

روز نیمروز شب
جستجو در لغت نامه
بیشتر
لیست موضوعات
توضیحات
افزودن یادداشت جدید










6.6 Relational Database Design


Database
design is the process of effectively organizing data into tables in a
relational database. When thinking about how to organize a database
you need to ask, "What fields should I put together
into tables, and how should I interrelate the
tables?" In this section I will show you a short
example that demonstrates some common and useful techniques for doing
just that.

Relational database software projects are best broken down into
separate stages. This list from Database Systems: a
Practical Approach to Design Implementation, and
Management (see Section 6.10), shows the typical
stages of database design and construction:

Database planningSystem definitionRequirements collection and analysisDatabase designDBMS selectionApplication designPrototypingImplementationData conversion and loadingTestingOperational maintenance

For the small biology lab in which database programming may be a
one-person project, some of these stages may be brief and informal,
but they still apply.

How should the tables be defined for a new database? The answer
depends on the problems to be answered by the data, but
it's also largely a matter of common sense and a
feel for the data. The database beginner typically looks at the data,
tries her hand at a few designs, and begins to get a sense of how
tables can be used for a specific problem. Let yourself experiment
and try a few alternatives, and you'll soon get the
hang of it.

Tables are commonly interrelated by indexing and by joining fields
from different tables. The SQL language implemented with your DBMS
provides these abilities. Also, a group of techniques called
normalization
can help you produce a good design and avoid some problems. Simply
putting data into tables doesn't guarantee a good
design.

A set of rules called normal
forms helps you arrange the data into tables in
a way that avoids certain problems. One such problem is
data redundancy, which is unnecessary duplication of
data in different tables. A related problem is update
anomalies caused by having the same data in
more than one location. When such data is updated, copies may not be
updated properly, and the database can become inaccurate.

Here are some simple rules to follow when designing your database:

Each entry of each table has a single value. This is first
normal form.

Each table has a a unique identifier (called a primary key) for each
row. This is second normal
form.

Names aren't used as identifiers because they can
lead to data redundancy.


Third and other normal forms as well as other design considerations
aren't covered in this book due to space
limitations. See Section 6.10 at the end of the book for more information about relational
database design. Consider Table 6-3, which shows
this alternate, unnormalized version of my
homologs database.

Table 6-3. Unnormalized homolog data


Name


Date


Organisms


aging


1984-07-13


human, worm


wrinkle


1987-08-15


human, mouse


hairy


1990-09-30


mouse

This table has multiple values in some of the locations in the table.
To put the table into first normal form, I make new rows for each
multiple entry (see Table 6-4).

Table 6-4. First normal form homolog data


Name


Date


Organisms


aging


1984-07-13


human


aging


1984-07-13


worm


wrinkle


1987-08-15


human


wrinkle


1987-08-15


mouse


hairy


1990-09-30


mouse

In relational databases, a functional dependency exists
between fields for which the value of one field is always associated
with no more than one value in a second field. In Table 6-4, the value
"wrinkle" in the Name field is
associated with Organism "mouse" in
one row, and with Organism "human"
in another row. This isn't a functional dependency.
On the other hand, the value
"wrinkle" in the Name field is
associated only with the value
"1987-08-15" in the Date field.
This is a functional dependency. In a real database, you need to
ascertain that the test for a functional dependency will hold even as
the database is updated; it requires knowledge of the use of the
database and the possible range of values of the fields. In Table 6-4, the functional dependencies are:

Name        ->        Date
Date -> Name

The primary
key is a minimal group of fields that uniquely
identifies each row. In Table 6-4, I can choose
the pair of fields Organism/Name or Organism/Date as a primary key.
Remember, a field is fully functionally dependent on a primary key if
removing any field from the primary key destroys the functional
dependency.

Practically speaking, you can ensure second normal form by making
sure each table has a field that is a unique identifier: usually this
field has integer values such as 1, 2, 3, etc; no row is missing an
identifier integer, and no two rows have the same integer. More
formally, a relation in first normal form for which every
non-primary-key field is fully functionally dependent on the primary
key, is in second normal form. The rule-of-thumb solution is to divvy
up the fields into new tables, and possibly add some new unique
identifier keys. But how exactly do you do that?

Looking over the data, I notice that Gene and Date are always
associated with the same values (say they represent a gene name and
the date it was first reported). I can make a new table Genes (Table 6-5) out of them, with Gene as the primary key. I
can then make a second table Organism (Table 6-6)
from the Organism field, adding an ID field OrgID. Finally, I can
make a third table Variants (Table 6-7) with its
own field VarID unique for each row, an OrgID field, and a GeneID
field. Table 6-7 contains a row for each gene in
each organism in the database. Table 6-5 through Table 6-7 show my new
design for the homologs database (in second normal form).

Table 6-5. homologs database design in second normal form: Genes


Gene


Date


aging


1984-07-13


wrinkle


1987-08-15


hairy


1990-09-30

Table 6-6. homologs database design in second normal form: Organism


OrgID


Organism


1


human


2


worm


3


mouse

Table 6-7. homologs database design in second normal form: Variants


VarID


OrgID


Gene


1


1


aging


2


2


aging


3


1


wrinkle


4


3


wrinkle


5


3


hairy

Each table contains a field that isn't null, that is
unique for each row, and that I can use as a primary key. Such keys
are called unique
identifiers. These are Gene, OrgID, and VarID,
in Table 6-5, Table 6-6,
and Table 6-7, respectively. Notice that in each
table the other fields are functionally dependent on the unique
identifier row. And, finally, it's clear that the
definition of "fully functionally
dependent," which involves removing fields from a
primary key, is satisfied when the primary key has only one field.

My tables are now in second normal form and are much improved.
You'll notice, however, that there is still a
problem of data redundancy, which can lead to update anomalies. If,
for instance, the name of the
"aging" gene was changed to
"fountain of youth," three changes
would have to be made to this database to keep all the data in sync.
Using names as unique identifiers is a problem. Sometimes it works,
but you must consider the operation of your database over time. In
this case, genetic nomenclature may (and does) change. If I use a
numeric ID as a unique identifier for a gene name, I can handle a
gene name change by making a single update in only one table in my
database (see Table 6-8, Table 6-9, and Table 6-10).

Table 6-8. homologs database design: Genes


GeneID


Gene


Date


118


aging


1984-07-13


9223


wrinkle


1987-08-15


273


hairy


1990-09-30

Table 6-9. homologs database design: Organism


OrgID


Organism


1


human


2


worm


3


mouse

Table 6-10. homologs database design: Variants


VarID


OrgID


GeneID


1


1


118


2


2


118


3


1


9223


4


3


9223


5


3


273

Here's an example from my Linux system in which I
drop the previous definition of the homologs
database in my MySQL RDMS and redefine it with the three tables as
just shown. I just define the tables here, but I
don't populate them, as that will be done in Section 6.7.2:

[tisdall@coltrane tisdall]$ mysql -u tisdall -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or ,,.
Your MySQL connection id is 9 to server version: 3.23.41
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> drop database homologs;
Query OK, 9 rows affected (0.13 sec)
mysql> show databases;
+----------+
| Database |
+----------+
| caudyfly |
| dicty |
| gadfly |
| jkl |
| master |
| mysql |
| poetry |
| rebase |
| test |
| yeast |
+----------+
10 rows in set (0.12 sec)
mysql> create database homologs;
Query OK, 1 row affected (0.00 sec)
mysql> use homologs;
Database changed
mysql> CREATE TABLE organism (
orgid int(11) default NULL,
organism char(20) default NULL
) TYPE=MyISAM;
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE TABLE genes (
geneid int(11) default NULL,
gene char(20) default NULL,
date date default NULL
) TYPE=MyISAM;
Query OK, 0 rows affected (0.01 sec)
mysql> CREATE TABLE variants (
varid int(11) default NULL,
orgid int(11) default NULL,
geneid int(11) default NULL
) TYPE=MyISAM;
Query OK, 0 rows affected (0.00 sec)
mysql> show tables;
+--------------------+
| Tables_in_homologs |
+--------------------+
| genes |
| organism |
| variants |
+--------------------+
3 rows in set (0.00 sec)
mysql>

I have only briefly introduced some of the techniques useful in the
design of a small database. There is, of course, training and skill
involved in database design, as well as certain standard (and
occasionally competing) methodologies; far more than I have the space
to introduce here.


/ 156