| 6.6 Relational Database DesignDatabase design is the process of effectively organizing data into tables in a relational database. When thinking about how to organize a database you need to ask, "What fields should I put together into tables, and how should I interrelate the tables?" In this section I will show you a short example that demonstrates some common and useful techniques for doing just that.Relational database software projects are best broken down into separate stages. This list from Database Systems: a Practical Approach to Design Implementation, and Management (see Section 6.10), shows the typical stages of database design and construction:Database planningSystem definitionRequirements collection and analysisDatabase designDBMS selectionApplication designPrototypingImplementationData conversion and loadingTestingOperational maintenance For the small biology lab in which database programming may be a one-person project, some of these stages may be brief and informal, but they still apply.How should the tables be defined for a new database? The answer depends on the problems to be answered by the data, but it's also largely a matter of common sense and a feel for the data. The database beginner typically looks at the data, tries her hand at a few designs, and begins to get a sense of how tables can be used for a specific problem. Let yourself experiment and try a few alternatives, and you'll soon get the hang of it.Tables are commonly interrelated by indexing and by joining fields from different tables. The SQL language implemented with your DBMS provides these abilities. Also, a group of techniques called normalization can help you produce a good design and avoid some problems. Simply putting data into tables doesn't guarantee a good design.A set of rules called normal forms helps you arrange the data into tables in a way that avoids certain problems. One such problem is data redundancy, which is unnecessary duplication of data in different tables. A related problem is update anomalies caused by having the same data in more than one location. When such data is updated, copies may not be updated properly, and the database can become inaccurate.Here are some simple rules to follow when designing your database:Each entry of each table has a single value. This is first normal form.Each table has a a unique identifier (called a primary key) for each row. This is second normal form.Names aren't used as identifiers because they can lead to data redundancy. Third and other normal forms as well as other design considerations aren't covered in this book due to space limitations. See Section 6.10 at the end of the book for more information about relational database design. Consider Table 6-3, which shows this alternate, unnormalized version of my homologs database. 
 To put the table into first normal form, I make new rows for each multiple entry (see Table 6-4). 
 between fields for which the value of one field is always associated with no more than one value in a second field. In Table 6-4, the value "wrinkle" in the Name field is associated with Organism "mouse" in one row, and with Organism "human" in another row. This isn't a functional dependency. On the other hand, the value "wrinkle" in the Name field is associated only with the value "1987-08-15" in the Date field. This is a functional dependency. In a real database, you need to ascertain that the test for a functional dependency will hold even as the database is updated; it requires knowledge of the use of the database and the possible range of values of the fields. In Table 6-4, the functional dependencies are: Name -> Date The primary key is a minimal group of fields that uniquely identifies each row. In Table 6-4, I can choose the pair of fields Organism/Name or Organism/Date as a primary key. Remember, a field is fully functionally dependent on a primary key if removing any field from the primary key destroys the functional dependency.Practically speaking, you can ensure second normal form by making sure each table has a field that is a unique identifier: usually this field has integer values such as 1, 2, 3, etc; no row is missing an identifier integer, and no two rows have the same integer. More formally, a relation in first normal form for which every non-primary-key field is fully functionally dependent on the primary key, is in second normal form. The rule-of-thumb solution is to divvy up the fields into new tables, and possibly add some new unique identifier keys. But how exactly do you do that?Looking over the data, I notice that Gene and Date are always associated with the same values (say they represent a gene name and the date it was first reported). I can make a new table Genes (Table 6-5) out of them, with Gene as the primary key. I can then make a second table Organism (Table 6-6) from the Organism field, adding an ID field OrgID. Finally, I can make a third table Variants (Table 6-7) with its own field VarID unique for each row, an OrgID field, and a GeneID field. Table 6-7 contains a row for each gene in each organism in the database. Table 6-5 through Table 6-7 show my new design for the homologs database (in second normal form). 
 
 
 unique for each row, and that I can use as a primary key. Such keys are called unique identifiers. These are Gene, OrgID, and VarID, in Table 6-5, Table 6-6, and Table 6-7, respectively. Notice that in each table the other fields are functionally dependent on the unique identifier row. And, finally, it's clear that the definition of "fully functionally dependent," which involves removing fields from a primary key, is satisfied when the primary key has only one field.My tables are now in second normal form and are much improved. You'll notice, however, that there is still a problem of data redundancy, which can lead to update anomalies. If, for instance, the name of the "aging" gene was changed to "fountain of youth," three changes would have to be made to this database to keep all the data in sync. Using names as unique identifiers is a problem. Sometimes it works, but you must consider the operation of your database over time. In this case, genetic nomenclature may (and does) change. If I use a numeric ID as a unique identifier for a gene name, I can handle a gene name change by making a single update in only one table in my database (see Table 6-8, Table 6-9, and Table 6-10). 
 
 
 drop the previous definition of the homologs database in my MySQL RDMS and redefine it with the three tables as just shown. I just define the tables here, but I don't populate them, as that will be done in Section 6.7.2: [tisdall@coltrane tisdall]$ mysql -u tisdall -p I have only briefly introduced some of the techniques useful in the design of a small database. There is, of course, training and skill involved in database design, as well as certain standard (and occasionally competing) methodologies; far more than I have the space to introduce here. | 
 لطفا منتظر باشید ...
        لطفا منتظر باشید ...
     
                     
                
                