An Introduction to DatabasesThe biggest problem for Excel-centric developers making the jump to applications based on a back-end database is understanding the fundamental differences between how data is treated in Excel and how data is treated in a database. A database requires a significant amount of rigor from the data when compared to an Excel worksheet. You can enter almost anything you want on a worksheet, but database tables are much more picky. Even some of the things you can enter in database tables you shouldn't.Database tables have a concept of being formally related to each other, a concept that doesn't exist at all in Excel. Modifications made to databases take effect immediately. There's no need to "save" the data. Unfortunately, there's also no way to undo a change made to data in a database once that change has been committed.As you can see, working with databases is significantly different from working with Excel worksheet tables. This section covers the most important things you need to know about how databases work and why you would want to use one.NOTEThose of you who are very familiar with databases and the terminology surrounding them will notice that we use some nonstandard terms to describe database concepts in this section. This is intentional and is designed to explain these concepts in terms that Excel programmers with limited database experience will find easier to understand. Why Use a DatabaseFor many purposes Excel is a perfectly adequate data container. However, certain common circumstances will force you to confront a database. These circumstances include the following:An existing database The data your Excel application requires may already be stored in a database. In that case, you'll need to use the existing database, if for no other purpose than to extract the data your application requires.Capacity constraints An Excel worksheet can hold only a limited amount of data, 65,536 rows to be exact. It is not uncommon for a large application to be required to deal with millions of rows of data. You cannot hope to store this much data in Excel, so you are forced to use a database, which can easily manage this volume of data.Operational requirements Excel is not a multiuser application. It does have some very unreliable sharing capabilities, but as a general rule, if one person is using an Excel workbook that contains data, anyone else who wants to use it must wait for that person to finish. Databases, by contrast, are inherently multiuser applications. They are designed from the ground up to allow many people to access the same data at the same time. If your application requires multiple users to access the same data at the same time, then you probably need a database. Relational DatabasesA relational database is a set of tables containing data organized into specific categories. Each table contains data categories in columns. Each row contains a unique instance of data for the categories defined by the columns. A relational database is structured such that data can be accessed in many different ways without having to reorganize the tables. A relational database also has the important advantage of being easy to extend. After the original database has been created, new tables can be added without requiring all existing applications that use the database to be modified. The standard method used to access relational data is Structured Query Language (SQL), which we describe in the Data Access Techniques section later in the chapter. File-Based Databases vs. Client-Server DatabasesThere are two broad categories of relational databases: file-based databases and client-server databases. The fundamental difference between them has to do with where the data access logic is executed.In a file-based database, the database consists of one or more files that simply contain the data. When an application accesses a file-based database, all of the data access logic is executed on the client computer where the application resides. The advantages of file-based databases are that they are inexpensive, relatively simple and require little ongoing maintenance. The disadvantages of file-based databases are that they can create significant network traffic, they are limited in the amount of data they can store and limited in the number of simultaneous users who can access them. Microsoft Access and Microsoft Visual FoxPro are two examples of file-based databases.In a client-server database, databases are contained within a larger server application. This database server is responsible for executing the data access requests from client applications. The advantages and disadvantages of client server databases are more or less the mirror image of those for file-based databases. Client server databases reduce network traffic by handling the data access logic on the server. They can store very large amounts of data and handle very large numbers of simultaneous users. However, client-server databases are expensive, complex and require routine maintenance to keep them operating efficiently. Microsoft SQL Server and Oracle are two examples of client-server databases. NormalizationNormalization is the process of optimizing the data in your database tables. Its goal is to eliminate redundant data and ensure that only related data is stored in each table. Normalization can be taken to extreme lengths. But for most developers most of the time, understanding the first three rules of normalization and ensuring that your database is in third normal form is all you'll ever need to do. (We cover the first three normal forms in detail in the sections that follow.)NOTEPrior to normalizing your data, you must be sure all of the rows in every table are unique. A database table should not contain duplicate rows of data and normalization will not correct this problem.Before we can talk about normalization, we need to cover the concept of a primary key. A primary key consists of one or more columns in a data table whose value(s) uniquely identify each row in the table. Let's take a look at a simple example. Figure 13-1 shows a database table containing a list of author information. Figure 13-1. The Authors Table![]() Figure 13-2. The Initial BillableHours Table[View full size image] ![]() First Normal FormThere are two requirements a data table must meet to satisfy the first normal form:All column values are atomic. This means there are no values that can be split into smaller meaningful parts.We have one column that obviously violates this requirement. The Consultant column consists of both the first name and the last name of each consultant. This data must be separated into two distinct columns, FirstName and LastName, to satisfy the first normal form.Repeating groups of data should be eliminated by moving them into new tables.The Consultant column, even after separating it into first name and last name, violates this requirement. The solution is to create a separate Consultants table to hold this data. Each consultant will be assigned a unique consultant ID number that will be used in the BillableHours table to identify the consultant.The result of transforming our BillableHours table into first normal form is two tablesthe modified BillableHours table shown in Figure 13-3 and the new Consultants table shown in Figure 13-4. The first column in the BillableHours table, which previously held each consultant's name, has been replaced with a unique ConsultantID number created in the new Consultants table. Figure 13-3. The BillableHours Table in First Normal Form[View full size image] ![]() Figure 13-4. The New Consultants Table![]() Second Normal FormThere are two requirements a data table must meet to satisfy the second normal form:The table must be in first normal form.Each successive normal form builds upon the previous normal form. Because our BillableHours table is already in first normal form, this requirement has been satisfied.Each column in the table must depend on the whole primary key. This means if any column in the primary key were removed, you could no longer uniquely identify the rows in any nonkey column in the table.The primary key in our BillableHours table consists of a combination of the ConsultantID, Date, Project and Activity columns. Do we have any columns that are not dependent on all four of these key columns? Yes. The Client column depends only on the Project column, because a project name uniquely identifies the client for whom the project is completed.To solve this problem we will remove the client column from the BillableHours table and create a new Clients table. We will also create a new Projects table that provides each project with a unique ID number, and use this project ID rather than the project name in the BillableHours table. This will serve two purposes. The new Projects table will provide a link from the BillableHours table to the Clients table (which we discuss in more detail later in the chapter). It will also enable us to handle the situation in which two clients have the same project name.The result of transforming our BillableHours table into second normal form is three tables. The modified BillableHours table, shown in Figure 13-5, the new Clients table, shown in Figure 13-6, and the new Projects table, shown in Figure 13-7 (this is in addition to the Consultants table we created in the previous section). Figure 13-5. The BillableHours Table in Second Normal Form[View full size image] ![]() Figure 13-6. The New Clients Table![]() Figure 13-7. The New Projects Table![]() Third Normal FormThere are three requirements a data table must meet to satisfy third normal form:The table must be in second normal form.This requirement has been met by the modifications we made in the previous section.Nonkey columns cannot describe other nonkey columns.This requirement can be memorably expressed as "Nonkey columns must represent the key, the whole key and nothing but the key." In our BillableHours table, the Rate column depends only on the Activity key column. All the other key columns could be removed from the table and the values in the Rate column could still be uniquely associated with the remaining Activity column.We can solve this problem by creating a new table to hold the list of Activities and their associated rates. The BillableHours table will retain only an activity ID number in place of the previous Activity and Rate columns.The table cannot contain derived data.Derived data refers to a column in a data table whose values have been created by applying a formula or transformation to the values in one or more other columns in the table. In our BillableHours table, the Charge column is a derived column that is the result of multiplying the Rate column by the Hours column. Columns containing derived data should simply be removed from the table and calculated "on the fly" whenever their values are required.The result of transforming our BillableHours table into third normal form is two tables: the modified BillableHours table, shown in Figure 13-8 and the new Activities table shown in Figure 13-9. Figure 13-8. The BillableHours Table in Third Normal Form![]() Figure 13-9. The New Activities Table![]() When Not to NormalizeIn the vast majority of cases, you will always want to follow the normalization rules outlined above when preparing your data for storage in a relational database. As with almost every other rule, however, there are exceptions.The most common exception has to do with derived columns. When we transformed our BillableHours table into third normal form we eliminated the derived column that showed the total charge for each line item. This is generally a good practice because if you have derived columns you must also create logic to ensure those columns are updated correctly if the values of any of the columns they depend on change. This creates overhead that is best deferred until you actually need to query the derived value.In some cases, however, it makes sense to store derived data. This is usually the case when the data is derived from columns that are very unlikely to change. If the columns that the derived data depends on are unlikely to change, the derived data is also unlikely to change. In this case you can improve the performance of queries that access the derived data by calculating it in advance and storing the result so the query can simply retrieve its value rather than having to calculate it on the fly. Relationships and Referential IntegrityThe ability to take advantage of relationships and referential integrity are two of the primary advantages that relational databases provide over Excel for data storage. The ability to create formal relationships between tables enables you to avoid massive repetition of data and its associated frequency of data-entry errors that lead to bad or "dirty" data. Referential integrity enables you to ensure that data entered in one table is consistent with data entered in other related tables. Neither one of these capabilities is available to data stored in Excel. Foreign KeysBefore we can discuss relationships and referential integrity, we need to have a firm understanding of foreign keys. A foreign key is a column in one table (the referencing table) containing data that uniquely identifies records from another table (the referenced table). The foreign key serves to connect the two tables and ensures that only valid data from the referenced table can be entered in the referencing table. The ActivityID column in the BillableHours table shown in Figure 13-8 is a foreign key that refers to the ActivityID column in the Activities table in Figure 13-9.For a column to serve as a foreign key, it must either be the primary key of the referenced table or a column on which a unique index has been defined. We cover unique indexes later in the chapter. A foreign key can also consist of multiple columns, as long as the constraints above are followed. Foreign keys provide the basis of creating relationships between tables.In this section we will be using the Microsoft Access Relationships window to visually display the effect of relationships and referential integrity. Figure 13-10 shows the relationship described above between the BillableHours table and the Activities table. Each table's primary key columns are shown in bold. Figure 13-10. The Relationship Between the BillableHours and Activities Tables![]() Types of RelationshipsThe vast majority of relationships between database tables fall into one of three categories: one to one, one to many, and many to many. One to OneIn this type of relationship, each row in one table is associated with a single row in another table. One-to-one relationships are not frequently encountered. The most common reason for creating a one-to-one relationship is to divide a single table into its most frequently accessed sections and least frequently accessed sections. This can improve performance by making the most frequently accessed table smaller. All things being equal, the rows in a small table can be accessed more quickly than the rows in a large table. A smaller table is also more likely to remain in memory or require less bandwidth to transfer to the client depending on the type of database being used. Consider the hypothetical Parts table shown in Figure 13-11. Figure 13-11. The Parts Table![]() Figure 13-12. The Parts Table Divided into Two Tables with a One-to-One Relationship![]() One to ManyThis is by far the most common type of relationship. In a one-to-many relationship, a single row in one table can be related to zero, one or many rows in another table. As Figure 13-13 shows, every relationship created in the process of normalizing our BillableHours table is a one-to-many relationship. Figure 13-13. One-to-Many Relationships![]() Many to ManyIn this type of relationship, each row in one table can be related to multiple rows in the other table. This concept can be a bit difficult to visualize. We demonstrate it by extending the example we used in the Normalization section to include the role each consultant plays on each project. Our list of roles might look like the Roles table shown in Figure 13-14. Figure 13-14. The Roles Table![]() Figure 13-15. Many-to-Many Relationships Resolved into One-to-Many Relationships[View full size image] ![]() Referential IntegrityAfter your database has been fully normalized and you have established relationships among your tables, referential integrity ensures that the data in those tables remains valid when data is added, modified or deleted. In Figure 13-13, for example, referential integrity ensures that invalid rows cannot be added to the BillableHours table by enforcing the fact that the ConsultantID, ProjectID and ActivityID foreign key columns all refer to valid rows in their respective source tables. (The validity of the Date and Hours columns is ensured by domain constraints that can be added to those columns. The topic of domain constraints is beyond the scope of this chapter.)NOTEAll modern relational databases support the concept of referential integrity, but the steps required to implement it are completely different from one database to the next. Consult the documentation for your database to determine how to implement referential integrity. Natural vs. Artificial Primary KeysThere is an ongoing debate within the database community over whether the primary key for a table should be natural, which is to say it should consist of one or more columns that naturally occur in the table, or whether it should be artificial. An artificial key is a unique but otherwise meaningless number that is automatically added to each row in a table. Most modern relational databases provide a special column type that will automatically generate a unique number for every row added to a table.To oversimplify the debate, natural keys tend to be advocated by people who primarily design databases, whereas artificial keys tend to be advocated by people who must live and work with databases. The argument really comes down to two points: providing easily usable foreign keys and ensuring the true uniqueness of records in a table.The formal task of a primary key can be divided into two very similar but subtly different tasks: uniquely identifying each row in a table and ensuring every row in a table is unique. The first task provides a foreign key that can be used by other tables to uniquely identify a specific row in a table while the second task enforces data integrity by ensuring that you cannot create duplicate records in a table.When you have a table with a primary key that is performing both tasks simultaneously, problems begin to arise when you need to reference that table from another table through the use of a foreign key. Take, for example, our BillableHours table. This table requires the use of four columns to construct a natural primary key. If you need to reference the BillableHours table from another table in your database, you will need to add a foreign key to that table consisting of all four of these columns. This is because all four columns are required to uniquely identify a record in the BillableHours table. As you can imagine, this can become very unwieldy. Assume we have an Invoices table that needs to reference the BillableHours table. An example of how this would be accomplished using a natural key is shown in Figure 13-16. Figure 13-16. Using a Natural Key for the Billable Hours Table![]() Figure 13-17. Using an Artificial Key for the BillableHours Table![]() ![]() |