Professional Excel Development [Electronic resources] : The Definitive Guide to Developing Applications Using Microsoft® Excel and VBA® نسخه متنی

An Introduction to Databases

The biggest problem for Excel-centric developers making the jump to applications based on a back-end database is understanding the fundamental differences between how data is treated in Excel and how data is treated in a database. A database requires a significant amount of rigor from the data when compared to an Excel worksheet. You can enter almost anything you want on a worksheet, but database tables are much more picky. Even some of the things you can enter in database tables you shouldn't.

Database tables have a concept of being formally related to each other, a concept that doesn't exist at all in Excel. Modifications made to databases take effect immediately. There's no need to "save" the data. Unfortunately, there's also no way to undo a change made to data in a database once that change has been committed.

As you can see, working with databases is significantly different from working with Excel worksheet tables. This section covers the most important things you need to know about how databases work and why you would want to use one.

NOTE

Those of you who are very familiar with databases and the terminology surrounding them will notice that we use some nonstandard terms to describe database concepts in this section. This is intentional and is designed to explain these concepts in terms that Excel programmers with limited database experience will find easier to understand.

Why Use a Database

For many purposes Excel is a perfectly adequate data container. However, certain common circumstances will force you to confront a database. These circumstances include the following:

An existing database
The data your Excel application requires may already be stored in a database. In that case, you'll need to use the existing database, if for no other purpose than to extract the data your application requires.

Capacity constraints
An Excel worksheet can hold only a limited amount of data, 65,536 rows to be exact. It is not uncommon for a large application to be required to deal with millions of rows of data. You cannot hope to store this much data in Excel, so you are forced to use a database, which can easily manage this volume of data.

Operational requirements
Excel is not a multiuser application. It does have some very unreliable sharing capabilities, but as a general rule, if one person is using an Excel workbook that contains data, anyone else who wants to use it must wait for that person to finish. Databases, by contrast, are inherently multiuser applications. They are designed from the ground up to allow many people to access the same data at the same time. If your application requires multiple users to access the same data at the same time, then you probably need a database.

Relational Databases

A relational database is a set of tables containing data organized into specific categories. Each table contains data categories in columns. Each row contains a unique instance of data for the categories defined by the columns. A relational database is structured such that data can be accessed in many different ways without having to reorganize the tables. A relational database also has the important advantage of being easy to extend. After the original database has been created, new tables can be added without requiring all existing applications that use the database to be modified. The standard method used to access relational data is Structured Query Language (SQL), which we describe in the Data Access Techniques section later in the chapter.

File-Based Databases vs. Client-Server Databases

There are two broad categories of relational databases: file-based databases and client-server databases. The fundamental difference between them has to do with where the data access logic is executed.

In a file-based database, the database consists of one or more files that simply contain the data. When an application accesses a file-based database, all of the data access logic is executed on the client computer where the application resides. The advantages of file-based databases are that they are inexpensive, relatively simple and require little ongoing maintenance. The disadvantages of file-based databases are that they can create significant network traffic, they are limited in the amount of data they can store and limited in the number of simultaneous users who can access them. Microsoft Access and Microsoft Visual FoxPro are two examples of file-based databases.

In a client-server database, databases are contained within a larger server application. This database server is responsible for executing the data access requests from client applications. The advantages and disadvantages of client server databases are more or less the mirror image of those for file-based databases. Client server databases reduce network traffic by handling the data access logic on the server. They can store very large amounts of data and handle very large numbers of simultaneous users. However, client-server databases are expensive, complex and require routine maintenance to keep them operating efficiently. Microsoft SQL Server and Oracle are two examples of client-server databases.

Normalization

Normalization is the process of optimizing the data in your database tables. Its goal is to eliminate redundant data and ensure that only related data is stored in each table. Normalization can be taken to extreme lengths. But for most developers most of the time, understanding the first three rules of normalization and ensuring that your database is in third normal form is all you'll ever need to do. (We cover the first three normal forms in detail in the sections that follow.)

NOTE

Prior to normalizing your data, you must be sure all of the rows in every table are unique. A database table should not contain duplicate rows of data and normalization will not correct this problem.

Before we can talk about normalization, we need to cover the concept of a primary key. A primary key consists of one or more columns in a data table whose value(s) uniquely identify each row in the table. Let's take a look at a simple example. Figure 13-1 shows a database table containing a list of author information.

Figure 13-1. The Authors Table

Notice that there are two instances of the name Robert in the FirstName column. These names refer to two different people, but if you were trying to use only this column to identify rows there would be no way to distinguish these from duplicate entries. To uniquely identify rows in this table, we must designate the FirstName and LastName columns as the primary key. If you combine the values of these two columns there is no longer any duplication, so all rows are uniquely identified. In the text that follows, primary key columns are often referred to simply as key columns, whereas any columns that do not belong to the primary key are referred to as nonkey columns.

For our discussion of normalization, we use the BillableHours table shown in Figure 13-2. This table contains data that might have been extracted from one of our PETRAS time-entry workbooks. As it stands, this data is very denormalized and not suitable for use in a relational database. The primary key for this table consists of the combination of the Consultant, Date, Project and Activity columns.

Figure 13-2. The Initial BillableHours Table

[View full size image]

First Normal Form

There are two requirements a data table must meet to satisfy the first normal form:

All column values are atomic. This means there are no values that can be split into smaller meaningful parts.

We have one column that obviously violates this requirement. The Consultant column consists of both the first name and the last name of each consultant. This data must be separated into two distinct columns, FirstName and LastName, to satisfy the first normal form.

Repeating groups of data should be eliminated by moving them into new tables.

The Consultant column, even after separating it into first name and last name, violates this requirement. The solution is to create a separate Consultants table to hold this data. Each consultant will be assigned a unique consultant ID number that will be used in the BillableHours table to identify the consultant.

The result of transforming our BillableHours table into first normal form is two tablesthe modified BillableHours table shown in Figure 13-3 and the new Consultants table shown in Figure 13-4. The first column in the BillableHours table, which previously held each consultant's name, has been replaced with a unique ConsultantID number created in the new Consultants table.

Figure 13-3. The BillableHours Table in First Normal Form

[View full size image]

Figure 13-4. The New Consultants Table

This not only allows us to satisfy first normal form, but also allows us to handle the situation in which two consultants have the same first and last names. In the original table there would have been no way to distinguish between two consultants with the same name.

Before we go any further we must explain the concept of a foreign key. A foreign key is a column in one table that uniquely identifies records in some other table. In the case above, the ConsultantID column in the BillableHours table is a foreign key column, each of whose values identify a single, unique consultant in the new Consultants table. The use of foreign keys is ubiquitous in relational databases. As we will see in the upcoming section on Relationships and Referential Integrity, foreign keys are used to create connections between related tables in a database.

Second Normal Form

There are two requirements a data table must meet to satisfy the second normal form:

The table must be in first normal form.

Each successive normal form builds upon the previous normal form. Because our BillableHours table is already in first normal form, this requirement has been satisfied.

Each column in the table must depend on the whole primary key. This means if any column in the primary key were removed, you could no longer uniquely identify the rows in any nonkey column in the table.

The primary key in our BillableHours table consists of a combination of the ConsultantID, Date, Project and Activity columns. Do we have any columns that are not dependent on all four of these key columns? Yes. The Client column depends only on the Project column, because a project name uniquely identifies the client for whom the project is completed.

To solve this problem we will remove the client column from the BillableHours table and create a new Clients table. We will also create a new Projects table that provides each project with a unique ID number, and use this project ID rather than the project name in the BillableHours table. This will serve two purposes. The new Projects table will provide a link from the BillableHours table to the Clients table (which we discuss in more detail later in the chapter). It will also enable us to handle the situation in which two clients have the same project name.

The result of transforming our BillableHours table into second normal form is three tables. The modified BillableHours table, shown in Figure 13-5, the new Clients table, shown in Figure 13-6, and the new Projects table, shown in Figure 13-7 (this is in addition to the Consultants table we created in the previous section).

Figure 13-5. The BillableHours Table in Second Normal Form

[View full size image]

Figure 13-6. The New Clients Table

Figure 13-7. The New Projects Table

NOTE

The sharp-eyed among you may have noticed that the Rate column in the BillableHours table also violates second normal form. This is absolutely correct, so give yourself a gold star if you caught this. However, this column is part of such a good example for demonstrating third normal form that we've decided to postpone it for that topic.

Third Normal Form

There are three requirements a data table must meet to satisfy third normal form:

The table must be in second normal form.

This requirement has been met by the modifications we made in the previous section.

Nonkey columns cannot describe other nonkey columns.

This requirement can be memorably expressed as "Nonkey columns must represent the key, the whole key and nothing but the key." In our BillableHours table, the Rate column depends only on the Activity key column. All the other key columns could be removed from the table and the values in the Rate column could still be uniquely associated with the remaining Activity column.

We can solve this problem by creating a new table to hold the list of Activities and their associated rates. The BillableHours table will retain only an activity ID number in place of the previous Activity and Rate columns.

The table cannot contain derived data.

Derived data refers to a column in a data table whose values have been created by applying a formula or transformation to the values in one or more other columns in the table. In our BillableHours table, the Charge column is a derived column that is the result of multiplying the Rate column by the Hours column. Columns containing derived data should simply be removed from the table and calculated "on the fly" whenever their values are required.

The result of transforming our BillableHours table into third normal form is two tables: the modified BillableHours table, shown in Figure 13-8 and the new Activities table shown in Figure 13-9.

Figure 13-8. The BillableHours Table in Third Normal Form

Figure 13-9. The New Activities Table

Note that our third normal form BillableHours table consists of a set of primary key columns, the ConsultantID, Date, ProjectID and ActivityID columns, and a nonkey Hours column that depends on the entire primary key and nothing but the primary key. If any primary key column were removed from the table, it would no longer be possible to uniquely identify any of the entries in the Hours column. Because there are no other nonkey columns in the table, the Hours column cannot possibly depend on any nonkey columns. Our data is now ready to be used in a relational database.

When Not to Normalize

In the vast majority of cases, you will always want to follow the normalization rules outlined above when preparing your data for storage in a relational database. As with almost every other rule, however, there are exceptions.

The most common exception has to do with derived columns. When we transformed our BillableHours table into third normal form we eliminated the derived column that showed the total charge for each line item. This is generally a good practice because if you have derived columns you must also create logic to ensure those columns are updated correctly if the values of any of the columns they depend on change. This creates overhead that is best deferred until you actually need to query the derived value.

In some cases, however, it makes sense to store derived data. This is usually the case when the data is derived from columns that are very unlikely to change. If the columns that the derived data depends on are unlikely to change, the derived data is also unlikely to change. In this case you can improve the performance of queries that access the derived data by calculating it in advance and storing the result so the query can simply retrieve its value rather than having to calculate it on the fly.

Relationships and Referential Integrity

The ability to take advantage of relationships and referential integrity are two of the primary advantages that relational databases provide over Excel for data storage. The ability to create formal relationships between tables enables you to avoid massive repetition of data and its associated frequency of data-entry errors that lead to bad or "dirty" data. Referential integrity enables you to ensure that data entered in one table is consistent with data entered in other related tables. Neither one of these capabilities is available to data stored in Excel.

Foreign Keys

Before we can discuss relationships and referential integrity, we need to have a firm understanding of foreign keys. A foreign key is a column in one table (the referencing table) containing data that uniquely identifies records from another table (the referenced table). The foreign key serves to connect the two tables and ensures that only valid data from the referenced table can be entered in the referencing table. The ActivityID column in the BillableHours table shown in Figure 13-8 is a foreign key that refers to the ActivityID column in the Activities table in Figure 13-9.

For a column to serve as a foreign key, it must either be the primary key of the referenced table or a column on which a unique index has been defined. We cover unique indexes later in the chapter. A foreign key can also consist of multiple columns, as long as the constraints above are followed. Foreign keys provide the basis of creating relationships between tables.

In this section we will be using the Microsoft Access Relationships window to visually display the effect of relationships and referential integrity. Figure 13-10 shows the relationship described above between the BillableHours table and the Activities table. Each table's primary key columns are shown in bold.

Figure 13-10. The Relationship Between the BillableHours and Activities Tables

Types of Relationships

The vast majority of relationships between database tables fall into one of three categories: one to one, one to many, and many to many.

One to One

In this type of relationship, each row in one table is associated with a single row in another table. One-to-one relationships are not frequently encountered. The most common reason for creating a one-to-one relationship is to divide a single table into its most frequently accessed sections and least frequently accessed sections. This can improve performance by making the most frequently accessed table smaller. All things being equal, the rows in a small table can be accessed more quickly than the rows in a large table. A smaller table is also more likely to remain in memory or require less bandwidth to transfer to the client depending on the type of database being used. Consider the hypothetical Parts table shown in Figure 13-11.

Figure 13-11. The Parts Table

Assume this table has a very large number of rows, and the only columns you want to access in most cases are the PartNumber, PartName, UnitPrice and Weight columns. You can improve the performance of queries against your Parts table by dividing it into two tables, with the most frequently accessed columns in one table and the rest in a second table. Each row in these two new tables will be related to exactly one row in the other table and the two tables will share a primary key. The result of this partitioning is shown in Figure 13-12.

Figure 13-12. The Parts Table Divided into Two Tables with a One-to-One Relationship

The Access Relationship window indicates a one-to-one relationship by placing the number 1 on each side of the relationship line connecting the two tables.

One to Many

This is by far the most common type of relationship. In a one-to-many relationship, a single row in one table can be related to zero, one or many rows in another table. As Figure 13-13 shows, every relationship created in the process of normalizing our BillableHours table is a one-to-many relationship.

Figure 13-13. One-to-Many Relationships

The Access Relationship window indicates a one-to-many relationship by placing the number 1 on the one side of the relationship line and the infinity symbol on the many side of the relationship line. Each row in the Consultants, Projects and Activities table can be related to zero, one or many rows in the BillableHours table. The one or many part of the relationship should be obvious; the zero part may require some additional explanation.

If a new consultant has joined the company but not yet logged any billable hours, that consultant would be represented by a row in the Consultants table. However, the row representing the new consultant would not be associated with any rows in the BillableHours table. The Consultants table would have a one-to-many relationship with the BillableHours table, and the row representing the new consultant in the Consultants table would be associated with zero rows in the BillableHours table.

NOTE

Some of you may be thinking that Excel's data-validation list feature provides the same benefit as a one-to-many database relationship. Unfortunately, this is not true. Although it's a very useful feature, a data-validation list only enforces a one-time relationship check, at the time an entry is selected from the list. Unlike a database relationship, after an entry has been selected from a data-validation list it does not maintain any connection to the list from which it was selected.

Many to Many

In this type of relationship, each row in one table can be related to multiple rows in the other table. This concept can be a bit difficult to visualize. We demonstrate it by extending the example we used in the Normalization section to include the role each consultant plays on each project. Our list of roles might look like the Roles table shown in Figure 13-14.

Figure 13-14. The Roles Table

Each project will require multiple roles in order to complete it, and consultants will serve in different roles on different projects. A consultant might be a programmer on one project and a tester on another. On a small project, in fact, one consultant could very well serve multiple roles. This means that role information cannot be attached to either the Projects table or the Consultants table because there are many-to-many relationships between projects and roles as well as consultants and roles. Each project requires multiple roles and each role applies to multiple projects. Likewise, each consultant can serve multiple roles and each role can be served by multiple consultants.

Most relational databases cannot directly represent a many-to-many relationship. When this type of relationship is encountered, however, it can always be broken up into multiple one-to-many relationships linked by an intermediate table. In our Projects, Consultants and Roles example, the many-to-many relationships would be represented in the database as shown in Figure 13-15.

Figure 13-15. Many-to-Many Relationships Resolved into One-to-Many Relationships

[View full size image]

The ProjectConsultantRole table serves as the intermediate table that enables us to convert the many-to-many relationships between the Projects and Roles tables and the Consultants and Roles tables into one-to-many relationships with the ProjectConsultantRole table in the middle. The Projects, Consultants and Roles tables now have one-to-many relationships with the ProjectConsultantRole table, allowing the many-to-many relationships between the Projects and Roles tables and the Consultants and Roles tables to be implemented.

Referential Integrity

After your database has been fully normalized and you have established relationships among your tables, referential integrity ensures that the data in those tables remains valid when data is added, modified or deleted. In Figure 13-13, for example, referential integrity ensures that invalid rows cannot be added to the BillableHours table by enforcing the fact that the ConsultantID, ProjectID and ActivityID foreign key columns all refer to valid rows in their respective source tables. (The validity of the Date and Hours columns is ensured by domain constraints that can be added to those columns. The topic of domain constraints is beyond the scope of this chapter.)

NOTE

All modern relational databases support the concept of referential integrity, but the steps required to implement it are completely different from one database to the next. Consult the documentation for your database to determine how to implement referential integrity.

Natural vs. Artificial Primary Keys

There is an ongoing debate within the database community over whether the primary key for a table should be natural, which is to say it should consist of one or more columns that naturally occur in the table, or whether it should be artificial. An artificial key is a unique but otherwise meaningless number that is automatically added to each row in a table. Most modern relational databases provide a special column type that will automatically generate a unique number for every row added to a table.

To oversimplify the debate, natural keys tend to be advocated by people who primarily design databases, whereas artificial keys tend to be advocated by people who must live and work with databases. The argument really comes down to two points: providing easily usable foreign keys and ensuring the true uniqueness of records in a table.

The formal task of a primary key can be divided into two very similar but subtly different tasks: uniquely identifying each row in a table and ensuring every row in a table is unique. The first task provides a foreign key that can be used by other tables to uniquely identify a specific row in a table while the second task enforces data integrity by ensuring that you cannot create duplicate records in a table.

When you have a table with a primary key that is performing both tasks simultaneously, problems begin to arise when you need to reference that table from another table through the use of a foreign key. Take, for example, our BillableHours table. This table requires the use of four columns to construct a natural primary key. If you need to reference the BillableHours table from another table in your database, you will need to add a foreign key to that table consisting of all four of these columns. This is because all four columns are required to uniquely identify a record in the BillableHours table. As you can imagine, this can become very unwieldy. Assume we have an Invoices table that needs to reference the BillableHours table. An example of how this would be accomplished using a natural key is shown in Figure 13-16.

Figure 13-16. Using a Natural Key for the Billable Hours Table

The alternative is to separate the two tasks of the primary key between two different constructs. The task of uniquely identifying a row in the table can be accomplished with a single column that provides a unique number for each row. This column becomes the primary key for the table. When you need to reference this table from another table, the only column you will need to import as the foreign key is the single numeric primary key column. In Figure 13-17 we have added an artificial primary key to the BillableHours table. This is just a number that uniquely identifies each row in the table. See how much simpler this makes the task of referencing a record in the BillableHours table from the Invoices table.

Figure 13-17. Using an Artificial Key for the BillableHours Table

If all we did was create an artificial primary key consisting of an automatically generated unique value that had no real meaning, you could easily enter two completely identical rows in the BillableHours table and the database would make them artificially unique by creating a unique value in the artificial primary key column. The task of ensuring data uniqueness can be accomplished with a unique index.

A unique index is a special construct you can create that includes one or more columns in a table. The values in the column(s) on which the unique index is defined must be unique. The index will not allow duplicate values to be entered. In the case of our BillableHours table, a unique index would include the ConsultantID, Date, ProjectID and ActivityID columns. Note that these columns are the columns that originally formed the natural primary key.