Data Processing FeaturesAfter we've arranged our data and/or calculations in a tabular form, we can use Excel's data processing features to manipulate it. Obviously, the actual manipulation that needs to be done will depend totally on the problem being solved, so this section describes some of the techniques available and how they can be linked together. It is up to the reader to decide which techniques are most appropriate for their situation! It Doesn't Have to Be DataEven though we've only considered data so far, Excel is not a database; it's designed to manipulate numbers. We are not forced to only include raw data in our structured ranges. Some of the most powerful number crunching comes from organizing our formulas in a structured, tabular form, then using Excel's data processing features on the results of those calculations. The only caveat with using formulas in our data tables is that the data processing (consolidation, pivot table, filter and so on) will not be updated or refreshed when the source formulas are recalculated. We can easily work around this by including some VBA code to trigger the processing from within the Worksheet_Calculate event (which conveniently occurs after the sheet has been calculated). Pivot CachesWhen Microsoft introduced pivot tables in Excel 95, they realized that it would be much more efficient to store the source data for the pivot tables in a hidden form within the workbook file than within worksheet cells. The hidden data stores are called pivot caches and are just like query tables, but without the visual representation and can only be used by pivot tables. They suffer from the same problem of including hard-coded paths to the database within their connection and SQL query information, and the same solution of using VBA to define the connection string and query text applies. Pivot TablesPivot tables are Excel's premier data processing feature. Using either a table in a worksheet as the data source or a pivot cache for direct connection to a database, pivot tables enable us to filter, group, sort, total, count and drill down into our data. Most books about using Excel include a section explaining how to set up and manipulate pivot tables, so we're assuming you already know the basics. With pivot tables, there is only one level of difficulty, so once the basics are understood, the skill is in knowing how to most efficiently use pivot tables to analyze our data and integrate them with the rest of our data processing.The best way to use pivot tables for data processing in most applications is to create all the pivot tables beforehand, on individual worksheets. If the pivot tables are set up to connect directly to a database, it is usually more efficient to have a single, large pivot cache that feeds multiple pivot tables, than having separate queries for each pivot table. The easiest way to do this is to create the first pivot table normally, then base subsequent pivot tables on the first (in step one of the Pivot Table Wizard). By using the same pivot cache, Excel will only need to store one copy of the data within the workbook and all the pivot tables will be updated when the cache is refreshed. Although it is possible to use the Excel object model to create and modify pivot tables, this should be kept to a minimum, because Excel refreshes and redraws the table with every modification. With anything other than a trivial amount of data, this quickly becomes extremely slow.In the NWindOrders.xls example workbook, found on the CD in the \Concepts\Ch14Data Manipulation Techniques folder, the OrderData worksheet contains a query table which retrieves information about each order from the NorthWind sample Access database, shown in Figure 14-4. Figure 14-4. The Query Table for NorthWind Order Details[View full size image] ![]() Figure 14-5. Two Diverse PivotTables, Derived from the Same Pivot Cache[View full size image] ![]() Calculated Pivot FieldsExcel enables us to add extra fields and data to our pivot caches, in the form of calculated fields and calculated items. A calculated field is an extra column, derived from one or more other fields, such as defining a calculated Profit field as Revenue Cost, where Revenue and Cost are fields in the data set. These are of very limited use, because Excel always does the Sum of the individual fields before performing the calculation, so we get the following:Sum of Profit = Sum of Revenue Sum of Cost This is okay and marginally useful for the simple cases, but is useless and dangerous if a more complex formula is required. Looking at the NorthWind data in Figure 14-4, we have fields for the Quantity, UnitPrice and Discount, so we might be tempted to add a calculated Revenue field as Quantity x (UnitPrice Discount). Unfortunately, as Excel sums the individual fields before doing the calculation, we end up multiplying the total quantity sold by the sum of all the prices minus the sum of all the discounts! Unless that is what you really require, it is far better and much safer to add the additional fields at the raw data level, either by including calculated fields in the SQL query, or by adding extra columns alongside the query table, as shown in Figure 14-6. Figure 14-6. Adding a Calculated Field Alongside a Query Table![]() Listing 14-3. Updating Defined Names and Refreshing Pivot Caches When a Query Table Is Refreshed
Data ConsolidationProbably the most little-known of Excel's data processing features is its ability to consolidate numeric data from multiple ranges into a single table, matching the data using the labels in both the first row and first column of each range. If a single cell is selected, Excel first creates a unique list of all the column headers and a unique list of all the row headers (that is, the labels in the first column) to create the result table. If a range is already selected, Excel uses the row and column headers that are already there. It then adds (or counts, averages, max, min and so on) all the items of data that share the same row and column header.This proves extremely useful when consolidating data and calculations that occur over a time series. For example, imagine a project to analyze whether to build and run a new theme park. You might have a workbook for the construction planning, another for the ongoing operations, another for concessions and retail planning and so forth. Each workbook contains a summary table showing the costs, revenue and cash flow for each year. A greatly simplified version is shown in Figure 14-7, but imagine the Construction Planning and Operations tables actually exist in different workbooks and each has been given a defined name. Figure 14-7. Simplified Project Planning![]() Figure 14-8. The Data Consolidation Dialog![]() Figure 14-9. The Consolidated Results![]() Advanced FilteringThe ability to extract specific records from a large data set is often the key to successful and efficient data processing. Pivot tables provide some rudimentary filtering capability, but only by hiding individual items of data. Excel's Advanced Filter feature enables us to filter the data using much more complex expressions, either just by hiding records in the original table, or more commonly by copying the resulting records to a new location for further processing. The Advanced Filter dialog is accessed by clicking the Data > Filter > Advanced Filter menu and is shown in Figure 14-10. Figure 14-10. The Advanced Filter Dialog![]() When the OK button is clicked, Excel scans through the source data range, checks each record against the criteria specified in the criteria range and copies the matching records to the next row in the Copy to range. The result is a subset of the original data, arranged as a simple structured data areathat is, not as a list or query table.Unfortunately, every time the Advanced Filter dialog is shown, the List range is either blanked out or guessed and the Action defaults to Filter in place. It would be much more helpful if Excel remembered the source range and action, which would be possible if only Excel created the filter as a query table. If that were the case, we would also have a one-click Refresh option and be able to tell Excel to automatically copy down adjacent formulas. The best we can do in current versions is to give our ranges some specific names. If the workbook contains the defined names Database, Criteria and/or Extract, Excel will populate the dialog using the ranges pointed to by those names.To save you some frustration if you're working through these examples, we've included the routine shown in Listing 14-14 in the example workbook to refresh the filter without showing the Advanced Filter dialog. Note that in VBA, we use the AdvancedFilter method on the source data range and specify the range to copy the filtered data to. Listing 14-4. Advanced Filtering with VBA
Criteria RangesThe criteria range is used to specify the equivalent of a SQL WHERE clause, telling Excel which records to return. Figure 14-11 shows an example of a criteria range, in A1:B3. Figure 14-11. An Advanced Filter Criteria Range![]() Figure 14-12. Beverages from the UK or USA![]() Figure 14-13. Beverages from Only the UK or USA, and Not Ukraine![]() Figure 14-14. Specifying a Range of Matches by Repeating the Field Name![]() Figure 14-15. Filtering Between to Dates in the USA![]() Figure 14-16. Filtering Using a Formula![]() |