DATA MINING PROCESS
The goal of data mining is to obtain useful knowledge from an analysis of collections of data. Such a task is inherently interactive and iterative. As a result, a typical data-mining system will go through several phases. The phases described below start with the raw data and finish with the extracted knowledge produced by the following stages:
Selection — Selecting or segmenting the data according to some criteria.
Preprocessing — The data-cleansing stage, in which information that is deemed unnecessary and may slow down queries is removed.
Transformation — The data is transformed, for example by adding overlays such as demographic overlays, so that it becomes usable and navigable.
Data mining — This stage is concerned with the extraction of patterns from the data.
Interpretation and evaluation — The patterns identified by the system are interpreted into knowledge that can then be used to support human decision-making, e.g., prediction and classification tasks, summarizing the contents of a database, or explaining observed phenomena (Han & Kamber, 2001).
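As a concrete illustration, the following sketch walks a small, hypothetical customer data set through the five phases. The file name, column names, and segmentation criterion are invented for the example, and pandas and scikit-learn are assumed as the tooling rather than prescribed here.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Selection: segment the raw data according to some criterion
# (the CSV file and the "region" column are hypothetical).
data = pd.read_csv("customers.csv")
selected = data[data["region"] == "northeast"]

# Preprocessing: remove information deemed unnecessary and drop incomplete rows.
cleaned = selected.drop(columns=["free_text_notes"]).dropna()

# Transformation: derive a field that makes the data more usable,
# analogous to adding an overlay.
cleaned = cleaned.assign(spend_per_visit=cleaned["total_spend"] / cleaned["visits"])

# Data mining: extract patterns, here clusters of similar customers.
features = cleaned[["age", "income", "spend_per_visit"]]
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# Interpretation and evaluation: summarize each discovered segment so the
# patterns can support a human decision.
cleaned["segment"] = model.labels_
print(cleaned.groupby("segment")[["age", "income", "spend_per_visit"]].mean())
```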
Data mining is a field heavily influenced by traditional statistics, and most data-mining methods reveal a strong foundation of statistical and data analysis techniques. Traditional data-mining techniques include classification, clustering, outlier analysis, sequential patterns, time series analysis, prediction, regression, link analysis (associations), and multidimensional methods such as online analytical processing (OLAP). These techniques are categorized and summarized in Table 1 (Goebel & Gruenwald, 1999); a brief illustration of two of them follows the table.
| TECHNIQUE | DESCRIPTION |
|---|---|
| Predictive modeling | Predict the value of a specific data-item attribute |
| Characterization and descriptive data mining | Data distribution, dispersion, and exceptions |
| Association, correlation, and causality analysis (link analysis) | Identify relationships between attributes |
| Classification | Determine the class to which a data item belongs |
| Clustering and outlier analysis | Partition a set into classes so that items with similar characteristics are grouped together |
| Temporal and sequential pattern analysis | Trend and deviation, sequential patterns, periodicity |
| OLAP (OnLine Analytical Processing) | OLAP tools enable users to analyze different dimensions of multidimensional data, e.g., time series and trend analysis views |
| Model visualization | Make discovered knowledge easily understood using charts, plots, histograms, and other visual means |
| Exploratory Data Analysis (EDA) | Explore a data set without strong dependence on assumptions or models; the goal is to identify patterns in an exploratory manner |
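To make two of the entries in Table 1 concrete, the short sketch below runs a classification pass and a clustering pass over scikit-learn's bundled iris data set; the choice of library and data set is an illustrative assumption, not one named in the text above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Classification: determine to which class a data item belongs.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Clustering: partition the set so that similar items are grouped together.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
```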
In addition, the broad field of data mining encompasses not only statistical techniques but also related technologies such as data warehousing, along with the many software packages and languages developed for mining data. These packages and languages include DBMiner, IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS SQL Server 2000, BlueMartini, MineIt, DigiMine, and MS OLEDB for Data Mining (Goebel & Gruenwald, 1999).

Data warehousing complements data mining in that the data stored in a data warehouse is organized in a form suitable for analysis using data-mining methods. A data warehouse is a central repository for the data that an enterprise's various business systems collect, and it is typically housed on an enterprise server. Data from various online transaction processing (OLTP) applications and other sources are extracted and organized in the data warehouse database for use by analytical applications, user queries, and data-mining operations. Data warehousing focuses on the capture of data from diverse sources for useful analysis and access, whereas a data mart emphasizes the point of view of the end user or knowledge worker who needs access to specialized, but often local, databases (Delmater & Hancock, 2001; Han & Kamber, 2001).
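The kind of multidimensional analysis an OLAP tool performs over warehouse data can be sketched as a simple roll-up. The sales table, dimensions, and figures below are entirely hypothetical, and pandas is assumed purely for illustration.

```python
import pandas as pd

# A tiny, made-up fact table of the sort a data warehouse might hold.
sales = pd.DataFrame({
    "region":  ["east", "east", "west", "west", "east", "west"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "product": ["A", "A", "A", "B", "B", "B"],
    "revenue": [100, 120, 90, 150, 80, 60],
})

# Roll revenue up along two dimensions (region by quarter), summing over
# product, which is the kind of view an OLAP tool would provide.
cube = sales.pivot_table(values="revenue", index="region",
                         columns="quarter", aggfunc="sum", margins=True)
print(cube)
```

Setting margins=True adds "All" totals for each dimension, mimicking the roll-up views mentioned in the OLAP entry of Table 1.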