relationships and distinctions between the information systems concepts of data warehousing and data mining, which combined with online analytical processing (OLAP) form the backbone of decision support capability in the database industry. Decision support applications impose different demands on OLAP database technology than those of the online transaction processing (OLTP) model that preceded it. Data mining with OLAP differs from OLTP querying in its use of multidimensional data models, different data query and analysis tools at both the user-facing front end and the database back end, and different mechanisms for data extraction and preparation before loading into a data warehouse can take place. The construction of data warehouses entails the operations of data cleaning and data integration, which are key pre-processing steps for enabling data mining. Furthermore, the concept of metadata (data about data) is essential to the functioning of a data warehouse, and must be managed appropriately for an effective and efficient installation (Chaudhuri et al., 1997).
The major commercial players in the data warehousing market today include IBM, Oracle-Sun, Teradata and Microsoft. Data mining functionality is typically included within the data warehousing vendor's software suite. Some vendors have specialized further by creating product suites sold as data warehousing appliances. These consist of an integrated, pre-packaged combination of server and storage hardware, with pre-installed operating system and relational database software that has been optimized for typical medium to large scale customer implementations (Microsoft, 2008).
Gartner (2008) predicted that a fifth of all organizations worldwide would have customized software-as-a-service (SaaS) applications created to supplement their business intelligence operations by 2010. The value-added business of information aggregators is to provide domain-specific analysis capability using competitive business information as a base. This by its nature tends to generate monopolies in vertical information domains, due to the need for aggregators to ensure the confidentiality and secure protection of their clients' sensitive business data. Without proper integration with the proprietary internal information stored in data warehouses, customized SaaS-based tools cannot generate the benefits they are expected to provide.
Data warehousing may be defined in its simplest form as "a process of centralized data management and retrieval" (Palace, 1996). Ideally, a data warehouse is the centralized repository of all of an organization's data, made available for users to access and analyze according to their individual needs through the process of data mining. It provides the tools and mechanisms for business executives to systematically organize, comprehend, and utilize their data to make strategic decisions. In recent years, with competition mounting in every industry, data warehousing has become an essential method for organizations to retain customers by learning more about their needs using a solid platform of consolidated historical data and powerful analysis and mining tools (Berson et al., 1997).
Data mining refers to the analysis, categorization, and summarization of data from multiple angles or dimensions. Palace (1996) defines data mining as "the process of finding correlations or patterns among dozens of fields in large relational databases." The relationships, associations, historical patterns, and future trends extracted from data in the database are what constitute useful information or knowledge to the user. Data mining was initially used and promoted by consumer-oriented organizations that needed to deal with large volumes of data related to their business, finances, and customers, so as to be able to effectively design and price their products to address competition and meet customer priorities.
Douq (2009) outlines the set of marketing criteria that are most often addressed by vendors of data warehouse products in comparing their own offerings with those of competing providers. Physical architecture and design, scalability, parallelism, performance and optimization, system availability, and ease of operations and management are the subjects most frequently discussed and debated by vendors and analysts in industry circles.
It is useful to distinguish commercial relational databases from the multidimensional database structures used in data mining and warehousing. Traditional relational databases emphasize the operation of normalization (minimizing data redundancy), and are specifically tuned and organized to permit ad-hoc queries upon normalized data stored in tables and indexes. Multidimensional databases organize data in the form of data "cubes," which can be visualized as data sets and subsets implemented in array structures. A data cube consists of a large set of facts or measures, along with a number of associated dimensions. Dimensions are hierarchical entities that the organization wants to record and keep information about (Berson et al., 1997). For example, a 3-D data cube could display the measure of sales dollars along the dimensions of city, product, and month sold. A 4-D data cube could add the dimension of year sold to the original three. Figure 1 provides a simplified example of the 3-D case illustrating the conceptual model.
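The cube concept above can be sketched in a few lines of code. The following is a minimal illustration, not any vendor's implementation: cell coordinates along the city, product, and month dimensions map to a sales-dollar measure, and a roll-up collapses whichever dimensions are not kept. All cities, products, and figures are invented for the sketch.

```python
from collections import defaultdict

# Illustrative fact data: (city, product, month) -> sales dollars.
# All values are invented for this example.
facts = {
    ("Chicago", "Laptop", "Jan"): 1200.0,
    ("Chicago", "Laptop", "Feb"): 900.0,
    ("Chicago", "Phone",  "Jan"): 450.0,
    ("Boston",  "Laptop", "Jan"): 700.0,
    ("Boston",  "Phone",  "Feb"): 300.0,
}

DIMS = ("city", "product", "month")

def roll_up(facts, keep):
    """Aggregate the cube over every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(float)
    for coords, measure in facts.items():
        out[tuple(coords[i] for i in idx)] += measure
    return dict(out)

# Collapse product and month: total sales per city.
print(roll_up(facts, keep=("city",)))
# Keep two dimensions: sales per (city, product) pair.
print(roll_up(facts, keep=("city", "product")))
```

A fourth dimension (year sold) would simply extend each coordinate tuple; the same roll-up logic applies unchanged.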
Figure 1. OLAP Cube (Microsoft TechNet, 2011)
Unlike traditional relational database implementations, data may be repeated or reorganized extensively within a multidimensional database to meet the needs of faster search and query operations. Therefore, the needs of data warehouses are most compatible with data mining operations carried out on multidimensional databases (Palace, 1996). Data warehouses commonly utilize a three-tier architecture. The first or bottom tier is the data warehouse database server's relational database system. The second or middle tier is an OLAP server implementing the multidimensional OLAP database functionality. The third or top tier is a client layer providing the user-facing query and reporting tools used for mining the data warehouse (Berson et al., 1997).
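The three tiers can be sketched schematically as follows. This is a hypothetical toy model, not a real product stack: the client tool asks the OLAP middle tier for a summary, and the OLAP tier answers by aggregating rows scanned from the bottom-tier relational store.

```python
class WarehouseDB:                      # bottom tier: relational store
    def __init__(self, rows):
        self.rows = rows                # list of (city, amount) tuples
    def scan(self):
        return list(self.rows)

class OlapServer:                       # middle tier: multidimensional view
    def __init__(self, db):
        self.db = db
    def total_by_city(self):
        totals = {}
        for city, amount in self.db.scan():
            totals[city] = totals.get(city, 0) + amount
        return totals

class ClientTool:                       # top tier: query/reporting front end
    def __init__(self, olap):
        self.olap = olap
    def report(self):
        return sorted(self.olap.total_by_city().items())

db = WarehouseDB([("Boston", 10), ("Boston", 5), ("Chicago", 7)])
print(ClientTool(OlapServer(db)).report())
```

In a real installation each tier is a separate networked system; the point of the sketch is only the division of responsibilities among the three layers.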
Two leading commercial providers of data warehousing and data mining functionality are Oracle Corporation and NCR Teradata. Both solutions are based upon relational database management systems (RDBMS) at their core. However, their origins, implementation specifics, and performance characteristics have significant differences. Oracle's database was originally developed for the traditional online transaction processing (OLTP) market, then gradually evolved to incorporate data warehousing and mining capabilities through its online analytical processing (OLAP) offerings. OLAP functionality is encompassed within the larger Business Intelligence (BI) disciplines, and includes both relational queries and data mining functions to produce output reports oriented to the business functions of finance, marketing, and management. Oracle's OLAP implementation deals effectively with multi-dimensional data by using algorithms optimized to handle rapid drill-down and aggregation in large data sets. This enables the Oracle data warehouse system to respond to complex information queries that may be posed in different ways from different angles (Douq, 2009).
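Drill-down and its inverse, roll-up, operate along a dimension hierarchy. The following generic illustration (it makes no claim about Oracle's internal algorithms) uses a hypothetical time hierarchy in which months roll up to quarters:

```python
from collections import defaultdict

# Hypothetical time hierarchy: month -> quarter. A drill-down moves from
# the quarter level back to the month detail; a roll-up does the reverse.
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

# Invented month-level sales figures.
sales_by_month = {"Jan": 100.0, "Feb": 80.0, "Mar": 120.0, "Apr": 60.0}

def roll_up_to_quarter(by_month):
    """Aggregate month-level figures up to the quarter level."""
    out = defaultdict(float)
    for month, amount in by_month.items():
        out[MONTH_TO_QUARTER[month]] += amount
    return dict(out)

print(roll_up_to_quarter(sales_by_month))  # coarse, quarter-level view
print(sales_by_month)                      # drilled-down month detail
```

Commercial OLAP engines precompute and cache such aggregates at several hierarchy levels so that either view can be answered without rescanning the base data.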
Teradata is generally acknowledged to be the original large scale data warehouse offering. It originated as part of NCR Corporation, and formally separated into its own entity in 2007. The Teradata relational database was created and architected from its earliest beginnings for optimized information retrieval. As such, it is arguably faster and more efficient for certain "pure" data warehousing implementations than Oracle (Douq, 2009).
At a smaller scale, data warehousing and mining capability can also be created using desktop tools such as Oracle MySQL, or Microsoft Access and Microsoft Excel spreadsheets. With the Microsoft product suite, features such as pivot tables, fact tables, and the Query-By-Example function enable search and indexing with practical performance on databases of over a million records, while bypassing the more sophisticated Structured Query Language (SQL) programming commonly required by commercial RDBMS products (Microsoft Corporation, 2009).
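The pivot-table operation those desktop tools provide is a cross-tabulation: one field supplies the rows, another the columns, and a third is summed in each cell. A minimal sketch, with invented order records standing in for data exported from Access or Excel:

```python
from collections import defaultdict

# Hypothetical order records, as one might export from a desktop database.
orders = [
    {"region": "East", "product": "Laptop", "amount": 100},
    {"region": "East", "product": "Phone",  "amount": 40},
    {"region": "West", "product": "Laptop", "amount": 70},
    {"region": "East", "product": "Laptop", "amount": 30},
]

def pivot(rows, row_key, col_key, value):
    """Cross-tabulate: row_key down the side, col_key across the top,
    with the value field summed in each cell."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[row_key]][r[col_key]] += r[value]
    return {k: dict(v) for k, v in table.items()}

print(pivot(orders, "region", "product", "amount"))
```

Swapping the row and column fields, or substituting a different value field, re-slices the same data without any SQL, which is the appeal of these tools for small-scale mining.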
How effectively a vendor or small business is able to integrate the operations of warehousing and mining of data is a key determinant of not only its competitive strength, but also the type of target implementations where a satisfactory outcome is most likely to result for the end customer. As such, the strategic business intelligence derived from data warehousing and data mining has become a management tool of critical importance to gaining and retaining competitive advantage (King, 2009).
IBM made a strategic entry into the commercial data warehouse appliance space with its acquisition of Netezza as a subsidiary in 2010. Netezza-based appliances feature a proprietary hardware and software implementation called Asymmetric Massively Parallel Processing (AMPP). This architecture incorporates rack-mounted blade-format servers and disk storage, with a hardware-based data filtering component using field-programmable gate arrays (FPGA). Following IBM's acquisition of the ten-year-old Netezza technology, it has modified the TwinFin standard configuration to exchange processing modules for additional disk storage within the same two- or four-rack assembly, to offer a "near-line" data warehouse appliance (Prickett Morgan, 2010, 2011). Figure 2 illustrates a typical example of a large-scale, commercial data warehouse appliance product, the IBM Netezza.
Figure 2. Typical data warehouse appliance (Prickett Morgan, 2011).
The technical implementation of a data warehouse RDBMS can differ substantially from a standard commercial implementation. For example, data warehouses are designed to optimize the speed of the complex data retrieval queries involved in data mining. To accomplish this, a data warehouse RDBMS may store multiple copies of the same data at different levels of granularity using a technique called aggregation. De-normalization of data (that is, the use of data repetition and grouping) is common for read-intensive database applications to ensure adequate query response times. Without de-normalization, performance can be seriously hindered by the overhead involved in accessing normalized logical views or join tables across multiple physical data files (Shin…
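The trade-off described above can be made concrete with a small sketch (invented data, illustrative only): in the normalized layout every query must join orders back to the customer table, while the de-normalized copy repeats the customer name in each row so reads avoid the lookup entirely.

```python
# Normalized schema: customers and orders held separately, related by
# customer_id and joined at query time (illustrative data only).
customers = {1: "Acme", 2: "Globex"}
orders = [(1, 250.0), (1, 120.0), (2, 90.0)]      # (customer_id, amount)

def total_by_name_joined():
    """Answer the query via a join-style lookup on every row."""
    totals = {}
    for cust_id, amount in orders:
        name = customers[cust_id]                  # the "join" step
        totals[name] = totals.get(name, 0.0) + amount
    return totals

# De-normalized, read-optimized copy: the customer name is repeated in
# every order row, trading storage for faster reads.
orders_denorm = [("Acme", 250.0), ("Acme", 120.0), ("Globex", 90.0)]

def total_by_name_denorm():
    """Same query, no lookup needed."""
    totals = {}
    for name, amount in orders_denorm:
        totals[name] = totals.get(name, 0.0) + amount
    return totals

print(total_by_name_joined())
```

Both paths return identical answers; the de-normalized copy simply pays in redundant storage and update complexity for the read speed a warehouse workload demands.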