Business Datamining and Warehousing
P.O. Box 342-01000 Thika
Email: Info@mku.ac.ke
Web: www.mku.ac.ke
Study Units
The study units in this course are as follows:
1.0 Introduction
2.0 Objectives
3.0 Definition of Data Mining
3.1 Data Mining and Knowledge Discovery in Databases (KDD)
3.2 Data Mining and On-Line Analytical Processing (OLAP)
3.3 The Evolution of Data Mining
3.4 Scope of Data Mining
3.5 Architecture for Data Mining
3.6 How Data Mining Works
4.0 Conclusion
5.0 Summary
6.0 Tutor Marked Assignment
7.0 Further Reading and Other Resources
1.0 Introduction
From time immemorial, humans have manually extracted hidden predictive patterns
from data, but the increasing volume of data in modern times requires an automated approach.
Data mining provides a powerful new technology with great potential to help private and
public organizations focus on the most important information in their databases. Data mining
is the result of a long process of research and product development, and its primary purpose
is not only to uncover hidden patterns in databases but also to support the collection,
management, analysis and prediction of data.
The term data mining derives its name from the similarity between searching for valuable
information in a large database and mining a mountain for a vein of valuable ore: both
processes require either sifting through an immense amount of material or intelligently
probing it to find exactly where the value resides. This unit examines the meaning of data
mining, the difference between it and knowledge discovery in databases (KDD), the evolution
of data mining, its scope, its architecture and how it works.
2.0 Objectives
At the end of this unit, you should be able to:
v Define the term data mining
v Differentiate between data mining and knowledge discovery in databases (KDD)
v Understand the difference between data mining and OLAP
v Understand the evolution of data mining
v Know the scope of data mining
v Understand the architecture of data mining
v Understand how data mining works
Data mining is a cooperative effort of humans and computers: humans design the
databases, describe the problems and set goals, while computers sort through the data and
search for patterns that match those goals.
The term KDD was first coined by Gregory Piatetsky-Shapiro in 1989 to describe the process
of searching for interesting, interpretable, useful and novel patterns in data. Reflecting the
v Data Integration: In this phase, multiple data sources, which are often heterogeneous,
are combined into a common source.
v Data Selection: The data that is relevant to the analysis is decided upon and retrieved
from the data collection at this stage.
v Pattern Evaluation: At this stage, patterns that are very interesting and represent
knowledge are identified based on given measures.
v Knowledge Representation: This is the final stage of the KDD process in which the
discovered knowledge is visually represented to the user. Visualization techniques are
used to assist the users to have a better understanding and interpret the data mining
results.
It is common to combine some of these steps; for instance, data cleaning and data
integration can be performed together as a pre-processing phase to generate a data warehouse.
Also, data selection and data transformation can be combined, where the consolidation of the
data is the result of the selection or, as in the case of data warehouses, the selection is done
on transformed data.
KDD is an iterative process and can contain loops between any two steps. Once knowledge is
discovered and presented to the user, the evaluation measures can be enhanced, the mining
can be further refined, new data can be selected or further transformed, or new data sources
can be integrated, in order to get different and more appropriate results.
OLAP is part of a spectrum of decision support tools. Unlike traditional query and report
tools that describe what is in a database, OLAP goes further to answer why certain things are
true. The user forms a hypothesis about a relationship and verifies it with a series of queries
against the data. For example, an analyst may want to determine the factors that lead to loan
defaults. He or she might initially hypothesize that people with low incomes are bad credit
risks and analyze the database with OLAP to verify or disprove that assumption. If that
hypothesis were not borne out by the data, the analyst might then look at high debt as the
determinant of risk. If the data does not support this guess either, he or she might then try
debt and income together as the best predictor of bad credit risks (Two Crows Corporation,
2005).
In other words, OLAP is used to generate a series of hypothetical patterns and relationships,
and queries against the database are used to verify or disprove them. OLAP analysis is
basically a deductive process. But when the number of variables to be analyzed becomes
voluminous, it becomes much more difficult and time-consuming to find a good hypothesis
and to analyze the database with OLAP to verify or disprove it.
Data mining is different from OLAP; unlike OLAP, which verifies hypothetical patterns, it
uses the data itself to uncover such patterns and is basically an inductive process. For
instance, suppose an analyst who wants to identify the risk factors for loan default uses a
data mining tool. The data mining tool may discover that people with high debt and low
incomes are bad credit risks, and it may go further and discover a pattern the analyst had not
considered, namely that age is also a determinant of risk.
Data mining and OLAP complement each other. Before acting on the pattern, the analyst
needs to know what the financial implications would be of using the discovered pattern to
govern who gets credit, and OLAP tools allow the analyst to answer these kinds of questions.
OLAP is also complementary in the early stages of the knowledge discovery process.
Data mining is a natural development of the increased use of computerized databases to store
data and provide answers to business analysts. Traditional query and report tools have been
used to describe and extract what is in a database. Data mining is ready for application in the
business community because it is supported by technologies that are now sufficiently mature:
v Massive data collection
v Powerful multiprocessor computers
In the evolution of data mining from business data to business information, each new step has
built upon the previous one. For example, the four steps listed in table 1.1 were revolutionary
because they allowed new business questions to be answered accurately and quickly.
The core components of data mining technology have been under development for decades
in research areas such as statistics, artificial intelligence and machine learning. Today, the
maturity of these techniques, coupled with high-performance relational database engines and
broad data integration efforts, makes these technologies practical for current data warehouse
environments.
Given a database of sufficient size and quality, data mining technology can generate new
business opportunities by providing the following capabilities:
v Automated Prediction of Trends and Behaviors: Data mining automates the process
of searching for predictive information in large databases. Questions that would
traditionally require extensive hands-on analysis can now be answered directly from the
data very quickly. An example of a predictive problem is targeted marketing. Data
mining uses data on past promotional mailings to identify the targets most likely to
maximize the return on investment of future mailings. Other predictive problems include
forecasting bankruptcy and other forms of default, and identifying segments of a
population likely to respond similarly to given events.
Data mining techniques can yield the benefits of automation on existing software and
hardware platforms and can be implemented on new systems as existing platforms are
upgraded and new products developed. When data mining tools are implemented on high
performance parallel processing systems, they can analyze massive databases in minutes.
Faster processing means that users can automatically experiment with more models to
understand complex data. High speed makes it practical for users to analyze huge quantities
of data. Larger databases in turn yield improved predictions.
The ideal starting point is a data warehouse that contains a combination of internal data
tracking all customer contacts coupled with external market data about competitor activity.
The background information on potential customers also provides an excellent basis for
prospecting. The warehouse can be implemented in a variety of relational database systems
(Sybase, Oracle, Redbrick and so on) and should be optimized for flexible and fast data access.
new decisions and results, the organization can continually mine the best practices and apply
them to future decisions.
This design represents a fundamental shift from conventional decision support systems.
Rather than simply delivering data to the end user through query and reporting software, the
Advanced Analysis Server applies users' business models directly to the warehouse and
returns a proactive analysis of the most relevant information. These results enhance the
metadata in the OLAP server by providing a dynamic metadata layer that represents a
distilled view of the data. Other analysis tools can then be applied to plan future actions and
confirm the impact of those plans (An Introduction to Data Mining).
For example, as the marketing director of a telecommunications company you have access to
a lot of information, such as the age, sex, credit history, income, zip code and occupation of
all your customers, but it is difficult to discern the common characteristics of your best
customers because there are so many variables. From the existing database of customers that
contains this information, data mining tools such as neural networks can be used to identify
the characteristics of those customers that make a lot of long-distance calls. This then
becomes the director's model for high-value customers, and marketing efforts can be budgeted
accordingly.
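As a rough illustration of this idea, the sketch below trains a small neural network classifier (scikit-learn's MLPClassifier) on synthetic customer records to flag likely high-value customers. The feature names, the data and the threshold used to label customers are invented assumptions, not part of the original text.

```python
# Hypothetical sketch: flag likely high-value customers from a few
# demographic fields using a small neural network. Data are synthetic.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(18, 80, n),           # age (invented)
    rng.normal(50_000, 15_000, n),     # income (invented)
    rng.integers(1, 120, n),           # tenure in months (invented)
])
# Invented label: 1 = heavy long-distance caller, 0 = otherwise
y = (X[:, 1] + 200 * X[:, 2] + rng.normal(0, 10_000, n) > 70_000).astype(int)

X_scaled = StandardScaler().fit_transform(X)       # scale features for the net
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000, random_state=0)
net.fit(X_scaled, y)
print("training accuracy:", net.score(X_scaled, y))
```

In practice the trained model would then be applied to the full customer database to rank prospects for the marketing campaign.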
4.0 Conclusion
With the introduction of data mining technology, individuals and organizations can uncover
hidden patterns in their data which they can use to predict the behaviour of customers,
products and processes.
5.0 Summary
In this unit we have learnt that:
v Data mining is the process of extracting hidden and predictive information from large
databases or data warehouse.
Usama, F., Gregory, P., and Padhraic, S., From Data Mining to Knowledge Discovery in
Databases, Article of American Association for Artificial Intelligence Press, (1996).
Jeffrey W. Seifert, (Dec. 2004), Data Mining: An Overview . From: Congressional Research
Service, The Library of Congress.
Leon, A. and Leon, M. (1999), Fundamentals of Information Technology . New Delhi: Leon
Press Channel and Vikas Publishing House Pvt Ltd.
1.0 Introduction
2.0 Objectives
3.0 Types of Information Collected
3.1 Types of Data to Mine
3.1.1 Flat Files
3.1.2 Relational Databases
3.1.3 Data Warehouses
3.1.4 Transaction Databases
3.1.5 Spatial Databases
3.1.6 Multimedia Databases
3.1.7 Time-Series Databases
3.1.8 World Wide Web
3.2 Data Mining Functionalities
3.2.1 Classification
3.2.2 Characterization
3.2.3 Clustering
3.2.4 Prediction (or Regression)
3.2.5 Discrimination
3.2.6 Association Analysis
3.2.7 Outlier Analysis
3.2.8 Evolution and Deviation Analysis
4.0 Conclusion
5.0 Summary
6.0 Tutor Marked Assignment
7.0 Further Reading and Other Resources
1.0 Introduction
In actual sense data mining is not limited to one type of media or data but applicable to any
kind of information available in the repository. A repository is a location or set of locations
where systems analysts, system designers and system builders keep the documentation
associated with one or more systems or projects.
But before we begin to explore the different kinds of data to mine it will be interesting to
familiarize ourselves with the variety of information collected in digital form in databases
and flat files. Also to be explored are the types to mine and data mining functionalities.
2.0 Objectives
At the end of this unit, you should be able to:
v Know the different kinds of information collected in our databases
v Describe the types of data to mine
v Explain the different kinds of data mining functionalities and the knowledge they
discover
(3). Games
The rate at which our society gathers data and statistics about games, players and athletes is
tremendous. These range from car racing, swimming and hockey scores to football and
basketball passes, chess positions and boxers' punches; all these data are stored. Trainers and
athletes make use of this data to improve their performance and to understand their opponents
better, while journalists and commentators use this information in their reporting.
The most commonly used query language for relational databases is Structured Query
Language (SQL); it allows the retrieval and manipulation of data stored in the tables as well
as the calculation of aggregate functions such as SUM, MIN, MAX and COUNT. Data mining
algorithms that use relational databases can be more versatile than algorithms designed
specifically for flat files, because they can take advantage of the structure inherent in
relational databases and can benefit from SQL for data selection, transformation and
consolidation. At the same time, data mining goes beyond what SQL can provide, such as
predicting, comparing and detecting deviations.
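The short sketch below runs the aggregate functions mentioned above against a small in-memory relational table using Python's standard sqlite3 module; the table and column names are invented for illustration.

```python
# Aggregate SQL functions (COUNT, SUM, MIN, MAX) over a tiny invented table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 120.0), (1, 80.0), (2, 200.0), (3, 40.0)])

row = conn.execute(
    "SELECT COUNT(*), SUM(amount), MIN(amount), MAX(amount) FROM sales"
).fetchone()
print("count, sum, min, max:", row)

# Per-customer consolidation, the kind of work a mining tool can delegate
# to SQL before applying its own algorithms.
for customer_id, total in conn.execute(
        "SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id"):
    print(customer_id, total)
```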
reference all management systems using designated technology suitable for corporate
database management e.g. Sybase, Ms SQL Server.
Rentals
Transaction ID   Date       Time    Customer ID   Item List
T1               14/09/04   14.40   12            10, 11, 30, 110, ...
Figure 1.1 represents a transaction database; each record shows a rental contract with a
customer identifier, a date and a list of items rented. Because relational databases do not
allow nested tables (that is, a set as an attribute value), transactions are usually stored in flat
files or in two normalized transaction tables, one for the transactions and the other for the
transaction items. A typical data analysis on such data is the so-called market basket analysis
or association rule mining, in which items occurring together or in sequence are studied.
how and when the resources are accessed. A fourth dimension can be added relating the
dynamic nature or evolution of the documents. Data mining in the World Wide Web, or web
mining, addresses all these issues and is often divided into web content mining and web
usage mining.
The data mining functionalities and the variety of knowledge they discover are briefly
described in this section. These are as follows:
3.2.1 Classification
This is also referred to as supervised classification and is a learning function that maps (i.e.
classifies) items into several given classes. Classification uses given class labels to order the
objects in the data collection. Classification approaches normally make use of a training set
in which all objects are already associated with known class labels. The classification
algorithm learns from the training set and builds a model which is used to classify new
objects. Examples of classification applications in data mining include the classification of
trends in financial markets and the automated identification of objects of interest in large
image databases. Figure 2.2 shows a simple partitioning of the loan data into two class
regions; this may be done imperfectly using a linear decision boundary. The bank may use
the classification regions to automatically decide whether future loan applicants will be given
a loan or not.
[Figure: scatter plot of the loan data set, with Income on the horizontal axis and Debt on the
vertical axis; a straight line separates the region of x cases from the region of o cases.]
Figure 2.2 A Simple Linear Classification Boundary for the Loan Data Set
Source: From Data Mining to Knowledge Discovery in Databases, Ussama, F. et al page 44
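A minimal sketch of such a linear classification boundary follows, fitting a logistic regression to synthetic income/debt data; the data and the use of scikit-learn are illustrative assumptions rather than the actual loan data set of the figure.

```python
# Fit a straight-line decision boundary to synthetic debt/income data,
# in the spirit of Figure 2.2.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200
income = rng.normal(50, 15, n)
debt = rng.normal(20, 8, n)
# Synthetic label: high debt relative to income -> bad risk (1)
bad_risk = (debt - 0.4 * income + rng.normal(0, 3, n) > 0).astype(int)

X = np.column_stack([income, debt])
clf = LogisticRegression().fit(X, bad_risk)

# The fitted coefficients define the straight line that partitions the plane,
# which a bank could use to screen future applicants.
print("boundary: %.2f*income + %.2f*debt + %.2f = 0"
      % (clf.coef_[0, 0], clf.coef_[0, 1], clf.intercept_[0]))
print("predicted class for income=60, debt=10:", clf.predict([[60, 10]])[0])
```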
3.2.2 Characterization
Data characterization is also called summarization and involves methods for finding a
compact description (the general features) of a subset of data or target class; it produces what
are called characteristic rules. The data relevant to a user-specified class are normally
retrieved by a database query and run through a summarization module to extract the essence
of the data at different levels of abstraction. A simple example would be tabulating the mean
and standard deviation of all fields. More sophisticated methods involve the derivation of
summary rules (Usama et al. 1996; Agrawal et al. 1996), multivariate visualization
techniques and the discovery of functional relationships between variables. Summarization
techniques are often applied to interactive exploratory data analysis and automated report
generation (Usama et al. 1996).
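As a minimal sketch of the simple example above (tabulating the mean and standard deviation of the fields for a user-specified target class), assuming pandas and invented rental-customer fields:

```python
# Summarize numeric fields for a target class of invented rental customers.
import pandas as pd

df = pd.DataFrame({
    "segment": ["gold", "gold", "gold", "basic", "basic"],
    "rentals": [62, 75, 58, 8, 12],
    "spend":   [310.0, 420.5, 295.0, 40.0, 55.0],
})

target = df[df["segment"] == "gold"]      # data retrieved for the target class
print(target[["rentals", "spend"]].agg(["mean", "std"]))
```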
3.2.3 Clustering
Clustering is similar to classification in that it is the organization of data into classes. But
unlike classification, in clustering the class labels are not predefined (they are unknown), and
it is up to the clustering algorithm to discover acceptable classes. Clustering is also referred
to as unsupervised classification because the classification is not dictated by given class
labels. There are many clustering approaches, all based on the principle of maximizing the
similarity between objects in the same class (intra-class similarity) and minimizing the
similarity between objects of different classes (inter-class similarity).
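A brief clustering sketch follows, running k-means on two invented customer features; the features, data and choice of two clusters are assumptions made purely for illustration.

```python
# Group unlabeled customers so that intra-cluster similarity is high and
# inter-cluster similarity is low, using k-means on synthetic data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
low_spenders = rng.normal([200, 2], [50, 1], size=(100, 2))    # (spend, visits)
high_spenders = rng.normal([1500, 10], [200, 2], size=(100, 2))
X = np.vstack([low_spenders, high_spenders])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centres:\n", kmeans.cluster_centers_)
print("cluster of a customer spending 1400 with 9 visits:",
      kmeans.predict([[1400, 9]])[0])
```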
3.2.5 Discrimination
Data discrimination generates what we call discriminant rules and is basically the comparison
of the general features of objects between two classes, referred to as the target class and the
contrasting class. For instance, we may want to compare the rental customers that rented
more than 50 movies last year with those whose rental count is lower than 10. The techniques
used for data discrimination are similar to those used for data characterization, with the
exception that data discrimination results include comparative measures.
is used to pinpoint association rules. Association analysis is commonly used for market
basket analysis because it searches for relationships between variables. For example, a
supermarket might gather data on what each customer buys. With the use of association rule
learning, the supermarket can work out which products are frequently bought together, which
is useful for marketing purposes. This is sometimes called market basket analysis.
4.0 Conclusion
Data mining therefore is not limited to one media or data; it is applicable to any kind of
information repository and the kind of patterns that can be discovered depend upon the data
mining tasks employed.
5.0 Summary
In this unit we have learnt that:
v Different kinds of information are often collected in digital form in our databases and
flat files; these include scientific data, personal and medical data, and games, among others.
v Data mining can be applied to any kind of information in the repository.
v Data mining systems allow the discovery of different kinds of knowledge and at
different levels of abstraction.
2. List and explain any five data mining functionalities and the variety of knowledge
they discover.
Mosud, Y. Olumoye (2009), Introduction to Data Mining and Data Warehousing, Lagos:
Rashmoye Publications
Usama, F., Gregory, P., and Padhraic, S., From Data Mining to Knowledge Discovery in
Databases, Article of American Association for Artificial Intelligence Press, (1996).
Jeffrey W. Seifert, (Dec. 2004), Data Mining: An Overview . From: Congressional Research
Service, The Library of Congress.
Leon, A. and Leon, M. (1999), Fundamentals of Information Technology . New Delhi: Leon
Press Channel and Vikas Publishing House Pvt Ltd.
1.0 Introduction
2.0 Objectives
3.0 Classification of Data Mining Systems
3.1 Data Mining Tasks
3.2 Data Mining Issues
3.2.1 Security and Social Issues
3.2.2 Data Quality
3.2.3 User Interface Issues
3.2.4 Data Source Issues
3.2.5 Performance Issues
3.2.6 Interoperability
3.2.7 Mining Methodology Issues
3.2.8 Mission Creep
3.2.9 Privacy
3.3 Data Mining Challenges
4.0 Conclusion
5.0 Summary
6.0 Tutor Marked Assignment
7.0 Further Reading and Other Resources
1.0 Introduction
There are many data mining systems available or presently being developed. Some are
specialized systems dedicated to a given data source or confined to limited data mining
functionalities, while others are more versatile and comprehensive. This unit examines the
various classifications of data mining systems, data mining tasks, and the major issues and
challenges in data mining.
2.0 Objectives
At the end of this unit, you should be able to:
v Understand the various classifications of data mining systems
v Describe the categories of data mining tasks
v Understand the diverse issues coming up in data mining
v Describe the various challenges facing data mining
2. Classification By the Data Model Drawn On: This class categorizes data mining
systems based on the data model involved such as relational database, object-oriented
database, data warehouse, transactional etc.
4. Classification By the Mining Techniques Used: Data mining systems employ and
provide different techniques. This class categorizes data mining systems according to the data
analysis approach used, such as machine learning, neural networks, genetic algorithms,
statistics, visualization, database-oriented or data warehouse-oriented. This class also takes
into account the degree of user interaction involved in the data mining process, such as query-
driven systems, interactive exploratory systems, or autonomous systems. A comprehensive
system would provide a wide variety of data mining techniques to fit different situations and
options, and offer different degrees of user interaction.
2. Clustering
Clustering is similar to classification but the groups are not predefined, so the algorithms
will try to group similar items together.
3. Regression
This task attempts to find a function which models the data with the least error. A
common method is to use genetic programming.
knowledge discovered, some vital information could be withheld, while other information
may be widely distributed and used without control.
To improve the quality of data, it is often necessary to clean the data, which may involve
the removal of duplicate records, normalizing the values used to represent information in the
database (for example, ensuring that 'no' is represented as a 0 throughout the database, and
not sometimes as an O and sometimes as an N), accounting for missing data points, removing
unrequired data fields, identifying anomalous data points (e.g. an individual whose age is
shown as 215 years), and standardizing data formats (e.g. changing dates so they all follow
the format DD/MM/YYYY).
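A hedged sketch of these cleaning steps follows, using pandas on a few invented records; the column names, codes and validity rules are illustrative assumptions, not prescriptions.

```python
# Illustrative data-cleaning steps on invented records: duplicates, code
# normalization, missing values, anomalous ages and date standardization.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "smoker":      ["N", "N", "O", "no", None],
    "age":         [34, 34, 215, 52, 41],
    "joined":      ["14/09/2004", "14/09/2004", "01/02/2005",
                    "07/03/2005", "30/11/2005"],
})

df = df.drop_duplicates()                                   # remove duplicate records
df["smoker"] = df["smoker"].replace({"O": "N", "no": "N"})  # normalize inconsistent codes
df["smoker"] = df["smoker"].fillna("unknown")               # account for missing values
df = df[df["age"].between(0, 120)]                          # drop anomalous data points
df["joined"] = pd.to_datetime(df["joined"], format="%d/%m/%Y")  # standardize date format
print(df)
```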
The major issues related to user interfaces and visualization are screen real estate,
information rendering, and interaction. The interactivity of data and data mining results is
very vital, since it provides a means for the user to focus and refine the mining tasks, as well
as to view the discovered knowledge from different angles and at different conceptual levels.
Regarding the issues related to data sources, there is the subject of heterogeneous databases
and the focus on diverse complex data types. We store different types of data in a variety of
repositories. It is difficult to expect a data mining system to effectively and efficiently achieve
good mining results on all kinds of data and sources, as different types of data and sources
may require distinct algorithms and methodologies. Presently, there is a focus on
relational databases and data warehouses, but other approaches need to be pioneered for other
specific complex data types. Therefore the proliferation of heterogeneous data sources, at
structural and semantic levels, poses important challenges not only to the database
community but also to the data mining community.
Other topics that need to be considered under performance issues include completeness and
choice of samples, incremental updating and parallel programming. Parallelism can help
solve the size problem if the dataset can be subdivided and the results merged later, and
incremental updating is very important for merging results from parallel mining, or for
updating data mining results when new data becomes available, without necessarily
re-analyzing the complete dataset.
3.2.6 Interoperability
Data quality is related to the issue of interoperability of different databases and data mining
software. Interoperability refers to the ability of a computer system and data to work with
other systems or data using common standards or processes. It is a very critical part of the
larger effort to improve or enhance interagency collaboration and information sharing through
government and homeland security initiatives. In data mining, interoperability of databases
and software is important to enable the search and analysis of multiple databases
simultaneously and to help ensure the compatibility of the data mining activities of different
agencies.
Data mining projects that want to take advantage of existing legacy databases, or that try to
initiate first-time collaborative efforts with other agencies or levels of government, such as
the police, may experience interoperability problems. Also, as agencies advance with the
creation of new databases and information sharing efforts, they will need to address
interoperability issues during their planning stages to better ensure the effectiveness of their
data mining projects.
Moreover, different approaches may suit and solve users' needs differently. Most algorithms
used in data mining assume the data to be noise-free, which of course is a strong assumption.
Most datasets contain exceptions and invalid or incomplete information, which may
complicate, if not obscure, the analysis process and in many cases compromise the accuracy
of the results. Consequently, data preprocessing (i.e. data cleaning and transformation)
becomes very essential. Data cleaning and preprocessing are often seen as time-consuming
and frustrating, but they form one of the most important phases in the knowledge discovery
process. Data mining techniques should be able to handle noise in data and incomplete
information.
Similarly, government officials responsible for ensuring the safety of others may be pressured
to use or combine existing databases to identify potential threats. Unlike physical searches or
the detention of individuals, accessing information for purposes other than those originally
intended may appear to be a victimless or harmless exercise. However, such information use
can lead to unintended outcomes and produce misleading results.
One of the primary reasons for misleading results is inaccurate data. All data collection
efforts suffer accuracy concerns to some degree. Ensuring the accuracy of information can
require costly protocols that may not be cost-effective if the data is not of inherently high
economic value (Jeffrey W. Seifert, 2004).
3.2.9 Privacy
Privacy focuses on both actual projects proposed as well as concerns about the potential for
data mining applications to be expanded beyond their original purposes (mission creep). As
additional information sharing and data mining initiatives have been announced, increased
attention has focused on the implications for privacy. For instance, some experts suggest that
anti-terrorism data mining applications might be used for combating other types of crime as
well. So far there has been little consensus about how data mining should be carried out, with
several competing points of view being debated. Some observers suggest that existing laws
and regulations regarding privacy protections are adequate and that these initiatives do not
pose any threat to privacy. Others argue that not enough is known about how data mining
projects will be carried out and that greater oversight is needed. As data mining efforts
continue to advance, Congress may consider a variety of questions, including the degree to
which government data should be used, whether data sources are being used for purposes
other than those for which they were originally designed, and the possible application of the
Privacy Act to these initiatives (Jeffrey W. Seifert, 2004).
4.0 Conclusion
Data mining systems can therefore be categorized into various groups using different criteria,
and there are four major classes of data mining tasks. Issues and challenges affecting the
effective implementation of data mining also have to be addressed in order to ensure a
successful exercise.
5.0 Summary
In this unit we have learnt that:
v Data mining systems can be categorized according to various criteria such as type of
data source mined, data model drawn, kind of knowledge discovered and the mining
techniques used.
v Data mining tasks can be grouped into four major classes; these classes are
classification, clustering, regression and association rule learning.
v There are many data mining issues affecting the implementation of data mining;
among these are security and social issues, data quality, user interface issues, data
source issues and performance issues, among others.
v There are many challenges facing data mining; among these are larger databases,
high dimensionality, and missing and noisy data.
2. List and explain any five data mining challenges affecting the implementation of data
mining.
Usama, F., Gregory, P., and Padhraic, S., From Data Mining to Knowledge Discovery in
Databases, Article of American Association for Artificial Intelligence Press, (1996).
Jeffrey W. Seifert, (Dec. 2004), Data Mining: An Overview . From: Congressional Research
Service, The Library of Congress.
Leon, A. and Leon, M. (1999), Fundamentals of Information Technology . New Delhi: Leon
Press Channel and Vikas Publishing House Pvt Ltd.
1.0 Introduction
2.0 Objectives
3.0 Data Mining Technologies
3.1 Neural Networks
3.2 Decision Trees
3.3 Rule Induction
3.4 Multivariate Adaptive Regression Splines
3.5 K-Nearest Neighbor and Memory-Based Reasoning
3.6 Genetic Algorithms
3.7 Discriminant Analysis
3.8 Generalized Additive Models
3.9 Boosting
3.10 Logistic Regression
4.0 Conclusion
5.0 Summary
6.0 Tutor Marked Assignment
7.0 Further Readings and Other Resources
1.0 Introduction
In this unit, we shall explore some types of models and algorithms used in mining data. Most
of the models and algorithms we shall discuss are generalizations of the standard workhorse
of modeling, although we should realize that no one model or algorithm can or should be
used exclusively. As a matter of fact, for any given problem the nature of the data itself will
affect the choice of models and algorithms, and there is no best model or algorithm: you will
need a variety of tools and technologies to find the best possible model.
2.0 Objectives
At the end of this unit, you should be able to:
v Understand the various data mining technologies available
Most of the products use variations of algorithms that have been published in statistics or
computer science journals, with their specific implementations customized to meet individual
vendors' goals. For instance, many vendors sell versions of the CART (Classification And
Regression Trees) or CHAID (Chi-Squared Automatic Interaction Detection) decision trees
with enhancements to work on parallel computers, while some vendors have proprietary
algorithms which, although not extensions or enhancements of any published approach, may
work well.
Some of the technologies or tools used in data mining that will be discussed are: neural
networks, decision trees, rule induction, multivariate adaptive regression splines (MARS),
k-nearest neighbor and memory-based reasoning (MBR), logistic regression, discriminant
analysis, genetic algorithms, generalized additive models (GAM) and boosting.
derive its meaning from complicated or imprecise data, and can be used to extract patterns
and detect trends that are too complex to be noticed by either humans or other computer
techniques. A trained neural network can be thought of as an expert in the category of
information it has been asked to analyze. This expert can then be used to provide projections
given new situations of interest and to answer 'what if' questions.
Neural networks have very wide applications to real-world business problems and have
already been implemented in many industries. Because neural networks are very good at
identifying patterns or trends in data, they are well suited for prediction or forecasting needs,
including the following:
v Sales forecasting
v Customer research
v Data validation
v Risk management
v Industrial process control
v Target marketing
Neural networks use a set of processing elements or nodes analogous to neurons in the human
brain. The nodes are interconnected in a network that can then identify patterns in data once
it is exposed to the data; that is to say, the network learns from experience, much as human
beings do. This makes neural networks different from traditional computing programs, which
simply follow instructions in a fixed sequential order.
The structure of a neural network is shown in figure 4.1. It starts with an input layer, where
each node corresponds to a predictor variable. These input nodes are connected to a number
of nodes in a hidden layer; each of the input nodes is connected to every node in the hidden
layer. The nodes in that hidden layer may be connected to nodes in another hidden layer or to
an output layer. The output layer consists of one or more response variables.
The commonest type of neural network is the feed-forward back-propagation network, and it
proceeds as follows:
v Feed forward: the value of the output node is calculated based on the input node
values and a set of initial weights. The values from the input nodes are combined in the
hidden layer, and the values of those nodes are combined to calculate the output value
(Two Crows Corporation). A small numeric sketch of this calculation is given below.
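The following is a minimal numeric sketch of the feed-forward step only, using NumPy with arbitrary made-up weights and a sigmoid combination function; in a real network these weights would be learned by back propagation.

```python
# Feed-forward pass: combine inputs with weights in a hidden layer, then
# combine the hidden-node values to produce the output value.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, 0.7, 0.1])        # one record: three predictor variables
W_hidden = np.array([[0.5, -0.3],    # arbitrary weights: 3 inputs -> 2 hidden nodes
                     [0.8,  0.2],
                     [-0.6, 0.9]])
W_output = np.array([1.2, -0.7])     # arbitrary weights: 2 hidden nodes -> 1 output

hidden = sigmoid(x @ W_hidden)       # combine inputs at the hidden layer
output = sigmoid(hidden @ W_output)  # combine hidden values at the output node
print("predicted output:", output)
```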
The problems associated with neural networks, as summed up by Arun Swami of Silicon
Graphics Computer Systems, are that the resulting network is viewed as a black box and that
no explanation of the results is given. This lack of explanation inhibits confidence, acceptance
and application of the results. Also, neural networks suffer from long learning times, which
become worse as the volume of data grows.
Figure 4.2 shows a simple decision tree that describes the weather at a given time while
illustrating all the basic components of a decision tree: the decision nodes, branches and
leaves. The objects contain information on the outlook, humidity, rain and so on. Some of the
objects are positive examples, denoted by P, while others are negative, denoted by N.
Classification in this case is the construction of a tree structure, illustrated in figure 4.2, which
can be used to classify all the objects correctly.
Decision tree models are commonly used in data mining to examine the data and to induce
the tree and its rules, which will then be used to make predictions. A number of different
algorithms may be used to build decision trees, including Chi-Squared Automatic Interaction
Detection (CHAID), Classification And Regression Trees (CART), QUEST and C5.0.
Decision trees grow through an iterative splitting of data into discrete groups, where the goal
is to maximize the distance between groups at each split. Decision trees that are used to
predict categorical variables are called classification trees because they place instances in
categories or classes, and those used to predict continuous variables are called regression
trees.
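As a rough sketch of tree induction, the snippet below fits a CART-style tree (scikit-learn's DecisionTreeClassifier) to a few invented weather records like those described for figure 4.2 and prints the induced rules; the feature encoding and the data are assumptions.

```python
# Induce a small classification tree from invented weather records.
from sklearn.tree import DecisionTreeClassifier, export_text

# Encoded features: outlook (0=sunny, 1=overcast, 2=rain), humidity (%), windy (0/1)
X = [[0, 85, 0], [0, 90, 1], [1, 78, 0], [2, 96, 0],
     [2, 80, 0], [2, 70, 1], [1, 65, 1], [0, 95, 0]]
y = ["N", "N", "P", "P", "P", "N", "P", "N"]   # P = positive, N = negative

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["outlook", "humidity", "windy"]))
print("classification of a new day:", tree.predict([[1, 70, 0]])[0])
```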
Just like most neural and decision tree algorithms, MARS has a tendency to overfit the
training data which can be addressed in two ways:
(i) Manual cross validation can be performed and the algorithms tuned to provide
prediction on the test set.
(ii) There are various tuning parameters in the algorithm itself that can guide internal
cross validation.
[Figure: a scatter of X and Y cases surrounding a new case N; an ellipse drawn around N
encloses seven X cases and two Y cases.]
Figure 4.3 K-nearest neighbor. N is a new case.
Source: Two Crows Corporation, 2005
It would be assigned to class X because the seven X's within the ellipse outnumber the two
Y's.
In order to apply k-NN, you must first of all find a measure of the distance between attributes
in the data and then calculate it. While this is easy for numerical data, categorical variables
need special handling. For example, what is the distance between blue and green? You must
then have a way of summing the distance measures for the attributes. Once you can calculate
the distance between cases, you then select the set of already classified cases to use as the
basis for classifying new cases, decide how large a neighborhood in which to do the
comparisons, and also decide how to count the neighbors themselves. For instance, you might
give more weight to nearer neighbors than to farther ones (Two Crows Corporation).
With k-NN, a large computational load is placed on the computer because the calculation
time increases rapidly with the total number of points. While it is a rapid process to apply a
decision tree or a neural net to a new case, k-NN requires that a new calculation be made for
each new case. To speed up k-NN, all the data is frequently kept in memory. Memory-based
reasoning usually refers to a k-NN classifier kept in memory (Two Crows Corporation).
K-NN models are very easy to understand when there are few predictor variables. They are
also useful for building models that involve non-standard data types, such as text. The only
requirement for being able to include a data type is the existence of an appropriate distance
metric.
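A short k-NN sketch follows, using scikit-learn's KNeighborsClassifier on invented numeric loan cases; plain Euclidean distance is assumed, so categorical attributes would need their own distance handling, as discussed above.

```python
# Classify a new case by the majority class of its k nearest classified cases.
from sklearn.neighbors import KNeighborsClassifier

# [income in thousands, debt in thousands] -- invented cases
X = [[25, 18], [30, 22], [70, 10], [85, 5], [40, 30], [90, 12]]
y = ["bad", "bad", "good", "good", "bad", "good"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print("new case [60, 15] classified as:", knn.predict([[60, 15]])[0])

# Giving more weight to nearer neighbours, as mentioned above:
knn_w = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)
print("distance-weighted prediction:", knn_w.predict([[60, 15]])[0])
```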
For instance, to build a neural net, genetic algorithms can replace back-propagation as a way
to adjust the weights; the chromosomes would contain the number of hidden layers and the
number of nodes in each layer. Genetic algorithms are an interesting approach to optimizing
models, but they add a lot of computational overhead.
With the use of computer power in place of theory or knowledge of the functional form,
GAM will produce a smooth curve that summarizes the relationship. As with neural nets,
large numbers of parameters are estimated; GAM goes a step further and estimates a value of
the output for each value of the input (one point, one estimate) and generates a curve
automatically, choosing the amount of complexity based on the data.
3.9 Boosting
The concept of boosting applies to the area of predictive data mining: multiple models or
classifiers are generated (for prediction or classification), and weights are derived to combine
the predictions from those models into a single prediction or predicted classification. If you
build a model using one sample of data, and then build a new model using the same algorithm
but on a different sample, you might get a different result. After validating the two models,
you could choose the one that best meets your objectives. Better results might be achieved if
several models are built and allowed to vote, making a prediction based on what the majority
recommends. Of course, any interpretability of the prediction would be lost, but the improved
results might be worth it.
Boosting is a technique first published by Freund and Schapire in 1996. It takes multiple
random samples from the data and builds a classification model for each; the training set is
changed based on the results of the previous models, and the final classification is the class
assigned most often by the models. The exact algorithms for boosting have evolved from the
original, but the underlying idea is the same. Boosting has become a very popular addition to
data mining packages.
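A hedged sketch of boosting follows, using scikit-learn's AdaBoostClassifier on synthetic data; the dataset and parameters are assumptions, and AdaBoost is only one of the boosting variants that have evolved from the original algorithm.

```python
# Boosting sketch: many weak models are built on reweighted samples and
# their votes are combined into a single prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X_train, y_train)
print("test accuracy of the boosted ensemble:", booster.score(X_test, y_test))
```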
4.0 Conclusion
Therefore, there is no one model or algorithm that should be used exclusively for data mining
since there is no best technique. Consequently one needs a variety of tools and technologies
in order to find the best possible model for data mining.
5.0 Summary
In this unit we have learnt that:
v There are various techniques or algorithms used for mining data; these include neural
networks, decision trees, genetic algorithms, discriminant analysis, rule induction and
k-nearest neighbor.
Mosud, Y. Olumoye (2009), Introduction to Data Mining and Data Warehousing, Lagos:
Rashmoye Publications
Usama, F., Gregory, P., and Padhraic, S., From Data Mining to Knowledge Discovery in
Databases, Article of American Association for Artificial Intelligence Press, (1996).
Jeffrey W. Seifert, (Dec. 2004), Data Mining: An Overview . From: Congressional Research
Service, The Library of Congress.
Data Mining. Retrieved on 29/07/2009. Available Online:
http://en.wikipedia.org/wiki/Data_mining.
Leon, A. and Leon, M. (1999), Fundamentals of Information Technology . New Delhi: Leon
Press Channel and Vikas Publishing House Pvt Ltd.
Module 1: Concepts of Data Mining
1.0 Introduction
Data preparation and preprocessing are often neglected but important steps in the data mining
process. The phrase 'garbage in, garbage out' (GIGO) is particularly applicable to data mining
and machine learning projects. Data collection methods are often loosely controlled, resulting
in out-of-range values (e.g. Income: =N= 400), impossible data combinations (e.g. Gender:
Male, Pregnant: Yes), missing values and so on. This unit examines the meaning of and
reasons for preparing and preprocessing data, data cleaning, data transformation, data
reduction and data discretization.
2.0 Objectives
At the end of this unit, you should be able to:
v Know the different data formats of an attribute
v Explain the meaning and importance of data preparation
v Define term data preprocessing
v Why data is being preprocessed
v Understand the various data pre-processing tasks
Data can also be classified as static or dynamic (temporal). Other types of data that we come
across in data mining applications are:
v Distributed data
v Textual data
v Web data (e.g. html pages)
v Images
v Audio /Video
v Metadata (information about the data itself )
Data preparation is also required when the data to be processed is in a raw format, e.g. pixel
format for images. Such data should be converted into appropriate formats that can be
processed by the data mining algorithms.
e.g. for the initial range [-991, 99], k = 3 and v = -991 becomes v' = -0.991.
3. Zero Mean Normalization: When you use this type of normalization, the mean of
the transformed set of data points is reduced to zero. For this, the mean and standard
deviation of the initial set of data values are required. The transformation formula is
v' = (v - mean_A) / std_dev_A
where mean_A and std_dev_A are the mean and standard deviation of the initial data
values. For example, if mean_Income = 54,000 and std_dev_Income = 16,000, then
v = 73,600 becomes v' = (73,600 - 54,000) / 16,000 = 1.225.
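A tiny sketch of the decimal-scaling and zero-mean transformations above, reusing the example values; the helper function names are invented.

```python
# Decimal-scaling and zero-mean (z-score) normalization.
import numpy as np

def decimal_scaling(v, k):
    # v' = v / 10**k, with k chosen so that max(|v'|) < 1
    return v / 10 ** k

def zero_mean(v, mean_a, std_dev_a):
    # v' = (v - mean_A) / std_dev_A
    return (v - mean_a) / std_dev_a

print(decimal_scaling(-991, k=3))                       # -0.991
print(round(zero_mean(73_600, 54_000, 16_000), 3))      # 1.225
incomes = np.array([30_000, 54_000, 73_600, 90_000])
print(np.round(zero_mean(incomes, incomes.mean(), incomes.std()), 3))
```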
Web usage data is collected in various ways, each mechanism collecting attributes relevant to
its purpose. There is a need to preprocess the data to make it easier to mine for knowledge.
Specifically, the following issues need to be addressed in data preprocessing:
Data which is inconsistent with our models should be dealt with. Common sense can also be
used to detect such kinds of inconsistency. Examples are:
v The same name occurring differently in an application
v Different names appearing to be the same (Dennis vs. Denis)
v Inappropriate values (males being pregnant, or a negative age)
v A particular bank database that had about 5% of its customers born on 11/11/11,
which is usually the default value for the birthday attribute.
How to correct inconsistent data:
v It is important to have data entry verification (check both the format and the values of
the data entered).
v Correct with the help of external reference data (look-up tables, e.g. of valid
city/postcode combinations) or rules (e.g. mapping Male/0 to M and Female/1 to F).
v Histogram analysis (see the sketch after this list):
- Equal-interval (equiwidth) binning: split the whole range of numbers into intervals of
equal size.
- Equal-frequency (equidepth) binning: use intervals containing an equal number of values.
v Segmentation by natural partitioning (partition into 3, 4 or 5 relatively uniform
intervals)
v Entropy (information) based discretization
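The sketch below contrasts the two binning strategies using pandas' cut and qcut on a small invented age series.

```python
# Equal-width vs. equal-frequency binning of an invented age attribute.
import pandas as pd

ages = pd.Series([22, 25, 27, 31, 35, 38, 44, 51, 63, 70])

equal_width = pd.cut(ages, bins=3)     # equal-interval (equiwidth) binning
equal_freq = pd.qcut(ages, q=3)        # equal-frequency (equidepth) binning

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```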
4.0 Conclusion
Therefore, data preparation and preprocessing are very important steps in the data mining process.
5.0 Summary
In this unit we have learnt that:
v Attributes can be in different data formats
v Data preparation is one of the important tasks in data mining
v Data preprocessing is a preliminary processing of data in order to prepare it for further
analysis
v Data has to be prepared for a number of reasons: real-world data is dirty, incomplete
and noisy.
v Data preprocessing involves a number of tasks, which include data cleaning, data
transformation, attribute/feature construction, data reduction, discretization and
concept hierarchy generation, and data parsing and standardization.
Mosud, Y. Olumoye (2009), Introduction to Data Mining and Data Warehousing , Lagos:
Rashmoye Publications
Usama, F., Gregory, P., and Padhraic, S., From Data Mining to Knowledge Discovery in
Databases, Article of American Association for Artificial Intelligence Press, (1996).
Jeffrey W. Seifert, (Dec. 2004), Data Mining: An Overview . From: Congressional Research
Service, The Library of Congress.
Cross Industry Standard Process for Data Mining , Retrieved on 19/09/2009. Available
Online: http: // www.crisp-dm.org/
1.0 Introduction
It is crucial to recognize that a systematic approach is essential for successful data mining.
Many vendors and consulting organizations have specified a process designed to guide the
user, especially someone new to building predictive models, through a sequence of steps that
will lead to good results. This unit examines the necessary steps of successful data mining
using the Two Crows process model.
2.0 Objectives
At the end of this unit, you should be able to:
v Understand the basic steps of data mining for knowledge discovery
Recently, a consortium of vendors and users consisting of NCR Systems Engineering
Copenhagen (Denmark), Daimler-Benz AG (Germany), SPSS/Integral Solutions Ltd.
(England) and OHRA Verzekeringen en Bank Groep B.V. (The Netherlands) has been
developing a specification called CRISP-DM (Cross Industry Standard Process for Data
Mining). SPSS uses the 5 A's (Assess, Access, Analyze, Act and Automate) and SAS uses
SEMMA (Sample, Explore, Modify, Model, Assess). CRISP-DM is similar to process models
from other companies, including the one from Two Crows Corporation. As of September
1999, CRISP-DM was a work in progress (Two Crows Corporation, 2005).
In order to make the best use of data mining you must start with a clear statement of your
goal; for example, you may wish to increase the response to a direct mail campaign.
Depending on your specific goal, such as increasing the response rate or increasing the value
of a response, you will build a very different model. An effective statement of the problem
will include a way of measuring the results of the knowledge discovery project, and it may
also include a cost justification.
The data to mine should be collected in a database and this does not necessarily mean that a
database management system must be used. Depending on the amount of data, complexity of
the data and use to which it is to be put, a flat file or even a spreadsheet may be adequate. By
and large, it is not a good idea to use your corporate data warehouse for this; it is better to
create a separate data mart. Almost certainly you will be modifying the data from the data
warehouse. You may also want to bring in data from outside your company to overlay on the
data warehouse data or you may want to add new fields computed from existing fields. You
may need to gather additional data through surveys.
Other people building different models from the warehouse (some of whom will use the same
data as you) may want to make similar alterations to the warehouse. However, data
warehouse administrators do not look kindly on having data changed in what is
unquestionably a corporate resource (Two Crows, 2005).
Another reason for a separate database is that the structure of the data warehouse may not
easily support the kinds of exploration you need to do to understand this data. These include
queries summarizing the data, multi-dimensional reports (sometimes referred to as pivot
tables), and many different kinds of graphs or visualizations. Also, you may want to store this
data in a different database management system (DBMS) with a different physical
design than the one used for your corporate data warehouse.
A data collection report (DCR) lists the properties of different source data sets. Some of the
elements in this report include the following:
v Source of data (either internal application or outside vendor)
v Owner
v Person/organization responsible for maintaining the data
v Database administration (DBA)
v Cost (if purchased)
v Storage organization (oracle database, VSAM file etc)
v Size in table, rows, records etc.
v Size in bytes
v Physical storage (CD-ROM, tape, server etc)
v Security requirements
v Restrictions on use
v Privacy requirements.
You should be sure to take note of any special security and privacy issues that your data
mining database will inherit from the source data. For example, datasets in some countries
are constrained in their use by privacy regulations.
3.3.3 Selection
The next step after describing the data is selecting the subset of data to mine. This is not the
same as sampling the database or choosing prediction variables. Instead, it is a gross
elimination of irrelevant or unrequired data. Other criteria for excluding data may include
resource constraints, cost, restrictions on data use, or quality problems
There are different types of data quality problems. These include single fields having an
incorrect value, incorrect combinations of values in individual fields (e.g. pregnant males)
and missing data. Simply throwing out every record with a missing field may leave you with
a very small database or an inaccurate picture of the whole database. Recognizing that you
may not be able to fix all the problems, you will need to work around them as best as possible;
it is preferable and more cost-effective to put procedures and checks in place to avoid data
quality problems. However, you must build the models you need with the data you have now,
while treating better source data as something to work toward for the future.
Data integration and consolidation combine data from different sources into a single mining
database and require reconciling differences in data values from the various sources.
Improperly reconciled data is a major source of quality problems, and there are often large
differences in the way data is defined and used in different databases (Two Crows
Corporation, 2005). Some inconsistencies may not be easy to uncover, such as different
addresses for the same customer, making them more difficult to resolve. For instance, the
same customer may have different names or, worse, multiple customer identification numbers.
Also, the same name may be used for different entities (homonyms), or different names may
be used for the same entity (synonyms).
The process of building predictive models requires a well-defined training and validation
protocol in order to ensure the most accurate and robust predictions. This type of protocol is
sometimes called supervised learning. The idea of supervised learning is to train or estimate
your model on a portion of the data, then test and validate it on the remainder of the data. A
model is built when the cycle of training and testing is completed. At times a third data set,
referred to as the validation data set, is needed, because the test data may influence features
of the model and the validation set acts as an independent measure of the model's accuracy.
Training and testing the data mining model requires the data to be split into at least two
groups: one for model training (i.e. estimation of the model parameters) and one for model
testing. The model is generated using the training database and used to predict the outcomes
in the test database, and the resulting accuracy rate is a good estimate of how the model will
perform on future databases that are similar to the training and test databases. If you fail to
use separate training and test data, the accuracy of the model will be overestimated.
Simple Validation: Simple validation is the most basic testing method. To carry it out, you set
aside a percentage of the database and do not use it in any way in model building and
estimation. This percentage is typically between 5% and 33%. For all the future calculations
to be correct, the division of the data into two groups must be random, so that the training and
test data sets both reflect the data being modeled. In building a single model, this simple
validation may need to be performed several times; for instance, when using a neural net,
each training pass through the net is sometimes tested against a test database.
Cross Validation: If you have only a modest amount of data (a few thousand rows) for
building the model, you cannot afford to set aside a percentage of it for simple validation.
Cross validation is a method that lets you use all your data. The data is randomly divided into
two equal sets in order to estimate the predictive accuracy of the model. First, a model is built
on the first set and used to predict the outcomes in the second set, and an error rate is
calculated. Then a model is built on the second set and used to predict the outcomes in the
first set, and again an error rate is calculated. Finally, a model is built using all the data.
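A minimal sketch of this two-fold procedure follows, using scikit-learn on a synthetic dataset; the choice of logistic regression and the data are assumptions for illustration.

```python
# Two-fold cross validation: train on one half, test on the other, swap,
# average the error, then fit a final model on all the data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

errors = []
for train_idx, test_idx in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    errors.append(1 - model.score(X[test_idx], y[test_idx]))

print("estimated error rate:", np.mean(errors))
final_model = LogisticRegression(max_iter=1000).fit(X, y)  # model built on all the data
```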
Bootstrapping: This is another technique for estimating the error of a model; it is primarily
used with very small data sets. As in cross validation, the model is built on the entire dataset.
Then numerous data sets, called bootstrap samples, are created by sampling from the original
data set: after each case is sampled it is replaced, and a case is selected again until the entire
bootstrap sample is created. Note that records may occur more than once in the data sets thus
created. A model is built on this data set, and its error rate is calculated; this is called the
resubstitution error.
necessarily a correct model, because there are always implied assumptions in the model.
Moreover, the data used to build the model may fail to match the real world in some unknown
way, leading to an incorrect model. It is therefore important to test a model in the real world.
If a model is used to select a subset of a mailing list, do a test mailing to verify the model;
if a model is used to predict credit risk, try out the model on a small set of applicants before
full deployment. The higher the risk associated with an incorrect model, the more important
it is to construct an experiment to check the model results (Two Crows Corporation, 2005).
A data mining model is often applied to one event or transaction at a time, such as scoring a
loan application for risk. The amount of time needed to process each new transaction, and the
rate at which new transactions arrive, will determine whether a parallelized algorithm is
required. Thus, while loan applications can easily be evaluated on modest-sized computers,
monitoring credit card transactions or cellular telephone calls for fraud would require a
parallel system to deal with the high transaction rate.
Model Monitoring: You need to measure how well your model is working after you deploy it,
even when you think you have finished because the model appears to be working well. You
must continually monitor the performance of the model; from time to time the model will
have to be retested, retrained and possibly completely rebuilt.
4.0 Conclusion
The process of mining data therefore involves seven basic steps; the process is not linear, and
you need to loop back to previous steps for successful data mining.
5.0 Summary
In this unit we have learnt that:
v Data mining for knowledge discovery is made up of some basic steps; these include
defining the business problem, building the data mining database, exploring the data,
preparing the data for modeling, building the model, evaluating the model, and
deploying the model and results.
7.0 Further Reading and Other Resources
Mosud, Y. Olumoye (2009), Introduction to Data Mining and Data Warehousing, Lagos: Rashmoye Publications.
Usama, F., Gregory, P. and Padhraic, S. (1996), From Data Mining to Knowledge Discovery in Databases, American Association for Artificial Intelligence Press.
Leon, A. and Leon, M. (1999), Fundamentals of Information Technology, New Delhi: Leon Press Channel and Vikas Publishing House.
Introduction to Data Mining and Knowledge Discovery, Third Edition, Two Crows Corporation.
1.0 Introduction
The purpose of this unit is to give the reader some idea of the types of activities in which data mining is already being used and of the companies using them. The application areas discussed include data mining in banking and finance, retail, telecommunications, healthcare, credit card companies, transportation, surveillance, games, business, science and engineering, and spatial data.
2.0 Objectives
At the end of this unit you should be able to:
v Understand the various applications of data mining in our societies
Two critical factors for successful data mining are a large, well-integrated data warehouse and a well-defined understanding of the business process within which data mining is to be applied, such as customer prospecting, retention, campaign management and so on.
There are many software applications available in the market that use data mining techniques for stock prediction. One such application, developed by Neural Applications Corporation, is a stock-prediction system that makes use of neural networks.
In the banking industry, the most widespread use of data mining is in the area of fraud detection. Although the use of data mining in banking has not been prominent in Nigeria, it has long been in place in the advanced countries for credit fraud detection, where it is used to monitor payment card accounts, resulting in a healthy return on investment. However, finding banks that acknowledge using data mining is not easy, given their proclivity for silence. One can assume that most large banks are performing some sort of data mining, though many have policies not to discuss it.
In addition, many financial institutions have been the subject of articles about their sophisticated data mining and modeling of their customers' behaviour. The only significant problem with data mining in this area is the inability to leverage data mining studies into actionable results. For example, while a bank may know that customers meeting certain criteria are likely to close their accounts, it is another thing to figure out a strategy to do something about it.
The direct-mail industry is an area where data mining, or data modeling, is widely used. Almost every type of retailer (catalogers, consumer retail chains, grocers, publishers, business-to-business marketers and packaged goods manufacturers) makes use of customer segmentation, which is a clustering problem in data mining. Many vendors offer customer segmentation packages, e.g. Pilot Discovery Server Segment Viewer, a customer segmentation software package that uses segmentation to help in direct-mailing campaigns. IBM has also used data mining for several retailers to analyze shopping patterns within stores based on point of sale (POS) information.
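A minimal Python sketch of this kind of clustering-based segmentation follows; the behavioural attributes, the invented data and the choice of four segments are illustrative assumptions, not features of any of the packages mentioned above.

```python
# Segment customers with k-means on a few behavioural attributes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# columns: annual spend, number of orders, days since last purchase (hypothetical data)
customers = rng.normal(loc=[500, 12, 60], scale=[200, 5, 30], size=(1000, 3))

X = StandardScaler().fit_transform(customers)             # put attributes on a common scale
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

for s in range(4):
    print(f"Segment {s}: {np.sum(segments == s)} customers, "
          f"mean spend {customers[segments == s, 0].mean():.0f}")
```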
Retailers are interested in many different types of data mining studies. In the area of
marketing, retailers are interested in creating data-mining models to answer questions like:
v How much are customers likely to spend over long periods of time?
v What is the frequency of customer purchasing behaviour?
v What are the best types of advertisements to reach certain segments?
v What advertising media are most effective at reaching customers?
v What is the optimal timing at which to send mailers?
Also, in identifying customer profitability, retailers may wish to build models to answer questions like:
v How does a retailer retain profitable customers?
v What are the significant customer segments that buy products?
Call Detail Data: Every time a call is placed on a telecommunications network, descriptive information about the call is saved as a call detail record. The number of call detail records generated and stored is always huge. For example, the customers of GSM telecommunications operators in Nigeria generate no fewer than one million call detail records per day. Call detail records include sufficient information to describe the important characteristics of each call. At a minimum, each call detail record includes the originating and terminating phone numbers, the date and time of the call and the duration of the call. Call detail records are generated in real time and are therefore available almost immediately for data mining.
Customer Data
Just like other large businesses, telecommunication companies have millions of customers. By necessity, this implies maintaining a database of information about these customers. This information includes name and address, as well as other details such as service plan and contract information, credit score, family income and payment history. This information may
even be supplemented with data from external sources, such as from credit reporting
agencies. The customer data is often used in conjunction with other data in order to improve
results. For instance, customer data is typically used to supplement call detail data when
trying to identify phone fraud.
The applications of data mining in the telecommunications industry can be grouped into three areas: fraud detection, marketing/customer profiling and network fault isolation.
Subscription fraud occurs when a customer opens an account with the intention of not
paying for the account charges. Superimposition fraud involves a customer opening a
legitimate account with some legitimate activity, but also includes some superimposed
illegitimate activity by a person other than the account holder. Superimposition fraud poses a
bigger challenge for the telecommunications industry and for this reason we focus on
applications for identifying this type of fraud.
Such applications should operate in real time on the call detail records and, as soon as fraud is detected or suspected, should trigger some action. This action may be to immediately block the call or deactivate the account, or it may involve opening an investigation, which will result in a call to the customer to verify the legitimacy of the account activity. The commonest method of identifying fraud is to build a profile of a customer's calling behaviour and compare recent activity against this profile; this data mining application therefore relies on deviation detection. The calling behaviour is captured by summarizing the call details for a customer, and if the call detail summaries are updated in real time, fraud can be identified soon after it occurs (Gary M. Weiss).
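As a minimal illustration of such deviation detection, the Python sketch below profiles a customer's historical international calling minutes and flags a day that deviates sharply from the profile; the figures, the chosen feature and the threshold are assumptions for illustration only.

```python
import numpy as np

# historical daily call summaries for one customer (minutes of international calls)
history = np.array([3, 0, 5, 2, 4, 1, 0, 3, 2, 4], dtype=float)
profile_mean, profile_std = history.mean(), history.std() + 1e-9

def is_suspicious(todays_minutes, z_threshold=4.0):
    """Flag the day if it deviates more than z_threshold standard deviations from the profile."""
    z = (todays_minutes - profile_mean) / profile_std
    return z > z_threshold

print(is_suspicious(4))    # an ordinary day -> False
print(is_suspicious(95))   # a sudden surge in international minutes -> True
```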
Marketing/Customer Profiling: Telecommunication companies store call detail records which precisely describe the calling behaviour of each customer. This information can be used to profile the customers, and these profiles can then be used for marketing and/or forecasting purposes.
Data mining has already been used extensively in the medical industry. For example, NeuroMedical Systems used neural networks to build a Pap smear diagnostic aid. Vysis used neural networks to perform protein analyses for drug development. The University of Rochester Cancer Center and the Oxford Transplant Center use Knowledge-SEEKER, a decision tree technology, to help with their research. Also, the Southern California Spinal Disorders Hospital uses Information Discovery to mine its data; Information Discovery quotes a doctor as saying, "Today alone, I came up with a diagnosis for a patient who did not even have to go through a physical exam."
With the use of data mining technology, a pharmaceutical company can analyze its recent sales force activity and its results to improve the targeting of high-value physicians and determine which marketing activities will have the greatest impact in the next few months. The data needs to include competitor market activity as well as information about the local healthcare systems. The results can be distributed to the sales force via a wide-area network that enables the representatives to review the recommendations from the perspective of the key attributes in the decision process.
access to data, enforcing laws and policies, and ensuring detection of misuse of the information obtained.
(3). Advanced collaborative and decision support tools: The advanced collaborative
reasoning and decision support technologies would allow analysts from different agencies to
share data.
Computer-Assisted Passenger Prescreening System (CAPPS II)
CAPPS II is similar to TIA and represented a direct response to the September 11, 2001 terrorist attacks. With the images of airliners flying into buildings still very fresh in people's minds, air travel was now widely viewed not only as a valuable terrorist target, but also as a weapon for inflicting larger harm. The CAPPS II initiative was intended to replace the original CAPPS then in use. In response to a growing number of airplane bombings, the existing CAPPS (originally called CAPS) had been developed through a grant provided by the Federal Aviation Administration (FAA) to Northwest Airlines, with a prototype system tested in 1996. In 1997, other major carriers also began work on screening systems, and by 1998 most U.S.-based airlines had voluntarily implemented CAPS, with the remaining few working toward implementation.
The current CAPPS system is a rule-based system that uses the information provided by the passenger when purchasing the ticket to determine whether the passenger falls into one of two categories: selectees, those requiring additional security screening, and those that do not. Moreover, CAPPS compares the passenger's name to those on a list of known or suspected terrorists. CAPPS II was described by the TSA as an enhanced system to confirm the identities of passengers and to identify foreign terrorists or persons with terrorist connections before they can board U.S. aircraft. CAPPS II would send the information provided by the passenger in the Passenger Name Record (PNR), including full name, address, phone number and date of birth, to commercial data providers for comparison in order to authenticate the identity of the passenger.
The commercial data provider would then transmit a numerical score back to the TSA indicating a particular risk level. Passengers with a green score would undergo normal screening, while passengers with a yellow score would undergo additional screening; passengers with a red score would not be allowed to board the flight and would receive the attention of law enforcement. While drawing on information from commercial databases, the TSA stated that it would not see the actual information used to calculate the scores and that it would not retain the traveler's information.
Rather than indiscriminately sending mail, a company can concentrate its efforts on prospects that are predicted to have a high likelihood of responding to an offer. More sophisticated methods can be used to optimize resources across campaigns so that one may predict which channel and which offer an individual is most likely to respond to, across all potential offers. Data clustering can also be used to automatically discover the segments or groups within a customer data set.
Businesses employing data mining may see a return on investment, but they also recognize that the number of predictive models can quickly become very large. Instead of one model predicting which customers will churn, a business could build a separate model for each region and customer type. And instead of sending an offer to all people that are likely to churn, it may only want to send offers to customers that are likely to take up the offer. Finally, it may also want to determine which customers are going to be profitable over a period of time and only send the offer to those that are likely to be profitable. In order to maintain this quantity of models, they need to manage versions and move toward automated data mining.
Data mining is also helpful to human-resources departments in identifying the characteristics of their most successful employees. Information obtained, such as the universities attended by highly successful employees, can help human resources focus recruiting efforts accordingly.
Another example of data mining, which is often referred to as market basket analysis, relates
to its use in retail sales; for example if a clothing store records the purchases of customers, a
data mining system could identify those customers that favour silk shirts over cotton ones.
Market basket analysis is also used to identify the purchase patterns of the Alpha consumer.
Alpha consumers are people that play a key role in connecting with the concept behind a
product, then adopting that product, and finally validating it for the rest of the society. Data
mining is a highly effective tool in the catalog marketing industry. Catalogers have a rich
history of customer transactions on millions of customers dating back several years. Data
mining tools can identify patterns among customers and help identify the most likely
customers to respond to upcoming mailing campaigns.
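As a minimal illustration of market basket analysis, the Python sketch below computes support and confidence for item pairs from a handful of invented transactions; a real retailer or cataloger would of course run this over millions of baskets.

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"silk shirt", "tie"}, {"cotton shirt", "jeans"},
    {"silk shirt", "tie", "belt"}, {"silk shirt", "belt"},
    {"jeans", "belt"},
]
n = len(transactions)

item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(pair for t in transactions for pair in combinations(sorted(t), 2))

for (a, b), c in pair_counts.items():
    support = c / n                       # fraction of baskets containing both items
    confidence = c / item_counts[a]       # confidence of the rule a -> b
    if support >= 0.4:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```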
In the study of human genetics, an important goal is to understand the mapping relationship between inter-individual variation in human DNA sequences and variability in disease susceptibility. This is very important in helping to improve the diagnosis, prevention and treatment of diseases. The data mining technique used to perform this task is known as multifactor dimensionality reduction.
In electrical power engineering, data mining techniques are widely used for monitoring high-voltage equipment. The reason for condition monitoring is to obtain valuable information on the fitness status of the equipment's insulation. Data clustering techniques such as the Self-Organizing Map (SOM) have been applied to the vibration monitoring and analysis of transformer On-Load Tap Changers (OLTCs). Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms.
Other areas of data mining application include mining biomedical data facilitated by domain ontologies, mining clinical trial data, and traffic analysis using the self-organizing map (SOM). Recently, a data mining methodology has been developed to mine large collections of electronic health records for temporal patterns associating drug prescriptions with medical diagnoses.
4.0 Conclusion
Data mining has become increasingly common in both the private and public sectors.
Industries such as banking and finance, retail, healthcare and telecommunications commonly use data mining to reduce costs, enhance research and increase sales. In the public sector, data mining applications were initially used as a means of detecting fraud and waste, but they have grown to be used for purposes such as measuring and improving program performance.
5.0 Summary
In this unit we have learnt that:
v The applications of data mining in the banking and finance industry, retail, telecommunications, healthcare, credit card companies, transportation companies, surveillance, games, business, spatial data, and science and engineering include fraud detection, risk evaluation and forecasting stock prices and financial disasters, as well as marketing and network fault isolation.
2. Briefly discuss the roles of data mining in the following application areas:
(i). Spatial data
(ii). Science and engineering
(iii). Business
(iv). Telecommunication
Data Management and Data Warehouse Domain Technical Architecture, June 6, 2002.
Jeffrey W. Seifert (Dec. 2004), Data Mining: An Overview, Congressional Research Service, The Library of Congress.
Philip Baylis (co-author), Better Health Care With Data Mining, Shared Medical Systems Limited, UK.
Leon, A. and Leon, M. (1999), Fundamentals of Information Technology, New Delhi: Leon Press Channel and Vikas Publishing House.
1.0 Introduction
Over recent years, data mining has established itself as one of the major disciplines in computer science, with growing industrial impact. Without any doubt, research in data mining will continue and even increase over the coming decades.
In this unit, we shall examine the present and future trends in the field of data mining, with a focus on those thought to hold the most promise and applicability for future data mining.
2.0 Objectives
At the end of this unit you should be able to:
v Understand the present and the future trend in data mining.
v Know the major trends in technologies and methods.
What is the future of data mining? Certainly, the field of data mining has made great strides in the past years, and many industry analysts and experts in the area are optimistic that the future will be bright. There is evident growth in the area of data mining. Many industry analysts and research firms have projected a bright future for the entire data mining/KDD area and its related field, Customer Relationship Management (CRM). Spending in the area of business intelligence, which encompasses data mining, is increasing in the U.S. Moreover, data mining projects are expected to grow at a geometric rate because many consumer-based industries with an e-commerce orientation will utilize some kind of data mining model.
As discussed earlier, the data mining field is very broad, and many methods and technologies have become dominant within it. Not only have there been developments in the conventional areas of data mining; other areas have also been identified as especially important future trends in the field.
Distributed data mining (DDM) offers a different approach from traditional centralized analysis, combining localized data analysis with a "global data model". In more specific terms, this is specified as:
v Performing local data analysis for generating partial data models, and
v Combining the local data models from different data sites in order to develop the
global model. (Jeffrey Hsu).
The global model combines the results of the separate analyses. However, the global model produced may become incorrect or ambiguous, especially if the data in different locations have different features or characteristics. This problem is especially critical when the data in the distributed sites are heterogeneous rather than homogeneous; such heterogeneous data sets are known as vertically partitioned data sets.
An approach proposed by Kargupta et al. (2000), the collective data mining (CDM) approach, handles vertically partitioned data sets using the notion of orthonormal basis functions and computes the basis coefficients to generate the global model of the data (Jeffrey Hsu; Kargupta et al., 2000).
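The following Python sketch is not Kargupta et al.'s CDM algorithm; it is only a simplified illustration of the local-then-global idea, with an invented data set split across three "sites" and local logistic-regression models whose predictions are averaged into a global model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
sites = np.array_split(np.arange(len(X)), 3)          # three "distributed" data sites

# Step 1: perform local data analysis at each site to build a partial model.
local_models = [LogisticRegression(max_iter=1000).fit(X[idx], y[idx]) for idx in sites]

# Step 2: combine the local models into a global model by averaging their predictions.
def global_predict(samples):
    probs = np.mean([m.predict_proba(samples)[:, 1] for m in local_models], axis=0)
    return (probs >= 0.5).astype(int)

print("Global model accuracy:", np.mean(global_predict(X) == y))
```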
Visualizing patterns such as classifiers, clusters and associations on portable devices is usually difficult; the small display areas pose serious challenges to interactive data mining environments. Data management in a mobile environment is also a challenging issue. Moreover, the sociological and psychological aspects of the integration between data mining technology and our lifestyle are yet to be explored. The key issues to consider include theories of ubiquitous data mining (UDM); advanced algorithms for mobile and distributed applications; data management issues; mark-up languages and other representation techniques; integration with database applications for mobile environments; architectural issues (architecture, control, security and communication); specialized mobile devices for UDM; agent interaction, cooperation, collaboration, negotiation and organizational behaviour; applications of UDM in business, science, engineering, medicine and other disciplines; location management issues in UDM; and technology for web-based applications of UDM (Jeffrey Hsu; Kargupta and Joshi, 2001).
In addition to the traditional forms of hypertext and hypermedia, together with the associated
hyperlink structures, there are also inter-document structures which exist on the web, such as
the directories employed by such services as Yahoo or the Open Directory project
(http://dmoz.org). These taxonomies of topics and subtopics are linked together to form a
large network or hierarchical tree of topics and associated links pages.
Some of the important data mining techniques used for hypertext and hypermedia data mining include classification (supervised learning), clustering (unsupervised learning), semi-supervised learning and social network analysis.
1. Classification (Supervised Learning)
In classification, documents are assigned to predefined categories on the basis of their classification attributes. Methods used for classification include naïve Bayes classification, parameter smoothing, dependence modeling and maximum entropy (Jeffrey Hsu; Chakrabarti, 2000).
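As a small, hedged illustration of naïve Bayes classification of documents, the sketch below uses scikit-learn on a tiny invented corpus; the category names and documents are assumptions made purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["stock market prices rise", "team wins the football final",
        "shares fall on profit warning", "coach praises the players"]
labels = ["finance", "sport", "finance", "sport"]

vec = CountVectorizer()
X = vec.fit_transform(docs)                 # bag-of-words features
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vec.transform(["market shares climb again"])))   # likely ['finance']
```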
2. Unsupervised Learning
This differs from classification in that, while classification involves the use of training data, clustering is concerned with creating hierarchies of documents based on similarity and organizing the documents according to that hierarchy. Intuitively, this would result in more similar documents being placed at the leaves of the hierarchy, with less similar sets of documents being placed higher up, closer to the root of the tree. Techniques used for unsupervised learning include k-means clustering, agglomerative clustering, random projections and latent semantic indexing (Jeffrey Hsu; Chakrabarti, 2000).
3. Semi-Supervised Learning
This is an important area of hypermedia-based data mining. It is the case where there are both labeled and unlabeled documents, and there is a need to learn from both types of documents.
Another developing area in multimedia data mining is that of audio data mining (mining music). The idea is basically to use audio signals to indicate patterns in data or to represent the features of data mining results. The basic advantage of audio data is that, while a technique such as visual data mining may disclose interesting patterns from observing graphical displays, it requires users to concentrate on watching patterns, which can become monotonous. When data are represented as a stream of audio instead, it is possible to transform patterns into sound and music and to listen to pitch, rhythm, tune and melody in order to identify anything interesting or unusual.
Analyzing spatial and geographic data includes such tasks as understanding and browsing spatial data, and uncovering relationships between spatial data items as well as between non-spatial and spatial items. Applications of these would be useful in such fields as remote sensing, medical imaging, navigation and related uses. Some of the techniques and data structures used when analyzing spatial and related types of data include spatial data warehouses, spatial data cubes and spatial OLAP. Spatial data warehouses can be defined as those which are subject-oriented, integrated, non-volatile and time-variant. Some of the challenges in constructing a spatial data warehouse include the difficulties of integrating data from heterogeneous sources, and also applying on-line analytical processing that is not only relatively fast but also offers some forms of flexibility.
By and large, spatial data cubes, which are components of spatial data warehouse, are
designed with three types of dimensions and two types of measures. The three types of
dimensions include:
v The non-spatial dimension: data that is non-spatial in nature
v The spatial-to-non-spatial dimension: the primitive level is spatial but the higher-level generalization is non-spatial, and
v The spatial-to-spatial dimension: both the primitive and higher levels are spatial.
In terms of measures, there are both numerical (numbers only) and spatial (pointers to spatial objects) measures used in spatial data cubes.
Besides the implementation of a data warehouse for spatial data, there is also the issue of the analysis which can be performed on the data. Some of these analyses include association analysis, clustering methods and the mining of raster databases.
Time-series and sequence data mining is concerned with data that is ordered in a sequence. Generally, one aspect of mining time series data focuses on the goal of identifying movements or components which exist within the data (trend analysis). These include long-term or trend movements, seasonal variations, cyclical variations and random movements.
Other techniques that can be used on these kinds of data include similarity search, sequential pattern mining and periodicity analysis.
v Similarity Search: This is concerned with the identification of a pattern sequence which is close or similar to a given pattern. This form of analysis can be broken down into two subtypes: whole sequence matching and subsequence matching. Whole sequence matching attempts to find all sequences which bear a likeness to one another, while subsequence matching attempts to find those patterns which are similar to a specified given sequence (a sketch of subsequence matching follows this list).
v Sequential Pattern Mining: This focuses on the identification of sequences that occur often in a time series or sequence of data. It is particularly useful in the analysis of customers, where certain buying patterns can be identified, such as the likely follow-up purchase after buying a certain electronics item or computer.
v Periodicity Analysis: This attempts to analyze the data from the perspective of identifying patterns which repeat or recur in a time series. This form of data mining analysis can be categorized as full periodic, partial periodic or cyclic periodic. Full periodicity is the situation where all of the data points in time contribute to the behaviour of the series. This is in contrast to partial periodicity, where only certain points in time contribute to the series' behaviour, while cyclical periodicity relates to sets of events that occur periodically.
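As a minimal illustration of subsequence matching, the following Python sketch slides a query pattern over a longer series and reports the offsets whose Euclidean distance to the query falls below a threshold; the series, the query and the threshold are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 400)) + rng.normal(0, 0.05, 400)   # noisy long series
query = np.sin(np.linspace(0, 2, 40))                                 # pattern to look for

def subsequence_matches(series, query, threshold=1.0):
    w = len(query)
    distances = [np.linalg.norm(series[i:i + w] - query)
                 for i in range(len(series) - w + 1)]
    return [i for i, d in enumerate(distances) if d < threshold]

print(subsequence_matches(series, query)[:5])   # offsets of the closest subsequences
```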
Other aspects of usability are the intuitiveness of adjusting the parameters and the parameter sensitivity. If the results are not strongly dependent on slight variations of the parametrization, adjusting the algorithms becomes less complex. To fulfill these requirements, there is a need to distinguish between two types of parameters and afterwards to propose goals for future data mining methods.
v The first type of parameter, called type I, is used for tuning data mining algorithms to derive useful patterns. For instance, the value of k for a k-NN classifier directly influences the achieved classification.
v The second type of parameter, called type II, more or less describes the semantics of the given objects. For instance, the cost matrix used by edit distance has to be based on domain knowledge, and this varies from application to application. The important point about this type of parameter is that it is used to model additional constraints from the real world.
Based on these considerations, the following proposals can be formulated for future data
mining solutions:
(i). Avoid type I parameters if possible when designing algorithms.
(ii). If type I parameters are necessary, try to find the optimal parameter settings automatically (a sketch of this idea follows the list). For many data mining algorithms, it might be possible to integrate the given parameters into the underlying optimization problem.
(iii). Instead of finding patterns for one possible value of a type II parameter, try to
simultaneously derive patterns for each parameter setting and store them for post
processing.
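As a minimal sketch of proposal (ii), the Python code below finds a type I parameter (k for a k-NN classifier, the example used above) automatically via cross-validated grid search instead of asking the user to tune it; the data set and parameter grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 11, 15]},
    cv=5,                          # 5-fold cross validation scores each candidate setting
)
search.fit(X, y)
print("Best k:", search.best_params_["n_neighbors"])
```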
4.0 Conclusion
With the unstoppable and unavoidable growth in data collection in the years ahead, data
mining is playing an increasingly important role in the way massive data sets are analyzed. Trends clearly indicate that future decision-making systems will rely on even quicker and more reliable models for data analysis. To achieve this, current algorithms and computing systems have to be optimized and tuned to effectively process the large volumes of raw data to be seen in the future.
5.0 Summary
In this unit we have learnt that:
v The field of data mining and knowledge discovery has made giant strides in the past, and many experts in the field are optimistic that the future will be bright.
v There are many data mining trends in terms of technologies and methodologies, which include distributed data mining, hypertext/hypermedia mining, ubiquitous data mining, time series/sequence data mining, constraint-based and phenomenal data mining, and increasing usability.
Usama, F., Gregory, P. and Padhraic, S. (1996), From Data Mining to Knowledge Discovery in Databases, American Association for Artificial Intelligence Press.
S. Sumathi and S.N. Sivanandam (2006), Introduction to Data Mining Principles, Studies in Computational Intelligence (SCI) 29, 1-20.
Leon, A. and Leon, M. (1999), Fundamentals of Information Technology, New Delhi: Leon Press Channel and Vikas Publishing House.
Hans-Peter K., Karsten M.B., Peer K., Alexey P., Matthias S. and Arthur Z. (March 2007), Future Trends in Data Mining, Springer Science + Business Media.
Jayaprakash, P., Joseph Z., Berkin O. and Alok C., Accelerating Data Mining Workloads: Current Approaches and Future Challenges in System Architecture Design.
Jeffrey, H., Data Mining Trends and Developments: The Key Data Mining Technologies and Applications for the 21st Century.
1.0 Introduction
Data warehouses usually contain historical data derived from transaction data, but they can include data from other sources. A data warehouse also separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
This unit examines the meaning of data warehouse, its goals and characteristics, evolution,
advantages and disadvantages, its components, applications and users.
2.0 Objectives
At the end of this unit, you should be able to:
v Define the term data warehouse
v Understand the goals and characteristics of data warehouse
v Know the major components of data warehouse
v Understand the structure and approaches to storing data in data warehouse
v Describe the users and application areas of data warehouse
The father of data warehousing, William H. Inmon, defined a data warehouse as follows: "A data warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data in support of management decisions."
As mentioned by W.H. Inmon in one of his papers, the data warehouse environment is the
foundation of Decision Support Systems (DSS) and is about molding data into information
and storing this information based on the subject rather than application.
A data warehouse is designed to house a standardized, consistent, clean and integrated form of data sourced from the various operational systems in use in the organization, structured in a way that specifically addresses reporting and analytic requirements. One of the primary
reasons for developing a data warehouse is to integrate operational data from various sources
into a single and consistent architecture that supports analysis and decision making in an
organization.
The operational (legacy) systems mentioned in the definition are the systems used to create, update and delete the production data that feed the data warehouse. A data warehouse is analogous to a physical warehouse: the operational systems create the data parts that are loaded into the warehouse, and some of those parts are summarized into information components and stored there. The users of the data warehouse make requests, and the information products created from the components and parts stored in the warehouse are delivered to them. A data warehouse is typically a blending of technologies, including relational and multidimensional
databases, client/server architecture, extraction/transformation programs, graphical interfaces
and more.
A data warehouse is therefore a collection of data that is:
v Subject-oriented
v Integrated
v Non-volatile and
v Time-variant
(1). Subject-Oriented
The main objective of storing data is to facilitate the decision process of a company, and within any company data naturally concentrates around subject areas. This leads to the gathering of information around these subjects rather than around the applications or processes (Muhammad, A.S.).
(2). Integrated
The data that feed a data warehouse are scattered across different tables, databases or even servers. Data warehouses must put data from these different sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When this is achieved, they are said to be integrated.
(3). Non-Volatile
Non-volatile means that information in the data warehouse does not change each time an
operational process is executed. Information is consistent regardless of when and how the
warehouse is accessed.
(4). Time-Variant
The value of operational data changes on the basis of time. The time-based archival of data from operational systems to the data warehouse makes the value of data in the data warehouse a function of time. Since the data warehouse gives an accurate picture of operational data at some given time, and changes in the warehouse data are driven by time-based changes in operational data, the data in the data warehouse are called time-variant.
Data warehousing is a process of centralized data management and retrieval. Just like data mining, it is a relatively new term, although the concept itself has been around for years. Organizations basically started with relatively simple uses of data warehousing; over the years, more sophisticated uses have evolved. The following basic stages of the use of data warehouses can be distinguished:
v Off Line Operational Database : The data warehouses at this stage were developed
by simply copying the data off an operational system to another server where the
processing load of reporting against the copied data does not impact the operational
system s performance.
v Off Line Data Warehouse: The data warehouses at this stage are updated from data in the operational systems on a regular basis, and the warehouse data is stored in a data structure designed to facilitate reporting.
v Real Time Data Warehouse : The data warehouse at this level is updated every time
an operational system performs a transaction, for example, an order or a delivery.
v Integrated Data Warehouse : The data warehouses at this level are updated every
time an operational system carries out a transaction. The data warehouse then
generates transactions that are passed back into the operational systems.
(iii). A data warehouse provides a common data model for all data of interest regardless of
the data s source. This makes it easier to report and analyze information than it would be if
multiple data models were used to retrieve information such as sales invoices, order receipts,
general ledger charges etc.
(iv). Information in the data warehouse is under the control of data warehouse users so that, even if the source system data is purged over time, the information in the warehouse can be stored safely for extended periods of time.
(v). Because the data warehouse is separated from operational systems, it provides for the retrieval of data without slowing down operational systems.
(vi). Enhanced Customer Service: Data warehouses can work in conjunction with, and hence enhance the value of, operational business applications, notably customer relationship management (CRM) systems, by correlating all customer data via a single data warehouse architecture.
(viii). Before loading data into the data warehouse, inconsistencies are identified and
resolved. This greatly simplifies reporting and analysis.
(ix). Cost Effective: A data warehouse that is based upon enterprise-wide data
requirements provides a cost effective means of establishing both data standardization and
operational system interoperability. This typically offers significant savings
(ii). Data warehouses are not the optimal or most favourable environment for unstructured
data.
(iii). Data warehouses have high costs. A data warehouse is usually not static. Maintenance
costs are always on the high side.
(iv). Data warehouse can get outdated relatively quickly and there is a cost of delivering
suboptimal information to the organization.
(v). Because there is often a fine line between data warehouse and operational system,
duplicate and expensive functionality may be developed. Or, functionality may be developed
in the data warehouse that in retrospect should have been developed in the operational
systems and vice versa.
1. Summarized Data
Highly summarized data are primarily for enterprise executives. It can come from either the
lightly summarized data used by enterprise elements or from current detail. Data volume at
this level is much less than other levels and represents a diverse collection supporting a wide
variety of needs and interests. In addition to access to highly summarized data, executives
also have the capability of accessing increasing levels of detail through a drill down
process. (Alexis L. et al, 1999).
2. Current Detail
Current detail is the heart of a data warehouse, where the bulk of the data resides; it comes directly from operational systems and may be stored as raw data or as aggregations of raw data. Current detail that is organized by subject area represents the entire enterprise rather
than a given application. Current detail is the lowest level of data granularity in the data
warehouse. Every data entity in current detail is a snapshot, at a moment in time, representing
the instance when the data are accurate. Current detail is typically two to five years old and
its refreshment occurs as frequently as necessary to support enterprise requirements (Alexis
L. et al, 1999).
3. System of Record
A system of record is the source of the data that feeds the data warehouse. The data in the
data warehouse is different from operational systems data in the sense that they can only be
read and not modified. Thus, it is very necessary that a data warehouse be populated with the
highest quality data available that is most timely, complete, and accurate and has the best
structural conformance to the data warehouse. Often these data are closest to the source of entry into production. In other cases, a system of record may contain already summarized data.
4. Integration and Transformation Programs
As the operational data items pass from their systems of record to a data warehouse,
integration and transformation programs convert them from application-specific data into
enterprise data. These integration and transformation programs perform functions such as:
v Reformatting, recalculating, or modifying key structures.
v Adding time elements.
v Identifying default values
v Supplying logic to choose between multiple data sources
v Summarizing, tallying and merging data from multiple sources.
Whenever either the operational or data warehousing environment changes, integration and
transformation programs are modified to reflect that change.
5. Archives
The data warehouse archives contain old data, normally over two years old, that is of significant value and continuing interest to the enterprise. There are usually large amounts of data stored in the data warehouse archives, with a low incidence of access. Archive data are most often used for forecasting and trend analysis. Although archive data may be stored with the same level of granularity as current detail, it is more likely that archive data are aggregated as they are archived. Archives include not only old data in raw or summarized form; they also include the metadata that describes the old data's characteristics (Alexis L. et al, 1999).
6. Metadata
Metadata provides the data warehouse's data repository. It provides both a technical and a business view of the data stored in the data warehouse, and it lays out the physical structure, which includes:
v Data elements and their types
v Business definitions for the data elements
v How to update data and at what frequency
v Different data elements
v Valid values for each data element
Meta data plays a very significant role in the definition, building, management and
maintenance of data warehouses. In a data warehouse, metadata are categorized into two types, namely:
v Business Metadata
v Technical Metadata
Business metadata describes what is in the warehouse and its meaning in business terms. The business metadata lies above the technical metadata, adding some detail to the extracted material. This type of metadata is important as it helps business users and increases accessibility. Technical metadata describes the data elements as they exist in the warehouse. This type of metadata is used for data modeling initially, and once the warehouse is built it is frequently used by the warehouse administrator and by software tools (Alexis L. et al, 1999).
3.6 Structure of a Data Warehouse
The structure of a data warehouse is shown in figure 1.3 and consists of the following:
v Physical Data Warehouse: This is the physical database in which all the data for the
data warehouse is stored, along with metadata and processing logic for scrubbing,
organizing, packaging and processing the detail data.
v Logical Data Warehouse: It also contains metadata, enterprise rules and processing
logic for scrubbing, organizing, packaging and processing the data, but does not
contain actual data. Instead, it contains the information necessary to access the data
wherever they reside. This structure is effective only when there is a single source for
the data and they are known to be accurate and timely (Alexis L. et al, 1999).
v Data Mart : This is a data structure that is optimized for access. It is designed to
facilitate end-user analysis of data. It typically supports a single analytical application used by a distinct set of workers. Also, a data mart can be described as a
subset of an enterprise-wide data warehouse which typically supports an enterprise
element (e.g. department, region, function). As part of an iterative data warehouse
development process, an enterprise builds a series of physical (or logical) data marts
over time and links them via an enterprise-wide logical data warehouse or feeds them
from a single physical warehouse
The Normalized Approach: In this approach, the data in the data warehouse are stored following, to a degree, database normalization rules. Tables are grouped together by subject areas that reflect general data categories, e.g. data on customers, products and finances.
2. Knowledge Workers
A relatively small number of analysts perform the bulk of new queries and analysis against
the data warehouse. These are the users who get the designer or analyst versions of user
access tools. They figure out how to quantify a subject area. After a few iterations, their
queries and reports typically get published for the benefit of the information consumers.
Knowledge workers are often deeply engaged with the data warehouse design and place the greatest demands on the ongoing data warehouse operations team for training and support.
3. Information Consumers
Most users of the data warehouse are information consumers; they will probably never compose a true ad-hoc query. They use static or simple interactive reports that others
have developed. It is easy to forget about these users, because they usually interact with the
data warehouse only through the work product of others. Do not neglect these users. This
group includes a large number of people, and published reports are highly visible. Set up a
great communication infrastructure for distributing information widely, and gather feedback
from these users to improve the information sites over time.
4. Executives
Executives are a special case of the information consumer group. Few executives actually issue their own queries, but an executive's slightest thought can generate an outbreak of
activity among the other types of users. An intelligent data warehouse designer/implementer
or owner will develop a very cool digital dashboard for executives, assuming it is easy and
economical to do so. Usually this should follow other data warehouse work, but it never hurts
to impress the bosses.
Reporting tools and custom applications often access the database directly. Statisticians
extract data for use by special analytical tools. Analysts may write complex queries to extract
and compile specific information not readily accessible through existing tools. Information
consumers do not interact directly with the relational database but may receive e-mail reports
or access web pages that expose data from the relational database. Executives use standard
reports or ask others to create specialized reports for them. When using the Analysis Services tools in SQL Server 2000, statisticians will often perform data mining, analysts will write MDX queries against OLAP cubes and use data mining, and information consumers will use interactive reports designed by others.
4.0 Conclusion
Therefore, a data warehouse usually contains historical data derived from transaction data
and may include data from other sources. Also, it separates analysis workload from
transaction workload and enables an organization to consolidate data from several sources.
5.0 Summary
In this unit we have learnt that:
v A data warehouse is a data structure that is optimized for collecting and storing
integrated sets of historical data from multiple operational systems and feeds them to
one or more data marts.
v The characteristics of a data warehouse, these include subject oriented, integrated,
non-volatile and time variant
v The major components of a data warehouse, these include summarized data, current
detail, system of record, integration/transformation programs, metadata and archives
v The structure of a data warehouse consists of the physical data warehouse, logical
data warehouse and data mart.
v The data warehouse users can be divided into four categories, namely statisticians, knowledge workers, information consumers and executives, and the data warehouse has a good number of application areas.
Leon, A. and Leon, M. (1999), Fundamentals of Information Technology . New Delhi: Leon
Press Channel and Vikas Publishing House P
Jayaprakash, P, Joseph Z., Berkin O. and Alok C., Accelerating Data mining Workloads:
Current Approaches and Future Challenges in System Architecture Design
Dave Browning and Joy Mundy (Dec. 2001), Data Warehouse Design Considerations. Retrieved on 13/10/2009 from http://msdn.microsoft.com/en-us/library/aa902672(SQL.80).aspx.
1.0 Introduction
The term architecture in the context of an organization's data warehousing effort refers to a
conceptualization of how the data warehouse is built. There is no right or wrong architecture;
rather multiple architectures exist to support various environments and situations. The
worthiness of the architecture can be judged on how the conceptualization aids in the
building, maintenance and usage of the data warehouse. This unit examines the meaning of
data warehouse architecture, its evolution, components and differentiates between extraction,
transformation and load. Also explored is the relevance of resource and data management in data warehouse architecture.
2.0 Objectives
At the end of this unit you should be able to:
v Understand the term data warehouse architecture
v Know the three types of data warehouse architecture
v Describe the components of data warehouse architecture
v Understand the use of extraction, transformation and load tools
v Describe what is meant by resource management
agencies, and in the long run allow for faster development, reuse and consistent data between
warehouse projects.
In figure 2.1, the metadata and raw data of a traditional OLTP system are present, as are additional types of data and summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales. A summary in Oracle is called a materialized view.
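Oracle's materialized-view DDL is not reproduced here; instead, the following pandas sketch illustrates the same idea, namely pre-computing a monthly sales summary once so that a query such as "August sales" is answered from the summary rather than by re-scanning the detail rows. The table and column names are hypothetical.

```python
import pandas as pd

sales = pd.DataFrame({
    "month": ["2009-07", "2009-08", "2009-08", "2009-09"],
    "amount": [120.0, 75.0, 210.0, 90.0],
})

# Pre-computed summary, analogous in spirit to a materialized view.
monthly_summary = sales.groupby("month", as_index=False)["amount"].sum()

# A typical warehouse query served from the pre-computed summary.
print(monthly_summary[monthly_summary["month"] == "2009-08"])
```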
3.2.3 Data Warehouse Architecture (With a Staging Area and Data Marts)
Although the architecture in figure 2.2 is quite common, you may want to customize your warehouse's architecture for different groups within your organization. This can be done by adding data marts, which are systems designed for a particular line of business. Figure 2.3
shows an example where purchasing, sales and inventories are separated. In this example, a
financial analyst might want to analyze historical data for purchases and sales.
Figure 2.3 Architecture of a Data Warehouse with a Staging Area and Data Marts
Source: Oracle9i Data Warehousing Guide Release 2 (9.2)
Data staging is a major process that includes the following sub-procedures (a minimal sketch of these steps appears after the list):
v Extraction: The extract step is the first step of getting data into the data warehouse environment. Extracting means reading and understanding the source data, and copying the parts that are needed to the data staging area for further work.
v Transformation: Once the data is extracted into the data staging area, there are many transformation steps, including:
- Cleaning the data by correcting misspellings, resolving domain conflicts, dealing with missing data elements, and parsing into standard formats
- Purging selected fields from the legacy data that are not useful for the data warehouse
- Combining data sources by matching exactly on key values or by performing fuzzy matches on non-key attributes
- Creating surrogate keys for each dimension record in order to avoid dependency on legacy-defined keys, where the surrogate key generation process enforces referential integrity between the dimension tables and fact tables
- Building aggregates for boosting the performance of common queries
v Loading and Indexing: At the end of the transformation process, the data is in the form of load record images. Loading in the data warehouse environment usually takes the form of replicating the dimension tables and fact tables and presenting these tables to the bulk loading facilities of the recipient data mart. Bulk loading is a very important capability, to be contrasted with record-at-a-time loading, which is far slower. The target data mart must then index the newly arrived data for query performance.
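The following Python sketch covers extract, transform and load in miniature with pandas; the in-memory source data, the column names and the SQLite target stand in for real source systems and a real data mart, and are assumptions for illustration only.

```python
import sqlite3
import pandas as pd

# Extract: read the parts of the source data that are needed
orders = pd.DataFrame({                      # stands in for pd.read_csv("orders.csv")
    "cust_name": ["ACME Ltd ", "acme ltd", "Beta Co"],
    "amount": [100.0, None, 250.0],
})

# Transform: clean, standardize and add a surrogate key
orders["cust_name"] = orders["cust_name"].str.strip().str.upper()   # resolve naming conflicts
orders["amount"] = orders["amount"].fillna(0.0)                      # handle missing elements
dim_customer = orders[["cust_name"]].drop_duplicates().reset_index(drop=True)
dim_customer["customer_key"] = dim_customer.index + 1                # surrogate key
fact_orders = orders.merge(dim_customer, on="cust_name")

# Load: bulk-load the dimension and fact tables into the target data mart
with sqlite3.connect("datamart.db") as con:
    dim_customer.to_sql("dim_customer", con, if_exists="replace", index=False)
    fact_orders.to_sql("fact_orders", con, if_exists="replace", index=False)
```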
4. Data Marts
A data mart is a logical subset of an enterprise-wide data warehouse; the easiest way to view a data mart theoretically is as an extension of the data warehouse. Data is
integrated as it enters the data warehouse from multiple legacy sources. Data marts then
derive their data from the central data warehouse source. The theory is that no matter how
many data marts are created, all the data are drawn from the one and only one version of the
truth, which is the data contained in the warehouse.
Distribution of the data from the warehouse to the mart provides the opportunity to build new
summaries to fit a particular department s needs. The data marts contain subject specific
information supporting the requirements of the end users in individual business units. Data
marts can provide rapid responses to end-user requests if most queries are directed to pre-computed and aggregated data stored in the data mart.
meets the needs of the data warehouse and does not adversely impact the source systems that
store the original data.
Extraction
Extraction is a means of replicating data through a process of selection from one or more source databases. Extraction may or may not employ some form of transformation. Data extraction can be accomplished through custom-developed programs, but the preferred method uses vendor-supported tools that address the data extraction and transformation needs, as well as an enterprise metadata repository that documents the business rules used to determine what data was extracted from the source systems.
Transformation
Data is transformed from transaction level data into information through several techniques:
filtering, summarizing, merging, transposing, converting and deriving new values through
mathematical and logical formulas. These all operate on one or more discrete data fields to
produce a target result having more meaning from a decision support perspective than the
source data. This process requires understanding the business focus, the information needs
and the currently available sources. Issues of data standards, domains and business terms
arise when integrating across operational databases.
Data Cleansing
Cleansing data is based on the principle of populating the data warehouse with quality data, that is, consistent data which is of a known, recognized value and conforms to the business definition as expressed by the user. The cleansing operation is focused on determining those values which violate these rules and either rejecting them or bringing the data into conformance through a transformation process. Data cleansing standardizes data according to specifically defined rules, eliminates redundancy to increase data query accuracy, reduces the cost associated with inaccurate, incomplete and redundant data, and reduces the risk of invalid decisions made against incorrect data.
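A small Python sketch of such rule-based cleansing follows: records are checked against simple business rules, values that can be repaired are standardized, and the rest are rejected. The rules and the sample records are illustrative assumptions.

```python
import pandas as pd

records = pd.DataFrame({
    "state": ["NY", "ny", "N.Y.", "ZZ"],
    "age": [34, -2, 51, 46],
})

valid_states = {"NY", "CA", "TX"}                 # the recognized domain of values
standardize = {"ny": "NY", "N.Y.": "NY"}          # bring known variants into conformance

records["state"] = records["state"].replace(standardize)
violations = (~records["state"].isin(valid_states)) | (records["age"] < 0)

clean = records[~violations]
rejected = records[violations]                    # routed back for correction
print(clean, rejected, sep="\n\n")
```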
v Text and Numeric Fields: Data fields comprise rows of information containing discrete values related to some business entity. Current operational databases are almost completely text and numeric data fields. Since these are discrete values, they can be individually retrieved, queried and manipulated to support some activity, reporting need or analysis. These data types will continue to play a significant role in
all our databases.
v Geographic Data: Geographic data is information about features on the surface and subsurface of the earth, including their location, shape, description and condition. Geographic information includes spatial and descriptive tabular information in tabular and raster (image) formats. A geographic information system (GIS) is a hardware and software environment that captures, stores, analyzes, queries and displays geographic information. Usually, geographic information is the basis for location-based decision making, land-use planning, emergency response and mapping purposes.
v Graphics, Animation and Video: Graphics, animation and video likewise offer an alternative way to inform users where simple text does not easily communicate the complexity of, or the relationships between, information components. An example might be graphic displays of vessels and equipment allowing drill-down to more detailed information related to the part or component. Video may be useful in demonstrating complex operations as part of a training program.
v Object: Objects are composites of other data types and other objects. Objects form a
hierarchy of information unlike the relational models. Objects contain facts about
themselves and exhibit certain behaviours implemented as procedural code. They also
inherit the facts and behaviours of their parent objects up through the hierarchy. Relational databases, by contrast, store everything in rows and columns, although they may support large binary object (LOB) fields that can hold anything an object database can support.
2. Databases
A database is a collection of data organized to service many applications with minimum
redundancy. Databases organize data and information into physical structures, which are then
accessed and updated through the services of a database management system. Some of the
common terms associated with database are:
5. Data Access
Data access middleware is the layer of communication between the data access level and the database. The following components are essential for the data access middleware layer for
accessing a relational database in an N-tier application environment:
v Structured Query Language (SQL): A query language is used to query and retrieve
data from relational databases. The industry standard for SQL is ANSI standard SQL.
RDBMS vendors implement SQL drivers to enable access to their proprietary
databases. Vendors may add extensions to the SQL language for their proprietary
databases.
6. Processing Access
Access to data can be categorized into two major groups:
v On-line Analytical Processing (OLAP): This is decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views. An OLAP application is a system designed for few but complex (read-only) requests. Traditional OLAP products, also known as Multidimensional OLAP or MOLAP, summarize transactions into multidimensional views ahead of time. User queries on these types of databases are extremely fast because the consolidation has already been done. OLAP places the data into a cube structure that can be rotated by the user, which is particularly suited to financial summaries (a simple cube-style summary is sketched below).
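The pandas sketch below is only a rough illustration of the MOLAP idea: transactions are consolidated ahead of time into a pivot-table "cube" that can be queried and rotated without touching the detail rows. The dimensions (region, month) and the measure (sales) are hypothetical.

```python
import pandas as pd

detail = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "month": ["Aug", "Sep", "Aug", "Sep"],
    "sales": [100, 140, 90, 130],
})

cube = detail.pivot_table(values="sales", index="region",
                          columns="month", aggfunc="sum")   # pre-summarized view
print(cube)       # regions as rows, months as columns
print(cube.T)     # "rotated" view: months as rows, regions as columns
```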
7. Replication
Replication is used to keep distributed databases up to date with a central source database. Replication uses a database that has been identified as the central source and reproduces the data to distributed target databases. As more and more data is made available to the public over the internet, replication of selected data to locations outside the firewall is becoming more common. Replicated data should be accessed by applications in a read-only mode; if updates were allowed on replicated data, the data would quickly become corrupted and out of sync. Updates should be directed to the database access tier in charge of updating the authoritative source, rather than to a replicated database. Replication services are available from most relational database vendors for their particular products.
v Replication Services: Replication is the process of distributing information across a
network of computers. Replication strategies may also employ some form of
transformation such that the information has different content and meaning. When
information has low volatility, replication may be a valid strategy for optimizing
performance. Replication will need to be evaluated against our network, our volume
of activity and local access requirements.
v Partial and Full Refresh: A full refresh simply replaces the existing target with a
new copy of the source database. It is simple to implement, but may not be practical
for large databases due to the amount of time involved in dumping and reloading the
data.
A partial refresh replicates only the changes made on the source database to the
remote databases. The processing involved in replicating only the changes is more
complex than a full refresh, but it is an optimal solution for a large database. In a
partial refresh method, either data or transactions can drive the replication, e.g.
sending the exact data that was changed on the central database. (A rough SQL sketch
of both approaches follows this list.)
v Mirroring: Mirroring provides two images of the same database and allows the two
databases to be synchronized simultaneously. That is, an update to one causes the
mirror to also be updated. This form of replication is the most accurate, but also,
potentially the most difficult to achieve and the most costly to operate (Data
Management and Data warehouse, 2002).
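As a rough sketch of the full and partial refresh strategies described above, and assuming a hypothetical source_customer/target_customer table pair in which the source carries a last_modified timestamp column, the two approaches might look like this:

-- Full refresh: discard the target copy and reload it completely from the source
DELETE FROM target_customer;
INSERT INTO target_customer
SELECT * FROM source_customer;

-- Partial (data-driven) refresh: replicate only rows changed since the last refresh
DELETE FROM target_customer
WHERE customer_id IN (SELECT customer_id
                      FROM source_customer
                      WHERE last_modified > '2009-10-01');

INSERT INTO target_customer
SELECT * FROM source_customer
WHERE last_modified > '2009-10-01';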
Security
Security becomes an increasingly important aspect as access to data and information expands
and takes on new forms such as Web pages and dynamic content. Our security policy needs
to be examined to ensure that it can be enforced given the move to distributed data and
internet access.
Administration
Administration encompasses the creation, maintenance, support, backup and recovery, and
archival processes required in managing a database. There is a need to be able to centrally
manage all of the enterprise databases to ensure consistency and availability. Distributing
data to appropriate platforms will place more importance on administration and control; this
becomes the key to maintaining the overall data architecture. Currently, database
administration is done using the tools and services native to and provided by most relational
database vendors for their particular products. Investing in centrally managed administration
products and resources will improve data quality, availability and reliability.
4.0 Conclusion
Therefore, a data warehouse always has an architecture, which can be either ad hoc or
planned, implied or documented. Without an architecture, many warehouse subject areas do
not fit together, connections lead to nowhere, and the whole warehouse becomes difficult to
manage and change.
5.0 Summary
In this unit we have learnt that:
v The term architecture, in the context of an organization's data warehousing effort, is a
conceptualization of how the data warehouse is built.
v There are three common architectures for data warehouse design: data warehouse
architecture (basic), data warehouse architecture (with a staging area) and data
warehouse architecture (with a staging area and data marts).
v Data warehouse architecture consists of seven major components namely: Operational
source systems, data staging area, data warehouse, data marts, extract- transform-load
tool, business intelligence and metadata/metadata repository.
v Data extraction transformation load (ETL) tools are used to extract data from data
sources, cleanse the data, perform data transformations, and load the target data
warehouse.
v Resource management provides a common view for data, including definitions,
stewardship, distribution and currency, and allows those charged with ensuring
operational integrity and availability the tools necessary to do so.
Surajit, C. and Umeshwar, D., An Overview of Data Warehousing and OLAP Technology.
Anil Rai, Data Warehouse and its Applications in Agriculture, Indian Agricultural Statistics
Research Institute, Library Avenue, New Delhi-110 012.
Dave Browning and Joy Mundy (Dec., 2001). Data Warehouse Design Considerations.
Retrieved on 13/10/2009. Available online: http://msdn.microsoft.com/en-
us/library/aa902672(SQL.80).aspx.
Data Management and Data Warehouse Domain Technical Architecture, June 6, 2002.
Leon, A. and Leon, M. (1999), Fundamentals of Information Technology. New Delhi: Leon
Press Channel and Vikas Publishing House Pvt Ltd.
1.0 Introduction
Data warehouses support business decisions by collecting, consolidating and organizing data
for reporting and analysis with tools such as on-line analytical processing (OLAP) and data
mining. Though data warehouses are built on relational database technology, the design
of a data warehouse database differs substantially from the design of an online transaction
processing (OLTP) database.
This unit examines the approaches and choices to be considered when designing and
implementing a data warehouse. Also discussed are the different strategies for testing a data
warehouse application.
2.0 Objectives
At the end of this unit, you should be able to:
v Differentiate between a logical and physical design
v Understand the basic methodologies used in building a data warehouse
v Explain the phases involved in developing a data warehouse
v Know the data warehouse testing life cycle
Logical design involves describing the purpose of a system and what the system will do, as
opposed to how it is actually going to be implemented physically. It does not include any
specific hardware or software requirements. Logical design also lays out the system's
components and their relationships to one another as they would appear to users. Physical
design is the process of translating the abstract logical model into the specific technical
design for the new system. It is the actual nuts and bolts of the system, as it includes the
technical specifications that transform the abstract logical design plan into a functioning
system.
The top-down design methodology generates highly consistent dimensional views of data
across data marts since all data marts are loaded from a centralized repository. Top-down
design has also proven to be robust against business changes. However, the top-down
methodology can be inflexible and unresponsive to changing departmental needs during the
implementation phases.
The issues to discuss are the users' objectives and challenges, and how they go about making
business decisions. Business users should be closely tied to the design team during the logical
design process; they are the people who understand the meaning of existing data. Many
successful projects include several business users on the design team to act as data experts
and sounding boards for design concepts. Whatever the structure of the team, it is important
that business users feel ownership of the resulting system.
Interview the data experts after interviewing several users, and find out from the experts what
data exists and where it resides, but only after understanding the basic business needs of the
end users. Information about available data is needed early in the process, before
completing the analysis of the business needs, but the physical design of existing data should
not be allowed to have much influence on discussions about business needs. It is very
important to communicate with users often and thoroughly so that everyone participates in
the progress of the requirements definition.
A typical dimensional model uses a star or snowflake design that is easy to understand and
relate to business needs, supports simplified queries, and provides superior query
performance by minimizing table joins.
Star Schemas
A star schema is the simplest data warehouse schema. It is called a star schema because the
diagram resembles a star, with points radiating from a center. The center of the star consists
of one or more fact tables and the points of the star are the dimension tables as shown in
figure 3.1.
Snowflake Schemas
A schema is called a snowflake schema if one or more dimension tables do not join directly
to the fact table but must join through other dimension tables. For example, a dimension that
describes products may be separated into three tables (snowflaked) as illustrated in figure
3.2.
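Continuing the same hypothetical example, a snowflaked product dimension might be normalized as follows, so that category and supplier information is reached only through the product dimension.

-- Hypothetical snowflaked product dimension: category and supplier are split out
CREATE TABLE dim_category (category_key INT PRIMARY KEY, category_name VARCHAR(30));
CREATE TABLE dim_supplier (supplier_key INT PRIMARY KEY, supplier_name VARCHAR(50));

CREATE TABLE dim_product_sf (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(50),
    category_key INT REFERENCES dim_category(category_key),
    supplier_key INT REFERENCES dim_supplier(supplier_key)
);
-- A query on category must now join fact_sales -> dim_product_sf -> dim_category.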
Fact tables represent data, usually numeric and additive, that can be analyzed and examined.
Examples include sales, cost and profit.
A fact table basically has two types of columns: those containing numeric facts (often called
measurements), and those that are foreign keys to dimension tables. A fact table that contains
aggregated facts is often called a summary table. A fact table usually contains facts with the
same level of aggregation. Though most facts are additive, they can also be semi-additive
or non-additive. Additive facts can be aggregated by simple arithmetical addition; an example
of this is sales. Non-additive facts cannot be added at all; an example is averages. Semi-
additive facts can be aggregated along some of the dimensions and not along others. An
example of this is inventory levels, which can be summed across products or stores at a point
in time, but where summing a level across time periods is meaningless.
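Using the hypothetical fact_sales table sketched earlier, plus an assumed fact_inventory table, the difference between an additive and a semi-additive fact can be illustrated as follows.

-- Additive fact: sales_amount can be summed across date, product and store
SELECT store_key, SUM(sales_amount) AS total_sales
FROM fact_sales
GROUP BY store_key;

-- Semi-additive fact: an inventory level sums across stores for ONE date,
-- but summing it across dates would be meaningless
SELECT date_key, SUM(on_hand_qty) AS total_on_hand
FROM fact_inventory          -- hypothetical inventory fact table
WHERE date_key = 20091013    -- a single point in time
GROUP BY date_key;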
Dimension data is typically collected at the lowest level of detail and aggregated into higher
level totals that are more useful for analysis. These natural rollups or aggregations within a
dimension table are called hierarchies.
Hierarchies
These are logical structures that use ordered levels as a means of organizing data. A
hierarchy can be used to define data aggregation. For example, in a time dimension, a
hierarchy might aggregate data from the month level to the quarter level to the year level: (all
time), year, quarter, month, day; or (all time), year, quarter, week, day. Also, a dimension
may contain multiple hierarchies; a time dimension often contains both calendar and fiscal
year hierarchies. Geography is seldom a dimension of its own; it is usually a hierarchy that
imposes a structure on sales points, customers, or other geographically distributed
dimensions. An example of a geography hierarchy for sales points is: (all), country or region,
sales-region, state or province, city, store. A hierarchy can also be used to define a
navigational drill path and to establish a family structure.
Within a hierarchy, each level is logically connected to the levels above and below it. Data
values at lower levels aggregate into the data values at higher levels. A dimension can be
composed of more than one hierarchy. For example, in the product dimension, there might be
two hierarchies, one for product categories and one for product suppliers. Dimension
hierarchies also group levels from general to granular. Query tools use hierarchies to enable
you to drill down into your data to view different levels of granularity, which is one of the
key benefits of a data warehouse. When designing hierarchies, you must consider the
relationships in business structures, for example, a multilevel sales organization.
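As a rough illustration of a hierarchy driving aggregation, and assuming the hypothetical dim_date table sketched earlier, a ROLLUP query produces the month, quarter and year totals of the time hierarchy (older SQL Server releases used the WITH ROLLUP form).

-- Roll sales up the time hierarchy: month totals, quarter totals, year totals, grand total
SELECT d.year_no, d.quarter_no, d.month_name, SUM(f.sales_amount) AS sales
FROM fact_sales AS f
JOIN dim_date   AS d ON d.date_key = f.date_key
GROUP BY ROLLUP (d.year_no, d.quarter_no, d.month_name);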
Levels: A level represents a position in a hierarchy. For example, a time dimension might
have a hierarchy that represents data at the month, quarter, and year levels. Levels range from
general to specific, with the root level as the highest or most general level.
Level Relationships: Level relationships specify the top-to-bottom ordering of levels from the
most general (the root) to the most specific information. They define the parent-child
relationships between the levels in a hierarchy.
For example, a customer dimension hierarchy might run from most general to most specific as: Region, Sub-region, Country-name, Customer.
v Unique Identifiers : Unique identifiers are specified for one distinct record in a
dimension table. Artificial unique identifiers are often used to avoid the potential
problem of unique identifiers changing. Unique identifiers are represented with the #
character, for example, # customer_id.
The data warehouse architecture reflects the dimensional model developed to meet the
business requirements. Dimension design largely determines dimension table design, and fact
definitions determine fact table design.
The date and time dimensions are created and maintained in the data warehouse independently
of the other dimension tables or fact tables; updating the date and time dimensions may involve
only a simple annual task to mechanically add the records for the next year. The
dimensional model also lends itself to easy expansion. New dimension attributes and new
dimensions can be added, usually without affecting existing schemas other than by extension.
Existing historical data should remain unchanged. Data warehouse maintenance applications
will need to be extended, but well-designed user applications should still function, though
some may need to be updated to make use of the new information.
The composite primary key in the fact table is an expensive key to maintain:
v The index alone is almost as large as the fact table.
v The index on the primary key is often created as a clustered index.
In many scenarios a clustered primary key provides excellent query performance. However,
all other indexes on the fact table use the large clustered index key. All indexes on the table
will be large, the system will require significant additional storage space, and query
performance may degrade.
For these reasons, many star schemas are defined with an integer surrogate primary key, or no
primary key at all. It is therefore recommended that the fact table be defined using the
composite primary key. Also, create an IDENTITY column in the fact table that could be
used as a unique clustered index, should the database administrator determine that this
structure would provide better performance.
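A minimal sketch of this recommendation, reusing the hypothetical fact_sales table from earlier (IDENTITY and the CLUSTERED/NONCLUSTERED keywords are SQL Server syntax):

-- Alternative definition of the hypothetical fact table: composite primary key on the
-- dimension foreign keys, plus an IDENTITY column available as a unique clustered index
CREATE TABLE fact_sales (
    sales_row_id INT IDENTITY(1,1) NOT NULL,   -- surrogate, optionally clustered
    date_key     INT NOT NULL,
    product_key  INT NOT NULL,
    store_key    INT NOT NULL,
    sales_amount DECIMAL(18,2),
    quantity     INT,
    CONSTRAINT pk_fact_sales PRIMARY KEY NONCLUSTERED (date_key, product_key, store_key)
);

CREATE UNIQUE CLUSTERED INDEX ix_fact_sales_row ON fact_sales (sales_row_id);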
Indexes
Dimension tables must be indexed on their primary keys, which are the surrogate keys
created for the data warehouse tables. The fact table must have a unique index on the primary
key. There are scenarios where the primary key index should be clustered and other scenarios
where it should not. The larger the number of dimensions in the schema, the less beneficial it
is to cluster the primary key index. With a large number of dimensions, it is usually more
effective to create a unique clustered index on a meaningless IDENTITY column.
Elaborate initial design and development of index plans for end-user queries is not
necessary with SQL Server 2000, which has sophisticated indexing techniques and an easy-to-use
Index Tuning Wizard tool to tune indexes to the query workload. The SQL Server 2000 Index
Tuning Wizard allows you to select and create an optimal set of indexes and statistics for a
database without requiring an expert understanding of the structure of the database, the
workload, or the internals of SQL Server. The wizard analyzes a query workload captured in a
SQL Profiler trace or provided by a SQL script, and recommends an index configuration to
improve the performance of the database.
The Index Tuning Wizard provides the following features and functionality:
v It can use the query optimizer to analyze the queries in the provided workload and
recommend the best combination of indexes to support the query mix in the workload.
v It analyzes the effects of the proposed changes, including index usage, distribution of
queries among tables, and performance of queries in the workload.
v It can recommend ways to tune the database for a small set of problem queries.
v It allows you to customize its recommendations by specifying advanced options, such
as disk space constraints.
Views
Views should be created for users who need direct access to data in the warehouse relational
database. Users can be granted access to views without having access to the underlying data.
Indexed views can be used to improve the performance of user queries that access data through
views. View definitions should use column and table names that make sense to
business users. If Analysis Services will be the primary query engine to the data warehouse, it
will be easier to create clear and consistent cubes from views with readable column names.
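For illustration, a business-friendly view over the hypothetical star schema sketched earlier might look like the following; the bracketed column aliases are SQL Server syntax and all names are assumptions.

-- A readable view for business users; access can be granted on the view
-- without granting access to the underlying fact and dimension tables
CREATE VIEW v_sales_by_store_month AS
SELECT
    s.store_name        AS [Store],
    d.year_no           AS [Year],
    d.month_name        AS [Month],
    SUM(f.sales_amount) AS [Total Sales]
FROM fact_sales AS f
JOIN dim_store  AS s ON s.store_key = f.store_key
JOIN dim_date   AS d ON d.date_key  = f.date_key
GROUP BY s.store_name, d.year_no, d.month_name;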
OLAP cube design requirements will be a natural outcome of the dimensional model if the
data warehouse is designed to support the way users want to query data. In a
multidimensional database, a dimensional model is a cube; it holds data more like a 3-D
spreadsheet than a traditional relational database. A cube allows different views of the
data to be quickly displayed. The ability to quickly switch between one slice of data and
another allows users to analyze their information in smaller, meaningful chunks, at the speed
of thought. The use of cubes allows the user to look at data in several dimensions, for example
attendance by agency, by attendance code and by date.
A classic business case for an operational data store (ODS) is to support the customer call center;
call center operators have little need for broad analytical queries that reveal trends in
customer behaviour. Rather, their needs are more immediate: the operator should have up-to-date
information about all transactions that involve the complaining customer. This data may
come from multiple source systems, but should be presented to the call center operator in a
simplified and consolidated way.
The implementation of the ODS varies widely depending on business requirements. There are
no strict rules for how the ODS must be implemented. A successful ODS for one business
problem may be a replicated mirror of the transaction system; for another business problem a
star schema will be most effective. Most operational data stores fall between these two extremes
and include some level of transformation and integration of data. It is possible to architect the
ODS so that it serves its primary operational need and also functions as a source for
the data warehouse staging process.
The applications that support data analysis by the data warehouse users are constructed in this
phase of data warehouse development. OLAP cubes and data mining models are constructed
using Analysis Services tools, and client access to analysis data is supported by the Analysis
Server. The technique for cube design is covered in Module 3, Unit 2.
Other analysis applications, such as Microsoft PivotTables, predefined reports, web sites and
digital dashboards, are also developed in this phase, as are natural language applications using
English Query. Specialized third-party analysis tools may also be required and implemented or
installed. Details of these specialized applications are determined directly by user needs.
Testing a data warehouse application should be done with a sense of utmost responsibility. A
bug in a data warehouse traced at a later stage results in unpredictable losses, so the tester
must go the extra mile to ensure a near defect-free solution.
4.0 Conclusion
Therefore, data design is the key to data warehousing. The business users know what data
they need and how they want to use it. In designing a data warehouse, there is a need to focus
on the users, determine what data is needed, locate sources of the data and organize the data in a
dimensional model that represents the business needs.
5.0 Summary
In this unit we have learnt that:
v Logical design involves describing the purpose of a system and what the system will
do, as opposed to how it is actually going to be implemented physically, while physical
design is the process of translating the abstract logical model into the specific
technical design for the new system.
v There are three basic methodologies used in building a data warehouse; these include
bottom-up design, top-down design and hybrid design.
v The process of developing a data warehouse is made up of series of stages which are:
Identify and gather requirements, design the dimensional model, develop the
architecture, design the relational database and OLAP cubes, develop the maintenance
applications, develop analysis applications, test and deploy the system.
v The implementation of data warehouse undergoes the natural cycle of unit testing,
system testing, regression testing, integration testing and acceptance testing.
Usama, F., Gregory, P. and Padhraic, S. (1996), From Data Mining to Knowledge Discovery in
Databases, American Association for Artificial Intelligence Press.
Leon, A. and Leon, M. (1999), Fundamentals of Information Technology. New Delhi: Leon
Press Channel and Vikas Publishing House Pvt Ltd.
Hans-Peter, K., Karsten, M.B., Peer, K., Alexey, P., Matthias, S. and Arthur, Z. (March 2007),
Future Trends in Data Mining. Springer Science + Business Media, 23 March 2007.
Jayaprakash, P., Joseph, Z., Berkin, O. and Alok, C., Accelerating Data Mining Workloads:
Current Approaches and Future Challenges in System Architecture Design.
1.0 Introduction
Data warehousing and on-line analytical processing (OLAP) are essential elements of
decision support, which has become a focus of the database industry. Many commercial
products and services are now available, and all of the principal database management system
vendors now offer products in these areas. Decision support places some rather different
requirements on database technology compared to traditional on-line transaction processing
applications. This unit examines the differences between OLAP and the data warehouse,
the types of OLAP servers and the uses of OLAP.
2.0 Objectives
At the end of this unit, you should be able to:
v Understand the meaning of OLAP
The term OLAP was coined in 1993 by E. F. (Ted) Codd, who is referred to as the father of the
relational database, to describe a type of application that allows users to interactively analyze
data. An OLAP system is often contrasted with an On-Line Transaction Processing (OLTP)
system that focuses on processing transactions such as orders, invoices or general ledger
transactions. Before the term OLAP was coined, these systems were often referred to as
Decision Support Systems (DSS).
OLAP enables analysts, managers and business executives to gain insight into data through
fast, consistent and interactive access to a wide variety of possible views of information.
Also, OLAP transforms raw data so that it reflects the real dimensionality of the enterprise as
understood by the user. In addition, OLAP systems have the ability to answer "what if?" and
"why?" questions, which sets them apart from data warehouses. OLAP enables decision
making about future actions. A typical OLAP calculation is more complex than simply
summing data.
OLAP and data warehouses are complementary. A data warehouse stores and manages data;
OLAP transforms data warehouse data into strategic information. OLAP ranges from basic
navigation and browsing (often referred to as "slice and dice"), to calculations, to more
serious analyses such as time series and complex modeling. As decision-makers exercise
more advanced OLAP capabilities, they move from data access to information and on to
knowledge.
(ii). Another convenience of OLAP is that it allows the manager to pull data from the OLAP
database in specific or broad terms. In layman's terms, the report can be as simple as
comparing two columns or as complex as analyzing a huge amount of data. Moreover, it
helps to reveal relationships that were previously overlooked.
(iii). OLAP helps to reduce the applications backlog still further by making business users
self-sufficient enough to build their own models. Unlike standalone departmental applications
running on PC networks, OLAP applications are dependent on data warehouses and
transaction processing systems to refresh their source level data. As a result, ICT gains more
self-sufficient users without relinquishing control over the integrity of the data.
(iv). Through the use of software designed for OLAP, ICT realizes more efficient operations
and reduces the query drag and network traffic on transaction systems or the data warehouse.
(v). By providing the ability to model real business problems and to make more efficient use of
people resources, OLAP enables the organization as a whole to respond more quickly to
market demands. Market responsiveness, in turn, often yields improved revenue and
profitability.
data are optimized for rapid ad-hoc information retrieval in any orientation, as well as for
fast, flexible calculation and transformation of raw data based on formulaic relationships. The
OLAP server may either physically stage the processed multidimensional information to
deliver consistent and rapid response times to end users, or it may populate its data
structures in real time from relational or other databases.
1. Relational OLAP (ROLAP) servers: These are intermediate servers that stand
between a relational back-end server and client front-end tools. ROLAP systems work
primarily from the data that resides in a relational database, where the base data and
dimension tables are stored as relational tables. They use a relational or extended-relational
DBMS to store and manage warehouse data, and OLAP middleware to support the missing pieces.
ROLAP servers include optimization for each DBMS back end, implementation of
aggregation navigation logic, and additional tools and services. ROLAP technology tends to
have greater scalability than MOLAP technology. The DSS Server of MicroStrategy and
Metacube of Informix, for example, adopt the ROLAP approach.
One major advantage of ROLAP over the other styles of OLAP analytical tools is that it is
deemed to be more scalable in handling huge amounts of data. ROLAP sits on top of a
relational database, enabling it to leverage the functionality that a relational database already
provides. Another benefit of a ROLAP tool is that it is efficient in managing both
numeric and textual data. It also permits users to drill down to the leaf details, the lowest
level of a hierarchy structure. One disadvantage of ROLAP applications is that they display
slower performance compared to other styles of OLAP tools, since calculations are often
performed inside the database server. Another disadvantage of a ROLAP tool is that, because it
is dependent on SQL for data manipulation, it may not be ideal for
calculations that are not easily translatable into a SQL query.
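As a rough illustration of the ROLAP approach, the middleware might push a calculation such as the following down to the relational engine, again reusing the hypothetical star schema sketched earlier.

-- A ROLAP-style aggregation: the OLAP middleware would generate SQL like this
-- and let the relational database perform the calculation
SELECT
    d.year_no,
    p.category,
    SUM(f.sales_amount)                                 AS total_sales,
    SUM(f.sales_amount) / NULLIF(SUM(f.quantity), 0)    AS avg_unit_price  -- derived measure
FROM fact_sales AS f
JOIN dim_date    AS d ON d.date_key    = f.date_key
JOIN dim_product AS p ON p.product_key = f.product_key
GROUP BY ROLLUP (d.year_no, p.category);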
One of the major distinctions of MOLAP against a ROLAP tool is that data are pre-summarized
and stored in an optimized format in a multidimensional cube, instead of in a
relational database. In this type of model, data are structured into proprietary formats in
accordance with a client's reporting requirements, with the calculations pre-generated on the
cubes. MOLAP analytic tools are capable of performing complex calculations; since
calculations are predefined upon cube creation, this results in the faster return of computed
data. MOLAP systems also provide users with the ability to quickly write back data into a
data set. Moreover, when compared with ROLAP, MOLAP is considerably less heavy on
hardware due to compression techniques. In summary, MOLAP is more optimized for fast
query performance and retrieval of summarized information.
However, there are certain limitations to the implementation of a MOLAP system. One
primary weakness is that a MOLAP tool is less scalable than a ROLAP tool, as the former is
capable of handling only a limited amount of data. The MOLAP approach also introduces data
redundancy. Certain MOLAP products encounter difficulty in updating models with
dimensions of very high cardinality.
Other Types:
There are also less popular types of OLAP systems that one may stumble upon every so often.
Some of the less famous types existing in the OLAP industry are described below.
Some of the most appealing features of this style of OLAP are the considerably lower
investment involved, enhanced accessibility (a user only needs an internet connection and a
web browser to connect to the data) and ease of the installation, configuration and deployment
process. But despite all of its unique features, it still cannot compare to a conventional
client/server arrangement; currently it is inferior to OLAP applications deployed on client
machines in terms of functionality, visual appeal and performance.
facilitate management of both spatial and non-spatial data, as data could come not only in an
alphanumeric form, but also as images and videos. This technology provides easy and quick
exploration of data that resides in a spatial database.
There are other blends of OLAP products, like the less popular DOLAP and ROLAP, which here
stand for Database OLAP and Remote OLAP respectively. LOLAP (Local OLAP) and
RTOLAP (Real-Time OLAP) also exist, but these have barely made a noise in the OLAP
industry.
Academic research into data warehousing technologies will likely focus on automating
aspects of the warehouse, such as data acquisition, data quality management, selection
and construction of appropriate access paths and structures, self-maintainability, functionality
and performance optimization. Incorporating domain and business rules appropriately into
the warehouse creation and maintenance process may make that process more intelligent,
relevant and self-governing.
4.0 Conclusion
Therefore, data warehousing and on-line analytical processing (OLAP) are essential
elements of decision support, which has become a focus of the database industry.
5.0 Summary
In this unit we have learnt that:
v OLAP is a technology that allows users of multidimensional databases to generate on-
line descriptive or comparative summaries of data and other analytical queries.
v A data warehouse is different from OLAP in a number of ways; for example, a data
warehouse stores and manages data while OLAP transforms data warehouse data into
strategic information.
v There are different types of OLAP which are ROLAP, MOLAP and HOLAP. These
three are the big players. Other types of OLAP are WOLAP, DOLAP, Mobile-OLAP
and SOLAP
v OLAP as a data warehouse tool can be used to provide superior performance for
business intelligence queries and to operate efficiently with data organized in
accordance with the common dimensional model used in data warehouse.
v There are a number of open issues in data warehousing that are likely to see increased
research activity in the near future as warehouses and data marts proliferate.
Jayaprakash, P., Joseph, Z., Berkin, O. and Alok, C., Accelerating Data Mining Workloads:
Current Approaches and Future Challenges in System Architecture Design.
Anil Rai, Data Warehouse and its Applications in Agriculture, Indian Agricultural Statistics
Research Institute, Library Avenue, New Delhi-110 012.
Dave Browning and Joy Mundy (Dec., 2001). Data Warehouse Design Considerations.
Retrieved on 13/10/2009. Available online: http://msdn.microsoft.com/en-
us/library/aa902672(SQL.80).aspx.
Data Management and Data Warehouse Domain Technical Architecture, June 6, 2002.
Leon, A. and Leon, M. (1999), Fundamentals of Information Technology. New Delhi: Leon
Press Channel and Vikas Publishing House Pvt Ltd.