1st Chapter Dunham Book PDF

This chapter introduces data mining and its relationship to knowledge discovery in databases (KDD). It discusses basic data mining tasks like classification, clustering, association rule mining and discusses data mining issues, metrics, and social implications. The chapter presents a database perspective on data mining and envisions its future evolution to make techniques more user-friendly and widely applicable.

Uploaded by

abhi
Copyright
© © All Rights Reserved

Preface

Data doubles about every year, but useful information seems to be decreasing. The area
of data mining has arisen over the last decade to address this problem. It has become
not only an important research area, but also one with large potential in the real world.
Current business users of data mining products achieve millions of dollars a year in
savings by using data mining techniques to reduce the cost of day-to-day business
operations. Data mining techniques are proving to be extremely useful in detecting and
predicting terrorism.
The purpose of this book is to introduce the reader to various data mining concepts
and algorithms. The book is concise yet thorough in its coverage of the many
data mining topics. Clearly written algorithms with accompanying pseudocode are used
to describe approaches. A database perspective is used throughout. This means that I
examine algorithms, data structures, data types, and complexity of algorithms and space.
The emphasis is on the use of data mining concepts in real-world applications with large
database components.
Data mining research and practice is in a state similar to that of databases in the
1960s. At that time applications programmers had to create an entire database environment
each time they wrote a program. With the development of the relational data model,
query processing and optimization techniques, transaction management strategies, and ad
hoc query languages (SQL) and interfaces, the current environment is drastically different.
The evolution of data mining techniques may take a similar path over the next few
decades, making data mining techniques easier to use and develop. The objective of this
book is to help in this process.
The intended audience of this book is either the experienced database professional
who wishes to learn more about data mining or graduate level computer science students
who have completed at least an introductory database course. The book is meant to
be used as the basis of a one-semester graduate level course covering the basic data
mining concepts. It may also be used as a reference book for computer professionals and
researchers.

[Figure: organization of the book]
Introduction: Ch1 Introduction; Ch2 Related Concepts; Ch3 Data Mining Techniques
Core Topics: Ch4 Classification; Ch5 Clustering; Ch6 Association Rules
Advanced Topics: Ch7 Web Mining; Ch8 Spatial Mining; Ch9 Temporal Mining
Appendix: Data Mining Products


The book is divided into four major parts: Introduction, Core Topics, Advanced
Topics, and Appendix. The introduction covers background information needed to understand
the later material. In addition, it examines topics related to data mining such as
OLAP, data warehousing, information retrieval, and machine learning. In the first chapter
of the introduction I provide a very cursory overview of data mining and how it relates
to the complete KDD process. The second chapter surveys topics related to data mining.
While this is not crucial to the coverage of data mining and need not be read to
understand later chapters, it provides the interested reader with an understanding and
appreciation of how data mining concepts relate to other areas. To thoroughly understand
and appreciate the data mining algorithms presented in subsequent chapters, it is
important that the reader realize that data mining is not an isolated subject. It has its basis
in many related disciplines that are equally important on their own. The third chapter
in this part surveys some techniques used to implement data mining algorithms. These
include statistical techniques, neural networks, and decision trees. This part of the book
provides the reader with an understanding of the basic data mining concepts. It also
serves as a standalone survey of the entire data mining area.
The Core Topics covered are classification, clustering, and association rules. I view
these as the major data mining functions. Other data mining concepts (such as prediction,
regression, and pattern matching) may be viewed as special cases of these three. In each
of these chapters I concentrate on coverage of the most commonly used algorithms of
each type. Our coverage includes pseudocode for these algorithms, an explanation of
them, and examples illustrating their use.
The advanced topics part looks at various concepts that complicate data mining
applications. I concentrate on temporal data, spatial data, and Web mining. Again,
algorithms and pseudocode are provided.
In the appendix, production data mining systems are surveyed. I will keep a more
up-to-date list on the Web page for the book. I thank all the representatives of the various
companies who helped me correct and update my descriptions of their products.
All chapters include exercises covering the material in that chapter. In addition to
conventional types of exercises that either test the student's understanding of the material
or require him to apply what he has learned, I also include some exercises that require
implementation (coding) and research. A one-semester course would cover the core topics
and one or more of the advanced ones.

ACKNOWLEDGMENTS

Many people have helped with the completion of this book. Tamer Ozsu provided initial
advice and inspiration. My dear friend Bob Korfhage introduced me to much of computer
science, including pattern matching and information retrieval. Bob, I think of you often.
I particularly thank my graduate students for contributing a great deal to some of
the original wording and editing. Their assistance in reading and commenting on earlier
drafts has been invaluable. Matt McBride helped me prepare most of the original slides,
many of which are still available as a companion to the book. Yongqiao Xiao helped
write much of the material in the Web mining chapter. He also meticulously reviewed
an earlier draft of the book and corrected many mistakes. Le Gruenwald, Zahid Hossain,
Yasemin Seydim, and Al Xiao performed much of the research that provided information
found concerning association rules. Mario Nascimento introduced me to the world of
temporal databases, and I have used some of the information from his dissertation in
the temporal mining chapter. Nat Ayewah has been very patient with his explanations
of hidden Markov models and helped improve the wording of that section. Zhigang Li
has introduced me to the complex world of time series and helped write the solutions
manual. I've learned a lot, but still feel a novice in many of these areas.
The students in my CSE 8331 class (Spring 1999, Fall 2000, and Spring 2002) at
SMU have had to endure a great deal. I never realized how difficult it is to clearly word
algorithm descriptions and exercises until I wrote this book. I hope they learned something
even though at times the continual revisions necessary were, I'm sure, frustrating. Torsten
Staab wins the prize for finding and correcting the most errors. Students in my CSE 8331
class during Spring 2002 helped me prepare class notes and solutions to the exercises. I
thank them for their input.
My family has been extremely supportive in this endeavor. My husband, Jim, has
been (as always) understanding and patient with my odd work hours and lack of sleep.
A more patient and supportive husband could not be found. My daughter Stephanie has
put up with my moodiness caused by lack of sleep. Sweetie, I hope I haven't been too
short-tempered with you (ILYMMTYLM). At times I have been impatient with Kristina
but you know how much I love you. My Mom, sister Martha, and brother Dave as always
are there to provide support and love.
Some of the research required for this book was supported by the National Science
Foundation under Grant No. IIS-9820841. I would finally like to thank the reviewers
(Michael Huhns, Julia Rodger, Bob Cimikowski, Greg Speegle, Zoran Obradovic,
T.Y. Lin, and James Buckly) for their many constructive comments. I tried to implement
as many of these as I could.
PART ONE

INTRODUCTION
CHAPTER 1

Introduction

1.1 BASIC DATA MINING TASKS
1.2 DATA MINING VERSUS KNOWLEDGE DISCOVERY IN DATABASES
1.3 DATA MINING ISSUES
1.4 DATA MINING METRICS
1.5 SOCIAL IMPLICATIONS OF DATA MINING
1.6 DATA MINING FROM A DATABASE PERSPECTIVE
1.7 THE FUTURE
1.8 EXERCISES
1.9 BIBLIOGRAPHIC NOTES

The amount of data kept in computer files and databases is growing at a phenomenal rate.
At the same time, the users of these data are expecting more sophisticated information
from them. A marketing manager is no longer satisfied with a simple listing of marketing
contacts, but wants detailed information about customers' past purchases as well as
predictions of future purchases. Simple structured query language (SQL) queries are not
adequate to support these increased demands for information. Data mining steps in to meet
these needs. Data mining is often defined as finding hidden information in a database.
Alternatively, it has been called exploratory data analysis, data driven discovery, and
deductive learning.
Traditional database queries (Figure 1.1) access a database using a well-defined
query stated in a language such as SQL. The output of the query consists of the data
from the database that satisfies the query. The output is usually a subset of the database,
but it may also be an extracted view or may contain aggregations. Data mining access
of a database differs from this traditional access in several ways:

• Query: The query might not be well formed or precisely stated. The data miner
might not even be exactly sure of what he wants to see.

• Data: The data accessed is usually a different version from that of the original
operational database. The data have been cleansed and modified to better support
the mining process.

• Output: The output of the data mining query probably is not a subset of the
database. Instead it is the output of some analysis of the contents of the database.
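
To make the traditional side of this contrast concrete, here is a minimal sketch of a well-defined SQL query whose output is a subset of the database, followed by one whose output contains aggregations. It uses Python's standard sqlite3 module; the table name, columns, and rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical in-memory table of purchases; all names and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("ann", "bread", 2.5),
        ("ann", "jelly", 3.0),
        ("bob", "bread", 2.5),
        ("bob", "cd", 15.0),
    ],
)

# A well-defined query: the result is a subset of the stored rows.
subset = conn.execute(
    "SELECT customer, item, amount FROM purchases "
    "WHERE item = ? ORDER BY customer",
    ("bread",),
).fetchall()

# The result may also contain aggregations instead of raw rows.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM purchases "
    "GROUP BY customer ORDER BY customer"
).fetchall()

print(subset)
print(totals)
```

In both cases the request is stated precisely in advance and the answer is computed directly from the stored rows, which is the kind of access the three points above set data mining apart from.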

The current state of the art of data mining is similar to that of database query processing
in the late 1960s and early 1970s. Over the next decade there undoubtedly will be great
strides in extending the state of the art with respect to data mining. We probably will
see the development of "query processing" models, standards, and algorithms targeting
the data mining applications. We probably will also see new data structures designed for
the storage of databases being used for data mining applications. Although data mining
is currently in its infancy, over the last decade we have seen a proliferation of mining
algorithms, applications, and algorithmic approaches. Example 1.1 illustrates one such
application.

[Diagram: an SQL query goes to the DBMS, which accesses the database (DB) and returns the results.]

FIGURE 1.1: Database access.

EXAMPLE 1.1

Credit card companies must determine whether to authorize credit card purchases. Suppose
that based on past historical information about purchases, each purchase is placed
into one of four classes: (1) authorize, (2) ask for further identification before authorization,
(3) do not authorize, and (4) do not authorize but contact police. The data mining
functions here are twofold. First the historical data must be examined to determine how
the data fit into the four classes. Then the problem is to apply this model to each new
purchase. Although the second part indeed may be stated as a simple database query, the
first part cannot be.

Data mining involves many different algorithms to accomplish different tasks. All
of these algorithms attempt to fit a model to the data. The algorithms examine the data
and determine a model that is closest to the characteristics of the data being examined.
Data mining algorithms can be characterized as consisting of three parts:

• Model: The purpose of the algorithm is to fit a model to the data.

• Preference: Some criteria must be used to fit one model over another.

• Search: All algorithms require some technique to search the data.

In Example 1.1 the data are modeled as divided into four classes. The search requires
examining past data about credit card purchases and their outcome to determine what
criteria should be used to define the class structure. The preference will be given to
criteria that seem to fit the data best. For example, we probably would want to authorize
a credit card purchase for a small amount of money with a credit card belonging to a
long-standing customer. Conversely, we would not want to authorize the use of a credit
card to purchase anything if the card has been reported as stolen. The search process
requires that the criteria needed to fit the data to the classes be properly defined.
As seen in Figure 1.2, the model that is created can be either predictive or descriptive
in nature. In this figure, we show under each model type some of the most common
data mining tasks that use that type of model.

Data mining
  Predictive: Classification, Regression, Time series analysis, Prediction
  Descriptive: Clustering, Summarization, Association rules, Sequence discovery

FIGURE 1.2: Data mining models and tasks.

A predictive model makes a prediction about values of data using known results
found from different data. Predictive modeling may be made based on the use of
other historical data. For example, a credit card use might be refused not because of
the user's own credit history, but because the current purchase is similar to earlier
purchases that were subsequently found to be made with stolen cards. Example 1.1
uses predictive modeling to predict the credit risk. Predictive model data mining tasks
include classification, regression, time series analysis, and prediction. Prediction may
also be used to indicate a specific type of data mining function, as is explained in
Section 1.1.4.
A descriptive model identifies patterns or relationships in data. Unlike the predictive
model, a descriptive model serves as a way to explore the properties of the data examined,
not to predict new properties. Clustering, summarization, association rules, and sequence
discovery are usually viewed as descriptive in nature.

1.1 BASIC DATA MINING TASKS

In the following paragraphs we briefly explore some of the data mining functions. We
follow the basic outline of tasks shown in Figure 1.2. This list is not intended to be
exhaustive, but rather illustrative. Of course, these individual tasks may be combined to
obtain more sophisticated data mining applications.

1.1.1 Classification

Classification maps data into predefined groups or classes. It is often referred to as
supervised learning because the classes are determined before examining the data. Two
examples of classification applications are determining whether to make a bank loan and
identifying credit risks. Classification algorithms require that the classes be defined based
on data attribute values. They often describe these classes by looking at the characteristics
of data already known to belong to the classes. Pattern recognition is a type of
classification where an input pattern is classified into one of several classes based on
its similarity to these predefined classes. Example 1.1 illustrates a general classification
problem. Example 1.2 shows a simple example of pattern recognition.

EXAMPLE 1.2

An airport security screening station is used to determine if passengers are potential
terrorists or criminals. To do this, the face of each passenger is scanned and its basic
pattern (distance between eyes, size and shape of mouth, shape of head, etc.) is identified.
This pattern is compared to entries in a database to see if it matches any patterns that
are associated with known offenders.
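
The comparison step in Example 1.2 can be sketched as a nearest-pattern lookup. This is only an illustration under stated assumptions: the three numeric features, the stored values, the use of Euclidean distance, and the threshold are all hypothetical, and the text does not prescribe any particular similarity measure.

```python
import math

# Hypothetical stored patterns: (eye distance, mouth width, head width),
# in arbitrary units. Real systems would use far richer features.
KNOWN_PATTERNS = {
    "offender_1": (6.2, 5.1, 15.0),
    "offender_2": (5.4, 4.3, 14.2),
}


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(pattern, threshold=0.5):
    """Return the label of the closest stored pattern, or 'no match'
    when even the closest one is farther away than `threshold`."""
    best_label, best_dist = min(
        ((label, euclidean(pattern, stored))
         for label, stored in KNOWN_PATTERNS.items()),
        key=lambda pair: pair[1],
    )
    return best_label if best_dist <= threshold else "no match"


print(classify((6.1, 5.0, 15.1)))
print(classify((7.5, 6.0, 17.0)))
```

The classes (the known patterns plus "no match") are fixed before any passenger is scanned, which is what makes this a classification task and not a clustering one.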

1.1.2 Regression

Regression is used to map a data item to a real valued prediction variable. In actuality,
regression involves the learning of the function that does this mapping. Regression
assumes that the target data fit into some known type of function (e.g., linear, logistic,
etc.) and then determines the best function of this type that models the given data. Some
type of error analysis is used to determine which function is "best." Standard linear
regression, as illustrated in Example 1.3, is a simple example of regression.

EXAMPLE 1.3

A college professor wishes to reach a certain level of savings before her retirement.
Periodically, she predicts what her retirement savings will be based on its current value
and several past values. She uses a simple linear regression formula to predict this value
by fitting past behavior to a linear function and then using this function to predict the
values at points in the future. Based on these values, she then alters her investment
portfolio.

1.1.3 Time Series Analysis

With time series analysis, the value of an attribute is examined as it varies over time. The
values usually are obtained as evenly spaced time points (daily, weekly, hourly, etc.). A
time series plot (Figure 1.3) is used to visualize the time series. In this figure you can
easily see that the plots for Y and Z have similar behavior, while X appears to have less
volatility. There are three basic functions performed in time series analysis. In one case,
distance measures are used to determine the similarity between different time series. In
the second case, the structure of the line is examined to determine (and perhaps classify)
its behavior. A third application would be to use the historical time series plot to predict
future values. A time series example is given in Example 1.4.

[Plot of three series labeled X, Y, and Z.]

FIGURE 1.3: Time series plots.

EXAMPLE 1.4

Mr. Smith is trying to determine whether to purchase stock from Companies X, Y,
or Z. For a period of one month he charts the daily stock price for each company.
Figure 1.3 shows the time series plot that Mr. Smith has generated. Using this and
similar information available from his stockbroker, Mr. Smith decides to purchase stock
X because it is less volatile while overall showing a slightly larger relative amount of
growth than either of the other stocks. As a matter of fact, the stocks for Y and Z have
a similar behavior. The behavior of Y between days 6 and 20 is identical to that for Z
between days 13 and 27.

1.1.4 Prediction

Many real-world data mining applications can be seen as predicting future data states
based on past and current data. Prediction can be viewed as a type of classification. (Note:
This is a data mining task that is different from the prediction model, although the
prediction task is a type of prediction model.) The difference is that prediction is predicting
a future state rather than a current state. Here we are referring to a type of application
rather than to a type of data mining modeling approach, as discussed earlier. Prediction
applications include flooding, speech recognition, machine learning, and pattern
recognition. Although future values may be predicted using time series analysis or regression
techniques, other approaches may be used as well. Example 1.5 illustrates the process.

EXAMPLE 1.5

Predicting flooding is a difficult problem. One approach uses monitors placed at various
points in the river. These monitors collect data relevant to flood prediction: water level,
rain amount, time, humidity, and so on. Then the water level at a potential flooding point
in the river can be predicted based on the data collected by the sensors upriver from this
point. The prediction must be made with respect to the time the data were collected.

1.1.5 Clustering

Clustering is similar to classification except that the groups are not predefined, but rather
defined by the data alone. Clustering is alternatively referred to as unsupervised learning
or segmentation. It can be thought of as partitioning or segmenting the data into
groups that might or might not be disjointed. The clustering is usually accomplished by
determining the similarity among the data on predefined attributes. The most similar data
are grouped into clusters. Example 1.6 provides a simple clustering example. Since the
clusters are not predefined, a domain expert is often required to interpret the meaning of
the created clusters.
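
As a rough illustration of grouping by similarity on predefined attributes, the sketch below runs a plain k-means loop on two-dimensional points. K-means is one common clustering technique chosen here for brevity; this section does not name a specific algorithm, and the customer attributes, values, and starting centers are hypothetical.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points with caller-supplied initial centers.

    Returns (clusters, centers), where clusters[i] holds the points
    assigned to centers[i] in the final assignment pass.
    """
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance).
            idx = min(
                range(len(centers)),
                key=lambda i: (p[0] - centers[i][0]) ** 2
                + (p[1] - centers[i][1]) ** 2,
            )
            clusters[idx].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else old
            for c, old in zip(clusters, centers)
        ]
    return clusters, centers


# Hypothetical customers described by (age, income in thousands).
customers = [(25, 30), (27, 32), (24, 28), (55, 90), (58, 95), (60, 88)]
clusters, centers = kmeans(customers, centers=[(25, 30), (55, 90)])
print(clusters)
print(centers)
```

Nothing in the output says what the two groups mean. Labeling them (for instance as younger, lower-income versus older, higher-income customers) is the interpretation step that the text assigns to a domain expert. In practice the attributes would usually be rescaled first so that one of them does not dominate the distance.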
EXAMPLE 1.6

A certain national department store chain creates special catalogs targeted to various
demographic groups based on attributes such as income, location, and physical characteristics
of potential customers (age, height, weight, etc.). To determine the target mailings
of the various catalogs and to assist in the creation of new, more specific catalogs, the
company performs a clustering of potential customers based on the determined attribute
values. The results of the clustering exercise are then used by management to create
special catalogs and distribute them to the correct target population based on the cluster
for that catalog.

A special type of clustering is called segmentation. With segmentation a database
is partitioned into disjointed groupings of similar tuples called segments. Segmentation
is often viewed as being identical to clustering. In other circles segmentation is viewed
as a specific type of clustering applied to a database itself. In this text we use the two
terms, clustering and segmentation, interchangeably.

1.1.6 Summarization

Summarization maps data into subsets with associated simple descriptions. Summarization
is also called characterization or generalization. It extracts or derives representative
information about the database. This may be accomplished by actually retrieving portions
of the data. Alternatively, summary type information (such as the mean of some numeric
attribute) can be derived from the data. The summarization succinctly characterizes the
contents of the database. Example 1.7 illustrates this process.

EXAMPLE 1.7

One of the many criteria used to compare universities by the U.S. News & World Report
is the average SAT or ACT score [GM99]. This is a summarization used to estimate the
type and intellectual level of the student body.

1.1.7 Association Rules

Link analysis, alternatively referred to as affinity analysis or association, refers to the
data mining task of uncovering relationships among data. The best example of this
type of application is to determine association rules. An association rule is a model that
identifies specific types of data associations. These associations are often used in the retail
sales community to identify items that are frequently purchased together. Example 1.8
illustrates the use of association rules in market basket analysis. Here the data analyzed
consist of information about what items a customer purchases. Associations are also used
in many other applications such as predicting the failure of telecommunication switches.

EXAMPLE 1.8

A grocery store retailer is trying to decide whether to put bread on sale. To help determine
the impact of this decision, the retailer generates association rules that show what other
products are frequently purchased with bread. He finds that 60% of the time that bread is
sold so are pretzels and that 70% of the time jelly is also sold. Based on these facts, he
tries to capitalize on the association between bread, pretzels, and jelly by placing some
pretzels and jelly at the end of the aisle where the bread is placed. In addition, he decides
not to place either of these items on sale at the same time.

Users of association rules must be cautioned that these are not causal relationships.
They do not represent any relationship inherent in the actual data (as is true with
functional dependencies) or in the real world. There probably is no relationship between
bread and pretzels that causes them to be purchased together. And there is no guarantee
that this association will apply in the future. However, association rules can be used to
assist retail store management in effective advertising, marketing, and inventory control.

1.1.8 Sequence Discovery

Sequential analysis or sequence discovery is used to determine sequential patterns in data.
These patterns are based on a time sequence of actions. These patterns are similar to
associations in that data (or events) are found to be related, but the relationship is based
on time. Unlike a market basket analysis, which requires the items to be purchased at
the same time, in sequence discovery the items are purchased over time in some order.
Example 1.9 illustrates the discovery of some simple patterns. A similar type of discovery
can be seen in the sequence within which data are purchased. For example, most people
who purchase CD players may be found to purchase CDs within one week. As we will
see, temporal association rules really fall into this category.

EXAMPLE 1.9

The Webmaster at the XYZ Corp. periodically analyzes the Web log data to determine
how users of the XYZ's Web pages access them. He is interested in determining what
sequences of pages are frequently accessed. He determines that 70 percent of the users
of page A follow one of the following patterns of behavior: (A, B, C) or (A, D, B, C)
or (A, E, B, C). He then determines to add a link directly from page A to page C.

1.2 DATA MINING VERSUS KNOWLEDGE DISCOVERY IN DATABASES

The terms knowledge discovery in databases (KDD) and data mining are often used
interchangeably. In fact, there have been many other names given to this process of
discovering useful (hidden) patterns in data: knowledge extraction, information discovery,
exploratory data analysis, information harvesting, and unsupervised pattern recognition.
Over the last few years KDD has been used to refer to a process consisting of many
steps, while data mining is only one of these steps. This is the approach taken in this
book. The following definitions are modified from those found in [FPSS96c, FPSS96a].

DEFINITION 1.1. Knowledge discovery in databases (KDD) is the process of
finding useful information and patterns in data.

DEFINITION 1.2. Data mining is the use of algorithms to extract the information
and patterns derived by the KDD process.
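
As one small instance of an algorithm extracting a pattern in the sense of Definition 1.2, the percentages quoted in Example 1.8 can be computed from transaction data. The sketch below is an assumption-laden illustration: the function name `confidence` follows the usual term for this ratio in the association rule literature, and the baskets are hypothetical, constructed so that the results match the 60% and 70% figures in the example.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also
    contain `consequent`. Both arguments are sets of items."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    both = [t for t in with_antecedent if consequent <= t]
    return len(both) / len(with_antecedent)


# Hypothetical market baskets: 10 contain bread, 6 of those contain
# pretzels, and 7 of those contain jelly.
baskets = (
    [{"bread", "pretzels", "jelly"}] * 4
    + [{"bread", "pretzels"}] * 2
    + [{"bread", "jelly"}] * 3
    + [{"bread"}] * 1
    + [{"milk"}] * 2
)

print(confidence(baskets, {"bread"}, {"pretzels"}))
print(confidence(baskets, {"bread"}, {"jelly"}))
```

The ratio only counts co-occurrence in the recorded baskets. As the discussion of association rules cautions, it says nothing about whether buying bread causes anyone to buy pretzels.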
10 Chapter 1 Introduction Section 1.2 Data Mining Versus Knowledge Discovery in Databases 11

The KDD process is often said to be nontrivial; however, we take the larger view that modified to facilitate use by techniques that require specific types of data distributions.
KDD is an all-encompassing concept. A traditional SQL database query can be viewed Some attribute values may be combined to provide new values, thus reducing the com­
as the data mining part of a KDD process. Indeed, this may be viewed as som�what plexity of the data. For example, current date and birth date could be replaced by age.
simple and trivial. However, this was not the case 30 years ago. If we were to advance One attribute could be substituted for another. An example would be replacing a sequence
30 years into the future, we might find that processes thought of today as nontrivial and of actual attribute values with the differences between consecutive values. Real valued
complex will be viewed as equally simple. The definition of KDD includes the keyword attributes may be more easily handled by partitioning the values into ranges and using
useful. Although some definitions have included the term "potentially useful," we believe these discrete range values. Some data values may actually be removed. Outliers, extreme
that if the information found in the process is not useful, then it really is not information. values that occur infrequently, may actually be removed. The data may be transformed
Of course, the idea of being useful is relative and depends on the individuals involved. by applying a function to the values. A common transformation function is to use the log
'
KDD is a process that involves many different steps. The input to this process is of the value rather than the value itself. These techniques make the mining task easier by
the data, and the output is the useful information desired by the users. However, the reducing the dimensionality (number of attributes) or by reducing the variability of the
objective may be unclear or inexact. The process itself is interactive and may require data values. The removal of outliers can actually improve the quality of the results. As
much elapsed time. To ensure the usefulness and accuracy of the results of the process, with all steps in the KDD process, however, care must be used in performing transfor­
interaction throughout the process with both domain experts and technical experts might be needed. Figure 1.4 (modified from [FPSS96c]) illustrates the overall KDD process.

The KDD process consists of the following five steps [FPSS96c]:

• Selection: The data needed for the data mining process may be obtained from many different and heterogeneous data sources. This first step obtains the data from various databases, files, and nonelectronic sources.

• Preprocessing: The data to be used by the process may have incorrect or missing data. There may be anomalous data from multiple sources involving different data types and metrics. There may be many different activities performed at this time. Erroneous data may be corrected or removed, whereas missing data must be supplied or predicted (often using data mining tools).

• Transformation: Data from different sources must be converted into a common format for processing. Some data may be encoded or transformed into more usable formats. Data reduction may be used to reduce the number of possible data values being considered.

• Data mining: Based on the data mining task being performed, this step applies algorithms to the transformed data to generate the desired results.

• Interpretation/evaluation: How the data mining results are presented to the users is extremely important because the usefulness of the results is dependent on it. Various visualization and GUI strategies are used at this last step.

Transformation techniques are used to make the data easier to mine and more useful, and to provide more meaningful results. The actual distribution of the data may be [...]

[Figure: steps Selection, Preprocessing, Transformation, Data mining, Interpretation; stages Initial data, Target data, Preprocessed data, Transformed data, Model, Knowledge]
FIGURE 1.4: KDD process (modified from [FPSS96c]).

[...]mation. If used incorrectly, the transformation could actually change the data such that the results of the data mining step are inaccurate.

Visualization refers to the visual presentation of data. The old expression "a picture is worth a thousand words" certainly is true when examining the structure of data. For example, a line graph that shows the distribution of a data variable is easier to understand and perhaps more informative than the formula for the corresponding distribution. The use of visualization techniques allows users to summarize, extract, and grasp more complex results than more mathematical or text type descriptions of the results. Visualization techniques include:

• Graphical: Traditional graph structures including bar charts, pie charts, histograms, and line graphs may be used.

• Geometric: Geometric techniques include the box plot and scatter diagram techniques.

• Icon-based: Using figures, colors, or other icons can improve the presentation of the results.

• Pixel-based: With these techniques each data value is shown as a uniquely colored pixel.

• Hierarchical: These techniques hierarchically divide the display area (screen) into regions based on data values.

• Hybrid: The preceding approaches can be combined into one display.

Any of these approaches may be two-dimensional or three-dimensional. Visualization tools can be used to summarize data as a data mining technique itself. In addition, visualization can be used to show the complex results of data mining tasks.

The data mining process itself is complex. As we will see in later chapters, there are many different data mining applications and algorithms. These algorithms must be carefully applied to be effective. Discovered patterns must be correctly interpreted and properly evaluated to ensure that the resulting information is meaningful and accurate.
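
As a rough illustration of the five KDD steps listed earlier (selection, preprocessing, transformation, data mining, and interpretation/evaluation), the following Python sketch chains them as plain functions. Everything concrete in it is an assumption made for illustration and is not taken from [FPSS96c]: the two sample sources, the field names, the mean-based fill-in for missing values, and the deliberately trivial "model" that labels a record as tall when it is above the mean height.

```python
from statistics import mean

# Hypothetical records from two sources that use different units.
# None marks a missing value.
SOURCE_A = [{"id": 1, "height_in": 70.0}, {"id": 2, "height_in": None},
            {"id": 3, "height_in": 66.0}]
SOURCE_B = [{"id": 4, "height_cm": 160.0}, {"id": 5, "height_cm": 190.0}]


def select(*sources):
    """Selection: collect the needed records from several sources."""
    return [dict(rec) for source in sources for rec in source]


def preprocess(records):
    """Preprocessing: supply missing values (assumed rule: the mean of the
    known values stored under the same field name)."""
    for field in ("height_in", "height_cm"):
        known = [r[field] for r in records if r.get(field) is not None]
        for r in records:
            if field in r and r[field] is None:
                r[field] = mean(known)
    return records


def transform(records):
    """Transformation: convert every record to one common format (cm)."""
    out = []
    for r in records:
        cm = r["height_cm"] if "height_cm" in r else r["height_in"] * 2.54
        out.append({"id": r["id"], "height_cm": round(cm, 1)})
    return out


def mine(records):
    """Data mining: a toy 'model' that labels a record 'tall' when its
    height exceeds the mean height of all records."""
    threshold = mean(r["height_cm"] for r in records)
    return {r["id"]: "tall" if r["height_cm"] > threshold else "not tall"
            for r in records}


def interpret(labels):
    """Interpretation/evaluation: present the result in readable form."""
    return [f"record {rid}: {label}" for rid, label in sorted(labels.items())]


def kdd(*sources):
    """Run the five steps in order."""
    return interpret(mine(transform(preprocess(select(*sources)))))


if __name__ == "__main__":
    for line in kdd(SOURCE_A, SOURCE_B):
        print(line)
```

Keeping each step as its own function mirrors the point made in the text that every stage can change the data, so a poor choice in one stage (for example, the fill-in rule) can be swapped out and its effect on the mined result examined in isolation.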
[Figure: contributing areas shown are Information retrieval, Databases, Statistics, Algorithms, and Machine learning]
FIGURE 1.5: Historical perspective of data mining.

1.2.1 The Development of Data Mining

The current evolution of data mining functions and products is the result of years of influence from many disciplines, including databases, information retrieval, statistics, algorithms, and machine learning (Figure 1.5). Another computer science area that has had a major impact on the KDD process is multimedia and graphics. A major goal of KDD is to be able to describe the results of the KDD process in a meaningful manner. Because many different results are often produced, this is a nontrivial problem. Visualization techniques often involve sophisticated multimedia and graphics presentations. In addition, data mining techniques can be applied to multimedia applications.

Unlike previous research in these disparate areas, a major trend in the database community is to combine results from these seemingly different disciplines into one unifying data or algorithmic approach. Although in its infancy, the ultimate goal of this evolution is to develop a "big picture" view of the area that will facilitate integration of the various types of applications into real-world user domains.

Table 1.1 shows developments in the areas of artificial intelligence (AI), information retrieval (IR), databases (DB), and statistics (Stat) leading to the current view of data mining. These different historical influences, which have led to the development of the total data mining area, have given rise to different views of what data mining functions actually are [RG99]:

• Induction is used to proceed from very specific knowledge to more general information. This type of technique is often found in AI applications.

• Because the primary objective of data mining is to describe some characteristics of a set of data by a general model, this approach can be viewed as a type of compression. Here the detailed data within the database are abstracted and compressed to a smaller description of the data characteristics that are found in the model.

• As stated earlier, the data mining process itself can be viewed as a type of querying the underlying database. Indeed, an ongoing direction of data mining research is how to define a data mining query and whether a query language (like SQL) can be developed to capture the many different types of data mining queries.

• Describing a large database can be viewed as using approximation to help uncover hidden information about the data.

• When dealing with large databases, the impact of size and efficiency of developing an abstract model can be thought of as a type of search problem.

TABLE 1.1: Time Line of Data Mining Development

Time         Area  Contribution                                      Reference
Late 1700s   Stat  Bayes theorem of probability                      [Bay63]
Early 1900s  Stat  Regression analysis
Early 1920s  Stat  Maximum likelihood estimate                       [Fis21]
Early 1940s  AI    Neural networks                                   [MP43]
Early 1950s        Nearest neighbor                                  [FJ51]
Early 1950s        Single link                                       [FLP+51]
Late 1950s   AI    Perceptron                                        [Ros58]
Late 1950s   Stat  Resampling, bias reduction, jackknife estimator
Early 1960s  AI    ML started                                        [FF63]
Early 1960s  DB    Batch reports
Mid 1960s          Decision trees                                    [HMS66]
Mid 1960s    Stat  Linear models for classification                  [Nil65]
             IR    Similarity measures
             IR    Clustering
             Stat  Exploratory data analysis (EDA)
Late 1960s   DB    Relational data model                             [Cod70]
Early 1970s  IR    SMART IR systems                                  [Sal71]
Mid 1970s    AI    Genetic algorithms                                [Hol75]
Late 1970s   Stat  Estimation with incomplete data (EM algorithm)    [DLR77]
Late 1970s   Stat  K-means clustering
Early 1980s  AI    Kohonen self-organizing map                       [Koh82]
Mid 1980s    AI    Decision tree algorithms                          [Qui86]
Early 1990s  DB    Association rule algorithms
                   Web and search engines
1990s        DB    Data warehousing
1990s        DB    Online analytic processing (OLAP)

It is interesting to think about the various data mining problems and how each may be viewed in several different perspectives based on the viewpoint and background of the researcher or developer. We mention these different perspectives only to give the reader the full picture of data mining. Often, due to the varied backgrounds of the data mining participants, we find that the same problems (and perhaps even the same solutions) are described differently. Indeed, different terminologies can lead to misunderstandings and
apprehension among the different players. You can see statisticians voice concern over the compounded use of estimates (approximation) with results being generalized when they should not be. Database researchers often voice concern about the inefficiency of many proposed AI algorithms, particularly when used on very large databases. IR and those interested in data mining of textual databases might be concerned about the fact that many algorithms are targeted only to numeric data. The approach taken in this book is to examine data mining contributions from all these different disciplines together.

There are at least two issues that characterize a database perspective of examining data mining concepts: efficiency and scalability. Any solutions to data mining problems must be able to perform well against real-world databases. As part of the efficiency, we are concerned about both the algorithms and the data structures used. Parallelization may be used to improve efficiency. In addition, how the proposed algorithms behave as the associated database is updated is also important. Many proposed data mining algorithms may work well against a static database, but they may be extremely inefficient as changes are made to the database. As database practitioners, we are interested in how algorithms perform against very large databases, not "toy" problems. We also usually assume that the data are stored on disk and may even be distributed.

1.3 DATA MINING ISSUES

There are many important implementation issues associated with data mining:

1. Human interaction: Since data mining problems are often not precisely stated, interfaces may be needed with both domain and technical experts. Technical experts are used to formulate the queries and assist in interpreting the results. Users are needed to identify training data and desired results.

2. Overfitting: When a model is generated that is associated with a given database state, it is desirable that the model also fit future database states. Overfitting occurs when the model does not fit future states. This may be caused by assumptions that are made about the data or may simply be caused by the small size of the training database. For example, a classification model for an employee database may be developed to classify employees as short, medium, or tall. If the training database is quite small, the model might erroneously indicate that a short person is anyone under five feet eight inches because there is only one entry in the training database under five feet eight. In this case, many future employees would be erroneously classified as short. Overfitting can arise under other circumstances as well, even though the data are not changing.

3. Outliers: There are often many data entries that do not fit nicely into the derived model. This becomes even more of an issue with very large databases. If a model is developed that includes these outliers, then the model may not behave well for data that are not outliers.

4. Interpretation of results: Currently, data mining output may require experts to correctly interpret the results, which might otherwise be meaningless to the average database user.

5. Visualization of results: To easily view and understand the output of data mining algorithms, visualization of the results is helpful.

6. Large datasets: The massive datasets associated with data mining create problems when applying algorithms designed for small datasets. Many modeling applications grow exponentially on the dataset size and thus are too inefficient for larger datasets. Sampling and parallelization are effective tools to attack this scalability problem.

7. High dimensionality: A conventional database schema may be composed of many different attributes. The problem here is that not all attributes may be needed to solve a given data mining problem. In fact, the use of some attributes may interfere with the correct completion of a data mining task. The use of other attributes may simply increase the overall complexity and decrease the efficiency of an algorithm. This problem is sometimes referred to as the dimensionality curse, meaning that there are many attributes (dimensions) involved and it is difficult to determine which ones should be used. One solution to this high dimensionality problem is to reduce the number of attributes, which is known as dimensionality reduction. However, determining which attributes are not needed is not always easy to do.

8. Multimedia data: Most previous data mining algorithms are targeted to traditional data types (numeric, character, text, etc.). The use of multimedia data such as is found in GIS databases complicates or invalidates many proposed algorithms.

9. Missing data: During the preprocessing phase of KDD, missing data may be replaced with estimates. This and other approaches to handling missing data can lead to invalid results in the data mining step.

10. Irrelevant data: Some attributes in the database might not be of interest to the data mining task being developed.

11. Noisy data: Some attribute values might be invalid or incorrect. These values are often corrected before running data mining applications.

12. Changing data: Databases cannot be assumed to be static. However, most data mining algorithms do assume a static database. This requires that the algorithm be completely rerun anytime the database changes.

13. Integration: The KDD process is not currently integrated into normal data processing activities. KDD requests may be treated as special, unusual, or one-time needs. This makes them inefficient, ineffective, and not general enough to be used on an ongoing basis. Integration of data mining functions into traditional DBMS systems is certainly a desirable goal.

14. Application: Determining the intended use for the information obtained from the data mining function is a challenge. Indeed, how business executives can effectively use the output is sometimes considered the more difficult part, not the running of the algorithms themselves. Because the data are of a type that has not previously been known, business practices may have to be modified to determine how to effectively use the information uncovered.

These issues should be addressed by data mining algorithms and products.

1.4 DATA MINING METRICS

Measuring the effectiveness or usefulness of a data mining approach is not always straightforward. In fact, different metrics could be used for different techniques and
also based on the interest level. From an overall business or usefulness perspective, a measure such as return on investment (ROI) could be used. ROI examines the difference between what the data mining technique costs and what the savings or benefits from its use are. Of course, this would be difficult to measure because the return is hard to quantify. It could be measured as increased sales, reduced advertising expenditure, or both. In a specific advertising campaign implemented via targeted catalog mailings, the percentage of catalog recipients and the amount of purchase per recipient would provide one means to measure the effectiveness of the mailings.

In this text, however, we use a more computer science/database perspective to measure various data mining approaches. We assume that the business management has determined that a particular data mining application be made. They subsequently will determine the overall effectiveness of the approach using some ROI (or related) strategy. Our objective is to compare different alternatives to implementing a specific data mining task. The metrics used include the traditional metrics of space and time based on complexity analysis. In some cases, such as accuracy in classification, more specific metrics targeted to a data mining task may be used.

1.5 SOCIAL IMPLICATIONS OF DATA MINING

The integration of data mining techniques into normal day-to-day activities has become commonplace. We are confronted daily with targeted advertising, and businesses have become more efficient through the use of data mining activities to reduce costs. Data mining adversaries, however, are concerned that this information is being obtained at the cost of reduced privacy. Data mining applications can derive much demographic information concerning customers that was previously not known or hidden in the data. The unauthorized use of such data could result in the disclosure of information that is deemed to be confidential.

We have recently seen an increase in interest in data mining techniques targeted to such applications as fraud detection, identifying criminal suspects, and prediction of potential terrorists. These can be viewed as types of classification problems. The approach that is often used here is one of "profiling" the typical behavior or characteristics involved. Indeed, many classification techniques work by identifying the attribute values that commonly occur for the target class. Subsequent records will be then classified based on these attribute values. Keep in mind that these approaches to classification are imperfect. Mistakes can be made. Just because an individual makes a series of credit card purchases that are similar to those often made when a card is stolen does not mean that the card is stolen or that the individual is a criminal.

Users of data mining techniques must be sensitive to these issues and must not violate any privacy directives or guidelines.

1.6 DATA MINING FROM A DATABASE PERSPECTIVE

Data mining can be studied from many different perspectives. An IR researcher probably would concentrate on the use of data mining techniques to access text data; a statistician might look primarily at the historical techniques, including time series analysis, hypothesis testing, and applications of Bayes theorem; a machine learning specialist might be interested primarily in data mining algorithms that learn; and an algorithms researcher would be interested in studying and comparing algorithms based on type and complexity.

The study of data mining from a database perspective involves looking at all types of data mining applications and techniques. However, we are interested primarily in those that are of practical interest. While our interest is not limited to any particular type of algorithm or approach, we are concerned about the following implementation issues:

• Scalability: Algorithms that do not scale up to perform well with massive real-world datasets are of limited application. Related to this is the fact that techniques should work regardless of the amount of available main memory.

• Real-world data: Real-world data are noisy and have many missing attribute values. Algorithms should be able to work even in the presence of these problems.

• Update: Many data mining algorithms work with static datasets. This is not a realistic assumption.

• Ease of use: Although some algorithms may work well, they may not be well received by users if they are difficult to use or understand.

These issues are crucial if applications are to be accepted and used in the workplace. Throughout the text we will mention how techniques perform in these and other implementation categories.

Data mining today is in a similar state as that of databases in the early 1960s. At that time, each database application was implemented independently even though there were many similarities between different applications. In the mid 1960s, an abundance of database management systems (DBMS) like tools (such as bill of material systems including DBOMP and CFMS) emerged. While these made the development of applications easier, there were still different tools for different applications. The rise of DBMS occurred in the early 1970s. Their success has been due partly to the abstraction of data definition and access primitives to a small core of needed requirements. This abstraction process has yet to be performed for data mining tasks. Each task is treated separately. Most data mining work (to date) has focused on specific algorithms to realize each individual data mining task. There is no accepted abstraction to a small set of primitives. One goal of some database researchers is the development of such an abstraction.

One crucial part of the database abstraction is query processing support. One reason relational databases are so popular today is the development of SQL. It is easy to use (at least when compared with earlier query languages such as the DBTG or IMS DML) and has become a standard query language implemented by all major DBMS vendors. SQL also has well-defined optimization strategies. Although there currently is no corresponding data mining language, there is ongoing work in the area of extending SQL to support data mining tasks.

1.7 THE FUTURE

The advent of the relational data model and SQL were milestones in the evolution of database systems. Currently, data mining is little more than a set of tools that can be used to uncover previously hidden information in a database. While there are many tools to aid in this process, there is no all-encompassing model or approach. Over the next few years, not only will there be more efficient algorithms with better interface techniques, but also steps will be taken to develop an all-encompassing model for data mining. While it may
not look like the relational model, it probably will include similar items: algorithms, data model, and metrics for goodness (like normal forms). Current data mining tools require much human interaction not only to define the request, but also to interpret the results. As the tools become better and more integrated, this extensive human interaction is likely to decrease. The various data mining applications are of many diverse types, so the development of a complete data mining model is desirable. A major development will be the creation of a sophisticated "query language" that includes traditional SQL functions as well as more complicated requests such as those found in OLAP (online analytic processing) and data mining applications.

A data mining query language (DMQL) based on SQL has been proposed. Unlike SQL, where the access is assumed to be only to relational databases, DMQL allows access to background information such as concept hierarchies. Another difference is that the retrieved data need not be a subset or aggregate of data from relations. Thus, a DMQL statement must indicate the type of knowledge to be mined. Another difference is that a DMQL statement can indicate the necessary importance or threshold that any mined information should obey. A BNF statement of DMQL (from [Zai99]) is:

    <DMQL> ::=
        USE DATABASE <database_name>
        {USE HIERARCHY <hierarchy_name> FOR <attribute>}
        <rule_spec>
        RELATED TO <attr_or_agg_list>
        FROM <relation(s)>
        [WHERE <condition>]
        [ORDER BY <order-list>]
        {WITH [<kinds_of>] THRESHOLD <threshold_value>
            [FOR <attribute(s)>]}

The heart of a DMQL statement is the rule specification portion. This is where the true data mining request is made. The data mining request can be one of the following [HFW+96]:

• A generalized relation is obtained by generalizing data from input data.

• A characteristic rule is a condition that is satisfied by almost all records in a target class.

• A discriminate rule is a condition that is satisfied by a target class but is different from conditions satisfied in other classes.

• A classification rule is a set of rules that are used to classify data.

The term knowledge and data discovery management system (KDDMS) has been coined to describe the future generation of data mining systems that include not only data mining tools but also techniques to manage the underlying data, ensure its consistency, and provide concurrency and recovery features. A KDDMS will provide access via ad hoc data mining queries that have been optimized for efficient access.

A KDD process model, CRISP-DM (CRoss-Industry Standard Process for Data Mining), has arisen and is applicable to many different applications. The model addresses all steps in the KDD process, including the maintenance of the results of the data mining step. The CRISP-DM life cycle contains the following steps: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. The steps involved in the CRISP-DM model can be summarized as "the 5As": assess, access, analyze, act, and automate.

1.8 EXERCISES

1. Identify and describe the phases in the KDD process. How does KDD differ from data mining?

2. Gather temperature data at one location every hour starting at 8:00 A.M. for 12 straight hours on 3 different days. Plot the three sets of time series data on the same graph. Analyze the three curves. Do they behave in the same manner? Does there appear to be a trend in the temperature during the day? Are the three plots similar? Predict what the next temperature value would have been for the next hour in each of the 3 days. Compare your prediction with the actual value that occurred.

3. Identify what work you performed in each step of the KDD process for exercise 2. What were the data mining activities you completed?

4. Describe which of the data mining issues you encountered while completing exercise 2.

5. Describe how each of the data mining issues discussed in section 1.3 is compounded by the use of real production databases.

6. (Research) Find two other definitions for data mining. Compare these definitions with the one found in this chapter.

7. (Research) Find at least three examples of data mining applications that have appeared in the business section of your local newspaper or other news publication. Describe the data mining applications involved.

1.9 BIBLIOGRAPHIC NOTES

Although many excellent books have been published that examine data mining and knowledge discovery in databases, most are high-level books that target users of data mining techniques and business professionals. There have been, however, some other technical books that examine data mining approaches and algorithms. An excellent text that is written by one of the foremost experts in the area is Data Mining Concepts and Techniques by Jiawei Han and Micheline Kamber [HK01]. This book not only examines data mining algorithms, but also includes a thorough coverage of data warehousing, OLAP, preprocessing, and data mining language developments. Other books that provide a technical survey of portions of data mining algorithms include [Ada00] and [HMS01].

There have been several recent surveys and overviews of data mining, including special issues of the Communications of the ACM in November 1996 and November 1999, IEEE Transactions on Knowledge and Data Engineering in December 1996, and Computer in August 1999. Other survey articles can be found: [FPSS96c], [FPSS96b], [GGR99a], [Man96], [Man97], and [RG99]. A popular tutorial booklet has been produced by Two Crows Corporation [Cor99]. A complete discussion of the KDD process is found in [BA96]. Articles that examine the intersection between databases and data mining include [Cha97], [Cha98], [CHY96], [Fay98], and [HKMT95]. There have also been
several tutorials surveying data mining concepts: [Agr94], [Agr95], [Han96], and [RS99]. A recent tutorial [Kei97] provided a thorough survey of visualization techniques as well as a comprehensive bibliography.

The aspect of parallel and distributed data mining has become an important research topic. A workshop on Large-Scale Parallel KDD Systems was held in 1999 [ZH00].

The idea of developing an approach to unifying all data mining activities has been proposed in [FPSS96b], [Man96], and [Man97]. The term KDDMS was first proposed in [IM96]. A recent unified model and algebra that supports all major data mining tasks has been proposed [JLN00]. The 3W model views data as being divided into three dimensions. An algebra, called the dimension algebra, has been proposed to access this three-dimensional world.

DMQL was developed at Simon Fraser University [HFW+96].

There are several KDD and data mining resources. The ACM (Association for Computing Machinery) has a special interest group, SIGKDD, devoted to the promotion and dissemination of KDD information. SIGKDD Explorations is a free newsletter produced by ACM SIGKDD. The ACM SIGKDD home page contains a wealth of resources concerning KDD and data mining (www.acm.org/sigkdd).

A vendor-led group, Data Mining Group (DMG), is active in the development of data mining standards. Information about DMG can be found at www.dmg.org. The ISO/IEC standards group has created a final committee draft for an SQL standard including data mining extensions [Com01]. In addition, a project begun by a consortium of data mining vendors and users resulted in the development of the data mining process model, CRISP-DM (see: www.crisp-dm.org).

There currently are several research journals related to data mining. These include IEEE Transactions on Knowledge and Data Engineering published by IEEE Computer Society and Data Mining and Knowledge Discovery from Kluwer Academic Publishers. International KDD conferences include the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the Conference on Information and Knowledge Management (CIKM), the IEEE International Conference on Data Mining (ICDM), the European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), and the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). KDnuggets News is an e-mail newsletter that is produced biweekly. It contains a wealth of KDD and data mining information for practitioners, users, and researchers. Subscriptions are free at www.kdnuggets.com. Additional KDD resources can be found at Knowledge Discovery Central (www.kdcentral.com).

CHAPTER 2

Related Concepts

2.1 DATABASE/OLTP SYSTEMS
2.2 FUZZY SETS AND FUZZY LOGIC
2.3 INFORMATION RETRIEVAL
2.4 DECISION SUPPORT SYSTEMS
2.5 DIMENSIONAL MODELING
2.6 DATA WAREHOUSING
2.7 OLAP
2.8 WEB SEARCH ENGINES
2.9 STATISTICS
2.10 MACHINE LEARNING
2.11 PATTERN MATCHING
2.12 SUMMARY
2.13 EXERCISES
2.14 BIBLIOGRAPHIC NOTES

Data mining applications have existed for thousands of years. For example, the classification of plants as edible or nonedible is a data mining task. The development of the data mining discipline has its roots in many other areas. In this chapter we examine many concepts related to data mining. We briefly introduce each concept and indicate how it is related to data mining.

2.1 DATABASE/OLTP SYSTEMS

A database is a collection of data usually associated with some organization or enterprise.
Unlike a simple set, data in a database are usually viewed to have a particular structure
or schema with which it is associated. For example, (ID, Name, Address, Salary, JobNo)
may be the schema for a personnel database. Here the schema indicates that each record
(or tuple) in the database has a value for each of these five attributes. Unlike a file, a
database is independent of the physical method used to store it on disk (or other media).
It also is independent of the applications that access it. A database management system
(DBMS) is the software used to access a database.
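
To make the schema idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. Only the five attribute names come from the text; the table name `personnel`, the column types, and the sample row are assumptions chosen for illustration. The sketch also shows, in a small way, that an application reaches the data through the DBMS rather than through the physical storage format.

```python
import sqlite3

# An in-memory database; the DBMS (SQLite here) hides the physical storage.
conn = sqlite3.connect(":memory:")

# Schema from the text: (ID, Name, Address, Salary, JobNo).
# The table name and column types are assumed for this sketch.
conn.execute(
    """CREATE TABLE personnel (
           ID      INTEGER PRIMARY KEY,
           Name    TEXT,
           Address TEXT,
           Salary  REAL,
           JobNo   INTEGER
       )"""
)

# Each record (tuple) supplies a value for each of the five attributes.
conn.execute(
    "INSERT INTO personnel VALUES (?, ?, ?, ?, ?)",
    (1, "A. Smith", "12 Main St", 52000.0, 7),  # made-up sample row
)

# Read the attribute names back from the DBMS, then fetch the record.
columns = [row[1] for row in conn.execute("PRAGMA table_info(personnel)")]
record = conn.execute("SELECT * FROM personnel WHERE ID = ?", (1,)).fetchone()
print(columns)
print(record)
```

The query names attributes and a table only; nothing in the application code depends on how or where SQLite lays the bytes out, which is the independence property described above.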
Data stored in a database are often viewed in a more abstract manner or data
model. This data model is used to describe the data, attributes, and relationships among
them. A data model is independent of the particular DBMS used to implement and
access the database. In effect, it can be viewed as a documentation and communication
tool to convey the type and structure of the actual data. A common data model is the
