DM Chapter 1

The document provides an overview of data mining, detailing its processes, architecture, functionalities, technologies, and applications. It outlines the steps involved in the knowledge discovery process, the types of data that can be mined, and the major issues faced in the field. Additionally, it compares data warehouses and data marts, discusses various applications across sectors, and contrasts classification and regression techniques.

Uploaded by

maharanadebu
Copyright
© All Rights Reserved
Chapter 1

Introduction

Data Mining: Data mining is the process of extracting useful information from large sets of
data. It involves using various techniques from statistics, machine learning, and database
systems to identify patterns, relationships, and trends in the data. This information can then be
used to make data-driven decisions, solve business problems, and uncover hidden insights.

KDD Process in Data Mining: Data mining is also referred to as knowledge discovery from data
(KDD). The knowledge discovery process is depicted in Figure 1.

Figure 1: Data mining as a step in the process of knowledge discovery

The knowledge discovery process consists of the following steps:


(a) Data Cleaning − In this step, noise and inconsistent data are removed.
(b) Data Integration − In this step, multiple data sources are combined.
(c) Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
(d) Data Transformation − In this step, data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
(e) Data Mining − In this step, intelligent methods are applied in order to extract data patterns.
(f) Pattern Evaluation − In this step, the discovered patterns are evaluated to identify those
that are truly interesting.
(g) Knowledge Presentation − In this step, the mined knowledge is presented to the user.
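
The steps above can be sketched as a small pipeline. This is only an illustrative sketch: the
records, the field names, and the function names (clean, integrate, select, transform, mine,
evaluate, present) are assumptions made for the example, and "mining" is reduced to counting
how many customer baskets contain each item.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [
    {"customer": "c1", "item": "bread", "amount": 2.5},
    {"customer": "c2", "item": "milk", "amount": None},  # noisy record
    {"customer": "c1", "item": "milk", "amount": 1.2},
]
store_b = [
    {"customer": "c3", "item": "bread", "amount": 2.4},
    {"customer": "c3", "item": "milk", "amount": 1.3},
    {"customer": "c3", "item": "eggs", "amount": 3.0},
]

def clean(records):
    """(a) Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """(b) Data integration: combine multiple sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """(c) Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """(d) Data transformation: aggregate items into one basket per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """(e) Data mining: count how many baskets contain each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(patterns, min_count):
    """(f) Pattern evaluation: keep patterns that meet a minimum count."""
    return {item: n for item, n in patterns.items() if n >= min_count}

def present(patterns):
    """(g) Knowledge presentation: render the patterns as readable lines."""
    return [f"{item} appears in {n} baskets" for item, n in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
baskets = transform(select(data, ["customer", "item"]))
report = present(evaluate(mine(baskets), min_count=2))
print("\n".join(report))
```

In this sketch each source is cleaned before the sources are integrated; a real system may
order or repeat these steps differently.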

Architecture of Data Mining System: The architecture of a typical data mining system is
provided in Figure 2.

Figure 2: Architecture of a typical data mining system

A detailed description of the parts of the data mining architecture is given below:


(a) Data Sources: Database, World Wide Web (WWW) and data warehouse are parts of data
sources. The data in these sources may be in the form of plain text, spreadsheets, or other
forms of media like photos or videos. WWW is one of the biggest sources of data.
(b) Database Server: The database server contains the actual data ready to be processed. It
performs the task of handling data retrieval as per the request of the user.
(c) Data Mining Engine: It is one of the core components of the data mining architecture that
performs all kinds of data mining techniques like association, classification,
characterization, clustering, prediction, etc.
(d) Pattern Evaluation Modules: They are responsible for finding interesting patterns in the
data and sometimes they also interact with the database servers for producing the result of
the user requests.
(e) Graphical User Interface: Since the user cannot be expected to fully understand the
complexity of the data mining process, the graphical user interface helps the user to
communicate effectively with the data mining system.
(f) Knowledge Base: The knowledge base is an important part of the data mining architecture
that guides the search for result patterns. The data mining engine may also get inputs
from the knowledge base. The knowledge base may contain data from user experiences. Its
objective is to make the results more accurate and reliable.

What Kinds of data can be mined?

• Relational Databases − A database system, also called a database management system,
includes a set of interrelated data, called a database, and a set of software programs to
handle and access the data. In a relational database, the data are stored in tables.
• Data warehouse – A data warehouse is a repository of information collected from multiple
sources, stored under a unified schema, and usually residing at a single site. Data
warehouses are constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing.
• Transactional Databases − A transactional database consists of a file in which each record
represents a transaction. A transaction generally contains a unique transaction identity
number (trans ID) and a list of the items making up the transaction (such as items
purchased in a store).
• Object-Relational Databases − Object-relational databases are built on an
object-relational data model. This model extends the relational model by supporting
rich data types for managing complex objects and object orientation.
• Temporal Database − A temporal database generally stores relational data that contains
time-related attributes. These attributes can include multiple timestamps, each having
different semantics.
• Sequence Database − A sequence database stores sequences of ordered events, with or
without a concrete notion of time. Examples include customer shopping sequences, Web
click streams, and biological sequences.
• Time-Series Database − A time-series database stores sequences of values or events
obtained over repeated measurements of time (e.g., hourly, daily, weekly). Examples
include data collected from the stock exchange, inventory control, and the measurement of
natural phenomena (like temperature and wind).
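
As a rough illustration of how three of these kinds of data differ in shape, the sketch below
represents a transactional database, a sequence, and a time series with plain Python
structures. All identifiers and values (trans_id, T100, the temperatures) are made up for the
example.

```python
# Transactional database: each record has a transaction ID and a list of items.
transactions = [
    {"trans_id": "T100", "items": ["bread", "milk"]},
    {"trans_id": "T200", "items": ["bread", "butter", "jam"]},
]

# Sequence database: ordered events, here without explicit timestamps.
click_sequence = ["home", "search", "product", "cart"]

# Time-series database: values measured at regular intervals (hourly here).
hourly_temperature = [
    ("2024-01-01T00:00", 4.1),
    ("2024-01-01T01:00", 3.8),
    ("2024-01-01T02:00", 3.5),
]

# Look up the items bought in transaction T200.
t200_items = next(t["items"] for t in transactions if t["trans_id"] == "T200")
print(t200_items)

# Change between consecutive temperature readings.
changes = [
    round(later[1] - earlier[1], 1)
    for earlier, later in zip(hourly_temperature, hourly_temperature[1:])
]
print(changes)
```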

Data Mining Functionalities: Data mining functionalities specify the kinds of patterns to be
found in data mining tasks. The different data mining functionalities are:
(a) Data characterization – It is a summarization of the general characteristics of an object
class of data. The data corresponding to the user-specified class is generally collected by a
database query. The output of data characterization can be presented in multiple forms.
(b) Data discrimination – It is a comparison of the general characteristics of target class data
objects with the general characteristics of objects from one or a set of contrasting classes.
The target and contrasting classes can be specified by the user, and the corresponding data
objects retrieved through database queries.
(c) Association Analysis − It analyses the set of items that generally occur together in a
transactional dataset.
(d) Classification − Classification is the process of finding a model that describes and
distinguishes data classes or concepts, so that the model can be used to predict the class
of objects whose class label is unknown. The derived model is based on the analysis of a
set of training data (i.e., data objects whose class label is known).
(e) Prediction − It is used to predict unavailable data values or upcoming trends. A value
for an object can be predicted based on the attribute values of the object and the attribute
values of the classes. The prediction can be of missing numerical values or of
increase/decrease trends in time-related information.
(f) Clustering − In cluster analysis, similar data are grouped (clustered) together without
known class labels. Clustering algorithms split the data into groups based on similarity,
so that objects within a group are more similar to each other than to objects in other groups.
(g) Outlier analysis − Analysing outliers helps in understanding data quality. An outlier is a
data anomaly. A greater number of outliers in a data set implies a lower quality of the data.
Therefore, using a data set with many outliers is not a wise option for finding patterns in
the data or drawing any conclusions. This analysis technique is helpful when the algorithm
fails to classify data and we encounter data objects whose attributes do not match any
class or general model.
(h) Evolution analysis − It describes the trends of objects whose behaviour changes over
time.
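
To make the clustering functionality concrete, here is a minimal sketch of one common
clustering method, k-means, restricted to one-dimensional data. The choice of k-means, the
function name kmeans_1d, the customer ages, and the starting centroids are all assumptions
for illustration; the notes do not prescribe a particular algorithm.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Cluster numbers around the given starting centroids (1-D k-means)."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no change, so the clustering is stable
            break
        centroids = new_centroids
    return clusters, centroids

# Hypothetical customer ages with no class labels attached.
ages = [21, 23, 22, 47, 51, 49]
clusters, centroids = kmeans_1d(ages, centroids=[21, 51])
print(clusters)   # groups of similar ages
print(centroids)  # the mean age of each group
```

No class labels are supplied; the groups emerge only from how close the values are to one
another.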

Which Technologies are used in data mining?


As a highly application-driven domain, data mining has incorporated many techniques from
other domains such as statistics, machine learning, pattern recognition, database and data
warehouse systems, information retrieval, visualization, algorithms, high performance
computing, and many application domains (Figure 3).

Figure 3: Data mining adopts techniques from many domains

(a) Statistics: Statistics studies the collection, analysis, interpretation or explanation, and
presentation of data. Data mining has an inherent connection with statistics. A statistical
model is a set of mathematical functions that describe the behaviour of the objects in a
target class in terms of random variables and their associated probability distributions.
(b) Machine Learning: Machine learning investigates how computers can learn (or improve
their performance) based on data. A main research area is for computer programs to
automatically learn to recognize complex patterns and make intelligent decisions based on
data.
(c) Database Systems and Data Warehouses: Many data mining tasks need to handle large
data sets or even real-time, fast streaming data. Therefore, data mining can make good use
of scalable database technologies to achieve high efficiency and scalability on large data
sets. Moreover, data mining tasks can be used to extend the capability of existing database
systems to satisfy advanced users’ sophisticated data analysis requirements.
(d) Information Retrieval: Information retrieval (IR) is the science of searching for
documents or information in documents.

Major Issues in Data Mining: Data mining is a dynamic and fast-expanding field with great
strengths. However, there are some major issues.
(a) Mining Methodology: Researchers have been vigorously developing new data mining
methodologies. This involves the investigation of new kinds of knowledge, mining in
multidimensional space, integrating methods from other disciplines, and the consideration
of semantic ties among data objects. In addition, mining methodologies should consider
issues such as data uncertainty, noise, and incompleteness. Some mining methods explore
how user specified measures can be used to assess the interestingness of discovered patterns
as well as guide the discovery process.
(b) User Interaction: The user plays an important role in the data mining process. Interesting
areas of research include how to interact with a data mining system, how to incorporate a
user’s background knowledge in mining, and how to visualize and comprehend data mining
results.
(c) Efficiency and Scalability: Efficiency and scalability are always considered when
comparing data mining algorithms. As data amounts continue to multiply, these two factors
are especially critical.
(d) Diversity of Database Types: The wide diversity of database types brings about challenges
to data mining. These include:
• Handling complex types of data
• Mining dynamic, networked, and global data repositories
(e) Data Mining and Society: How does data mining impact society? What steps can data
mining take to preserve the privacy of individuals? Do we use data mining in our daily lives
without even knowing that we do? These questions raise the following issues:
• Social impacts of data mining
• Privacy-preserving data mining
• Invisible data mining

Difference between Data Warehouse and Data Mart


No. | Data Warehouse | Data Mart
1. | A data warehouse is a centralised system. | A data mart is a decentralised system.
2. | Light denormalization takes place. | High denormalization takes place.
3. | It is a top-down model. | It is a bottom-up model.
4. | It is difficult to build. | It is easy to build.
5. | Fact constellation schema is used. | Star and snowflake schemas are used.
6. | The design process for creating schemas and views is complicated. | The design process for creating schemas and views is easy.
7. | It is flexible. | It is not flexible.
8. | It is data-oriented in nature. | It is project-oriented in nature.
9. | It has a long life. | It has a shorter life than a warehouse.
10. | Data are held in detailed form. | Data are held in summarized form.
11. | It is vast in size. | It is smaller than a warehouse.
12. | It collects data from various data sources. | It generally stores data from a data warehouse.
13. | Processing takes a long time because of the large amount of data. | Processing takes less time because only a small amount of data is handled.

Data Mining Applications: A large number of applications use data mining concepts.
Some of them are depicted in Figure 4.
(a) Scientific Analysis: Scientific simulations generate large volumes of data every day. This
includes data collected from nuclear laboratories, data about human psychology, etc. Data
mining techniques are capable of analysing these data. We can now capture and store
new data faster than we can analyse the old data already accumulated. Examples of
scientific analysis are:
• Sequence analysis in bioinformatics
• Classification of astronomical objects
• Medical decision support

Figure 4: Data Mining Applications

(b) Intrusion Detection: A network intrusion refers to any unauthorized activity on a digital
network. Network intrusions often involve stealing valuable network resources. Data
mining techniques play a vital role in detecting intrusions, network attacks, and
anomalies. These techniques help in selecting and refining useful and relevant information
from large data sets, and in classifying relevant data for an intrusion detection system.
An intrusion detection system generates alarms about foreign invasions found in the
network traffic. Examples are:
• Detect security violations
• Misuse Detection
• Anomaly Detection
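
As a simple illustration of anomaly detection, the sketch below flags traffic measurements
that lie far from the average, using a z-score. The z-score rule, the threshold of 2.0, the
function name find_anomalies, and the request counts are assumptions chosen for the example;
real intrusion detection systems use far richer features and models.

```python
from statistics import mean, pstdev

def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical number of requests seen per minute on a network link.
requests_per_minute = [10, 12, 11, 13, 12, 95]
alarms = find_anomalies(requests_per_minute)
print(alarms)  # the burst of 95 requests is the only value flagged
```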

(c) Business Transactions: Every transaction in the business industry is recorded for
perpetuity. Such transactions are usually time-related and can be inter-business deals or
intra-business operations. Using these data effectively and within a reasonable time frame
for competitive decision-making is the most important problem to solve for businesses
that struggle to survive in a highly competitive world. Data mining helps to analyse these
business transactions and to support marketing approaches and decision-making.
Examples are:
• Direct mail targeting
• Stock trading
• Customer segmentation
• Churn prediction (Churn prediction is one of the most popular Big Data use cases
in business)

(d) Market Basket Analysis: Market basket analysis is a technique for carefully studying the
purchases made by customers in a supermarket. It identifies the patterns of items that
customers frequently purchase together. Companies can use this analysis to promote deals,
offers, and sales, and data mining techniques help to carry out the analysis.
Examples are:
• Data mining concepts are used in sales and marketing to provide better customer
service, to improve cross-selling opportunities, and to increase direct mail response
rates.
• Customer retention, in the form of pattern identification and prediction of likely
defections, is possible with data mining.
• Risk assessment and fraud detection also use data mining concepts to identify
inappropriate or unusual behaviour.
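
The idea of finding items that are frequently purchased together can be sketched with two
standard association measures, support and confidence. The baskets below are invented, and
the helper names support and confidence are assumptions for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets, one set of items per customer visit.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    # Sort so that each pair is always stored in the same order.
    pair_counts.update(combinations(sorted(basket), 2))

def support(item_a, item_b):
    """Fraction of all baskets that contain both items."""
    pair = tuple(sorted((item_a, item_b)))
    return pair_counts[pair] / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing antecedent, the fraction also containing consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]

print(support("bread", "milk"))       # 2 of 4 baskets contain both
print(confidence("butter", "bread"))  # every butter basket also has bread
print(confidence("bread", "milk"))    # 2 of the 3 bread baskets have milk
```

A rule such as "butter implies bread" with high confidence is the kind of pattern that could
inform a bundled offer.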

(e) Education: For analysing the education sector, data mining uses the Educational Data
Mining (EDM) method. This method generates patterns that can be used by both learners
and educators. Using EDM we can perform educational tasks such as:
• Predicting students’ admission to higher education
• Student profiling
• Predicting student performance
• Evaluating teachers’ teaching performance
• Curriculum development
• Predicting student placement opportunities
(f) Research: Data mining techniques can perform prediction, classification, clustering,
association, and grouping of data in research. Rules generated by data mining are used
to derive the results. In most technical research in data mining, we create a training
model and a testing model. The training/testing approach is a strategy for measuring the
accuracy of the proposed model. It is called Train/Test because we split the data set into
two sets: a training data set and a testing data set. The training data set is used to build
the model, whereas the testing data set is used to test it. Examples are:
• Classification of uncertain data.
• Information-based clustering.
• Decision support system
• Web Mining
• Domain-driven data mining
• IoT (Internet of Things) and Cybersecurity
• Smart farming IoT (Internet of Things)
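
The Train/Test strategy described above can be sketched as follows. The 80/20 split, the
fixed seed, the one-nearest-neighbour rule used as the model, and the study-hours data are
all assumptions made for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into a training and a testing set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def predict_1nn(train, x):
    """Predict the label of x from its nearest training example."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

# Hypothetical data: (hours studied, outcome).
data = [(1, "fail"), (2, "fail"), (3, "fail"), (4, "fail"), (5, "fail"),
        (11, "pass"), (12, "pass"), (13, "pass"), (14, "pass"), (15, "pass")]

train, test = train_test_split(data)

# The model is built from the training set only and judged on the unseen testing set.
correct = sum(predict_1nn(train, x) == label for x, label in test)
accuracy = correct / len(test)
print(f"train={len(train)} test={len(test)} accuracy={accuracy:.2f}")
```

Keeping the testing rows out of model building is what makes the measured accuracy an
estimate of performance on new data.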

(g) Healthcare and Insurance: A pharmaceutical company can examine its recent sales force
activity and its outcomes to improve the targeting of high-value physicians and to figure
out which marketing activities will have the best effect in the upcoming months. In the
insurance sector, data mining can help to predict which customers will buy new policies,
identify behaviour patterns of risky customers, and identify fraudulent behaviour of
customers. Examples are:
• Claims analysis, i.e. which medical procedures are claimed together
• Identify successful medical therapies for different illnesses
• Characterize patient behaviour to predict office visits

(h) Transportation: A diversified transportation company with a large direct sales force can
apply data mining to identify the best prospects for its services. A large consumer
merchandise organization can apply data mining to improve its sales process to
retailers. Examples are:
• Determine the distribution schedules among outlets
• Analyse loading patterns

(i) Financial/Banking Sector: A credit card company can leverage its vast warehouse of
customer transaction data to identify customers most likely to be interested in a new credit
product. Examples are:
• Credit card fraud detection
• Identify ‘Loyal’ customers
• Extraction of information related to customers
• Determine credit card spending by customer groups

Comparison between Classification and Regression


No. | Classification | Regression
1. | The target variables are discrete. | The target variables are continuous.
2. | Problems like spam email classification and disease prediction are solved using classification algorithms. | Problems like house price prediction and rainfall prediction are solved using regression algorithms.
3. | We try to find the best possible decision boundary, one that separates the classes with the maximum possible separation. | We try to find the best-fit line, one that represents the overall trend in the data.
4. | Evaluation metrics like Accuracy, Precision, Recall, and F1-Score are used to evaluate the performance of classification algorithms. | Evaluation metrics like Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), and R2-Score are used to evaluate the performance of regression algorithms.
5. | The problems are binary classification or multi-class classification problems. | The models are linear regression models or non-linear models.
6. | The data consist of independent variables and a categorical dependent variable. | The data consist of independent variables and a continuous dependent variable.
7. | The algorithm’s task is to map the input value (x) to a discrete output variable (y). | The algorithm’s task is to map the input value (x) to a continuous output variable (y).
8. | The output is categorical labels. | The output is continuous numerical values.
9. | The objective is to predict categorical/class labels. | The objective is to predict continuous numerical values.
10. | Examples of classification algorithms are: Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), K-Nearest Neighbours (KNN), Naive Bayes, Neural Networks, Multi-layer Perceptron (MLP), etc. | Examples of regression algorithms are: Linear Regression, Polynomial Regression, Ridge Regression, Lasso Regression, Support Vector Regression (SVR), Decision Trees for Regression, Random Forest Regression, K-Nearest Neighbours (KNN) Regression, Neural Networks for Regression, etc.
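
The difference in evaluation between the two kinds of task can be sketched as follows. The
classification part computes Accuracy, Precision, Recall, and F1-Score for discrete labels;
the regression part fits a least-squares best-fit line and computes the Mean Squared Error.
The spam/ham labels, the house sizes and prices, and the function names are assumptions for
the example.

```python
def classification_metrics(y_true, y_pred, positive):
    """Return (accuracy, precision, recall, f1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

def fit_line(xs, ys):
    """Least-squares best-fit line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Classification: discrete labels (hypothetical spam filter output).
actual = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "spam", "spam"]
accuracy, precision, recall, f1 = classification_metrics(actual, predicted, "spam")
print(accuracy, precision, recall, f1)

# Regression: continuous target (hypothetical house size vs. price).
sizes = [1, 2, 3, 4]
prices = [3, 5, 7, 9]
slope, intercept = fit_line(sizes, prices)
fitted = [slope * x + intercept for x in sizes]
mse = mean_squared_error(prices, fitted)
print(slope, intercept, mse)
```

The classification metrics count right and wrong labels, while the regression metric
measures how far the predicted numbers are from the actual ones.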
