DM Chapter 1
Introduction
Data Mining: Data mining is the process of extracting useful information from large sets of
data. It involves using various techniques from statistics, machine learning, and database
systems to identify patterns, relationships, and trends in the data. This information can then be
used to make data-driven decisions, solve business problems, and uncover hidden insights.
KDD Process in Data Mining: Data mining is also referred to as knowledge discovery from data
(KDD). The knowledge discovery process is depicted in Figure 1.
Data Mining Functionalities: Data mining functionalities represent the patterns that need to
be found in data mining activities. Different data mining functionalities are:
(a) Data characterization – It is a summarization of the general characteristics of a target
class of data. The data corresponding to the user-specified class is generally collected by a
database query. The output of data characterization can be presented in multiple forms,
such as charts, tables, or characteristic rules.
(b) Data discrimination – It is a comparison of the general characteristics of target class data
objects with the general characteristics of objects from one or a set of contrasting classes.
The target and contrasting classes can be specified by the user, and the corresponding data
objects are fetched through database queries.
(c) Association Analysis − It analyses sets of items that frequently occur together in a
transactional dataset.
(d) Classification − Classification is the procedure of discovering a model that describes and
distinguishes data classes or concepts, so that the model can be used to predict the class
of objects whose class label is unknown. The derived model is based on the analysis of a
set of training data (i.e., data objects whose class label is known).
(e) Prediction − It is used to predict unavailable data values or pending trends. A value for an
object can be predicted from the attribute values of the object and the attribute values of
the classes. It can be a prediction of missing numerical values or of increase/decrease
trends in time-related information.
(f) Clustering − In cluster analysis, similar data objects are grouped (clustered) together
without using known class labels. Clustering algorithms split the data into groups based on
similarity, so that objects within a group are more similar to each other than to objects in
other groups.
(g) Outlier analysis − Analysing outliers helps in understanding data quality. An outlier is a
data anomaly. A greater number of outliers in a data set implies a lower quality of the data.
Therefore, a data set with many outliers is not a wise choice for finding patterns in the
data or drawing conclusions. This analysis technique is helpful when a mining algorithm
fails to classify some of the data and we encounter data objects whose attributes do not
match any other class or general model.
(h) Evolution analysis − It describes the trends for objects whose behaviour changes over
time.
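To make one of the functionalities above concrete, the clustering idea in item (f) can be
sketched with a very small k-means loop. This is only an illustration under stated
assumptions: the ages, the two starting centres, and the function name kmeans_1d are made up
for the example and do not come from any particular data mining tool.

```python
# Minimal 1-D k-means sketch. The data and starting centres are invented for illustration.

def kmeans_1d(values, centres, iterations=10):
    """Assign each value to its nearest centre, then move each centre to its group's mean."""
    groups = [[] for _ in centres]
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for v in values:
            # Index of the centre closest to this value.
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # Recompute each centre; keep the old one if its group ended up empty.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups


ages = [18, 21, 22, 45, 47, 50]          # hypothetical customer ages
centres, groups = kmeans_1d(ages, centres=[20.0, 40.0])
print(groups)   # values grouped by similarity, no class labels used
print(centres)  # one representative value per group
```

No class labels are supplied anywhere in the sketch; the groups emerge only from how close the
values are to each other, which is what separates clustering from classification.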
Technologies Used in Data Mining: Data mining draws on techniques from several other
disciplines, including:
(a) Statistics: Statistics studies the collection, analysis, interpretation or explanation, and
presentation of data. Data mining has an inherent connection with statistics. A statistical
model is a set of mathematical functions that describe the behaviour of the objects in a
target class in terms of random variables and their associated probability distributions.
(b) Machine Learning: Machine learning investigates how computers can learn (or improve
their performance) based on data. A main research area is for computer programs to
automatically learn to recognize complex patterns and make intelligent decisions based on
data.
(c) Database Systems and Data Warehouses: Many data mining tasks need to handle large
data sets or even real-time, fast streaming data. Therefore, data mining can make good use
of scalable database technologies to achieve high efficiency and scalability on large data
sets. Moreover, data mining tasks can be used to extend the capability of existing database
systems to satisfy advanced users’ sophisticated data analysis requirements.
(d) Information Retrieval: Information retrieval (IR) is the science of searching for
documents or information in documents.
Major Issues in Data Mining: Data mining is a dynamic and fast-expanding field with great
strengths. However, there are some major issues.
(a) Mining Methodology: Researchers have been vigorously developing new data mining
methodologies. This involves the investigation of new kinds of knowledge, mining in
multidimensional space, integrating methods from other disciplines, and the consideration
of semantic ties among data objects. In addition, mining methodologies should consider
issues such as data uncertainty, noise, and incompleteness. Some mining methods explore
how user-specified measures can be used to assess the interestingness of discovered patterns
as well as guide the discovery process.
(b) User Interaction: The user plays an important role in the data mining process. Interesting
areas of research include how to interact with a data mining system, how to incorporate a
user’s background knowledge in mining, and how to visualize and comprehend data mining
results.
(c) Efficiency and Scalability: Efficiency and scalability are always considered when
comparing data mining algorithms. As data amounts continue to multiply, these two factors
are especially critical.
(d) Diversity of Database Types: The wide diversity of database types brings about challenges
to data mining. These include:
• Handling complex types of data
• Mining dynamic, networked, and global data repositories
(e) Data Mining and Society: How does data mining impact society? What steps can data
mining take to preserve the privacy of individuals? Do we use data mining in our daily lives
without even knowing that we do? These questions raise the following issues:
• Social impacts of data mining
• Privacy-preserving data mining
• Invisible data mining
Data Mining Applications: A large number of applications use data mining concepts.
Some of them are depicted in Figure 4.
(a) Scientific Analysis: Scientific simulations generate large volumes of data every day. This
includes data collected from nuclear laboratories, data about human psychology, etc. Data
mining techniques are capable of analysing these data. We can now capture and store
new data faster than we can analyse the old data already accumulated. Examples of
scientific analysis are:
• Sequence analysis in bioinformatics
• Classification of astronomical objects
• Medical decision support
(b) Intrusion Detection: A network intrusion refers to any unauthorized activity on a digital
network. Network intrusions often involve stealing valuable network resources. Data
mining techniques play a vital role in detecting intrusions, network attacks, and
anomalies. These techniques help in selecting and refining useful and relevant information
from large data sets, and in classifying relevant data for an intrusion detection
system. The intrusion detection system raises alarms when network traffic indicates a
foreign invasion of the system. Examples are:
• Detect security violations
• Misuse Detection
• Anomaly Detection
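As a rough illustration of anomaly detection on network traffic, the sketch below flags
per-minute request counts that lie far from the average, using a simple z-score. The traffic
numbers, the function name flag_anomalies, and the threshold of 2.0 are assumptions chosen for
the example; a real intrusion detection system would use far richer features and models.

```python
from statistics import mean, pstdev


def flag_anomalies(counts, threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds the threshold."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [(i, x) for i, x in enumerate(counts) if abs(x - mu) / sigma > threshold]


# Hypothetical requests per minute; the last value imitates a traffic burst.
requests_per_minute = [52, 48, 50, 51, 49, 50, 53, 47, 400]
anomalies = flag_anomalies(requests_per_minute)
print(anomalies)
```

The threshold is deliberately low here because a single extreme value in a short sample also
inflates the standard deviation, which limits how large its z-score can become.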
(c) Business Transactions: Every transaction in the business industry is recorded and kept for
perpetuity. Such transactions are usually time-related and can be inter-business deals or
intra-business operations. Using this data effectively and within a reasonable time frame
for competitive decision-making is the most important problem to solve for
businesses that struggle to survive in a highly competitive world. Data mining helps to
analyse these business transactions, identify marketing approaches, and support
decision-making. Examples are:
• Direct mail targeting
• Stock trading
• Customer segmentation
• Churn prediction (Churn prediction is one of the most popular Big Data use cases
in business)
(d) Market Basket Analysis: Market basket analysis is a technique for the careful study of
the purchases made by customers in a supermarket. It identifies patterns of items that
customers frequently purchase together. Companies can use this analysis to promote deals,
offers, and sales, and data mining techniques help to carry out the analysis.
Examples are:
• Sales and marketing use data mining concepts to provide better customer
service, improve cross-selling opportunities, and increase direct mail response
rates.
• Customer retention, in the form of pattern identification and prediction of likely
defections, is possible with data mining.
• Risk assessment and fraud detection also use data mining concepts to identify
inappropriate or unusual behaviour.
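The idea of finding items that are frequently purchased together can be sketched by counting
how often item pairs co-occur and deriving two common measures, support and confidence. The
baskets below and the helper names support and confidence are hypothetical and exist only for
this example; production systems typically use dedicated algorithms such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    # Sort so that each pair is always stored in the same order.
    pair_counts.update(combinations(sorted(basket), 2))


def support(pair):
    """Fraction of all transactions that contain both items of the pair."""
    return pair_counts[tuple(sorted(pair))] / len(transactions)


def confidence(antecedent, consequent):
    """Fraction of transactions with the antecedent that also contain the consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]


print(support(("bread", "milk")))      # how common the pair is overall
print(confidence("butter", "bread"))   # how often butter buyers also buy bread
```

A rule such as "butter implies bread" is usually considered interesting only when both its
support and its confidence exceed thresholds chosen by the analyst.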
(e) Education: For analysing the education sector, data mining uses the Educational Data
Mining (EDM) method. This method generates patterns that can be used by both learners and
educators. Using EDM, we can perform educational tasks such as:
• Predicting students’ admission in higher education
• Student profiling
• Predicting student performance
• Evaluating teachers’ teaching performance
• Curriculum development
• Predicting student placement opportunities
(f) Research: Data mining techniques can support prediction, classification, clustering,
association, and grouping of data in the research area. The rules that data mining
generates are specific to the results being sought. In most technical research in data
mining, we create a training model and a testing model. The train/test approach is a strategy
to measure the accuracy of the proposed model. It is called train/test because we split the
data set into two sets: a training data set and a testing data set. The training data set is
used to build the model, whereas the testing data set is used to evaluate it. Examples
are:
• Classification of uncertain data
• Information-based clustering
• Decision support system
• Web Mining
• Domain-driven data mining
• IoT (Internet of Things) and Cybersecurity
• Smart farming IoT (Internet of Things)
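The train/test strategy described in item (f) can be sketched as follows. The split fraction,
the seed, the toy readings, and the one-nearest-neighbour classifier are all assumptions made
for illustration; they stand in for whatever model a given study actually proposes.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (training set, testing set)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def predict_1nn(train, x):
    """Predict the label of x as the label of the closest training example."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]


# Hypothetical labelled data: (sensor reading, class label).
data = [
    (1.0, "low"), (1.5, "low"), (2.0, "low"), (2.5, "low"),
    (8.0, "high"), (8.5, "high"), (9.0, "high"), (9.5, "high"),
]

train, test = train_test_split(data)

# The model only ever sees the training set; the testing set measures how well it generalizes.
correct = sum(predict_1nn(train, x) == label for x, label in test)
accuracy = correct / len(test)
print(len(train), len(test), accuracy)
```

Keeping the testing data out of model building is the point of the split: accuracy measured on
data the model has already seen would overstate how well it performs on new data.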
(g) Healthcare and Insurance: A pharmaceutical company can examine its recent sales force
activity and its outcomes to improve the targeting of high-value physicians and determine
which marketing activities will have the greatest effect in the coming months.
In the insurance sector, data mining can help to predict which customers will buy
new policies, identify behaviour patterns of risky customers, and identify fraudulent
behaviour of customers. Examples are:
• Claims analysis, i.e. which medical procedures are claimed together
• Identify successful medical therapies for different illnesses
• Characterize patient behaviour to predict office visits
(h) Transportation: A diversified transportation company with a large direct sales force can
apply data mining to identify the best prospects for its services. A large consumer
merchandise organization can apply data mining to improve its business cycle with
retailers. Examples are:
• Determine the distribution schedules among outlets
• Analyse loading patterns
(i) Financial/Banking Sector: A credit card company can leverage its vast warehouse of
customer transaction data to identify customers most likely to be interested in a new credit
product. Examples are:
• Credit card fraud detection
• Identify ‘Loyal’ customers
• Extraction of information related to customers
• Determine credit card spending by customer groups