
DMV Module1

The document outlines a course on Data Mining and Visualization, focusing on foundational concepts, preprocessing techniques, and the KDD process. It discusses the importance of data mining in various industries, including retail, finance, healthcare, and education, highlighting its applications and functionalities. The document also details the steps involved in the KDD process, emphasizing the transformation of raw data into actionable knowledge through data mining techniques.

Data Mining and Visualization

(MMCA311A)

Course Instructor
Mrs. B. Sathyabama
Assistant Professor
Department of Master of Computer Applications
R V Institute of Technology and Management
Bangalore - 560076
Module - 1
Foundations of Data Mining and Data Preprocessing
Introduction to Data Mining & Preprocessing Techniques: Introduction to data mining:
Motivation, architecture, KDD process. Types of data: Structured, semi-structured,
unstructured.

Data preprocessing: Cleaning, integration, reduction, transformation, Missing
Values and Noisy Data. Data summarization and visualization techniques for
preprocessing analysis. Implementation using Python: Pandas, NumPy for basic
preprocessing.
Data:
● Data means raw facts, figures, or information.
● Example: numbers, names, dates, marks in an exam, temperature readings, etc.
● By itself, data may not have much meaning until we organize or analyze it.

Mining:
● Mining means digging to find something valuable (like coal or gold in the ground).
● In computers, mining means searching through a large amount of data to find useful patterns
or knowledge.
● Example: A shopping website checks past purchase data to find out which products are often
bought together.
● So, Data Mining = extracting useful information (knowledge) from raw data, just like mining
gold from rocks.
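What is data mining?
● Extraction of interesting (non-trivial, implicit, previously unknown and potentially
useful) patterns or knowledge from huge amounts of data.
● Is "data mining" a misnomer? Arguably yes: what is mined is knowledge, not the
data itself, just as digging gold out of rock is called gold mining, not rock mining.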
Alternative names
● Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.

● Data mining is the process of extracting meaningful patterns and insights from
large datasets using statistical, computational, and machine learning techniques.
Data Mining — On what kind of data?
▢ Relational Databases:
• A database system, also called a database management system (DBMS),
consists of a collection of interrelated data, known as a database, and a
set of software programs to manage and access the data.
• E.g.: SQL Server, Oracle, etc.

▢ Data Warehouses:
● A data warehouse is a repository of information collected
from multiple sources.
● It is constructed through pre-processing steps: data cleaning,
data integration, data transformation, data loading, and
periodic data refreshing.
● E.g.: data held by stock markets and by retail chains such as D-Mart and Big Bazaar.
▢ Transactional Databases:
● A transactional database consists of a file where each record represents a
transaction.
● A transaction typically includes a unique transaction identity number (TID)
and a list of the items making up the transaction (such as items purchased
in a store).
● E.g.: online shopping orders on Flipkart, Amazon, etc.
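A transactional record of this kind can be sketched in Python as a mapping from a TID to the items in that transaction. This is only an illustration: the TIDs and item names below are invented, and a real system would read them from a database rather than a literal.

```python
from collections import Counter

# Hypothetical transactions: each TID maps to the items purchased together.
transactions = {
    "T100": ["bread", "butter", "milk"],
    "T200": ["bread", "butter"],
    "T300": ["milk", "noodles"],
    "T400": ["bread", "noodles", "ketchup"],
}

# Count in how many transactions each item appears
# (set() guards against an item listed twice in one transaction).
item_counts = Counter(
    item for items in transactions.values() for item in set(items)
)

for item, count in item_counts.most_common():
    print(f"{item}: appears in {count} of {len(transactions)} transactions")
```

Counts like these are the raw material for finding which products are often bought together.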
▢ Other Data/Databases
● Spatial data (maps or location-related data)
● Engineering design data (designs of buildings, offices, and other structures)
● Hypertext and multimedia data (including text, image, video and audio
data), and the World Wide Web (WWW, a huge, widely distributed information
repository made available on the Internet).
Motivation: Why Data Mining?
● “Necessity is the Mother of all Inventions”
● It has been estimated that the amount of information in the world doubles
every 10 months.
● There is a tremendous increase in the amount of data recorded and stored on
digital media as well as by individual sources.
● Since the 1960s, database and information technology has evolved
systematically from primitive file processing systems to powerful database
systems.
● Research and development in database systems since the 1970s has led
to the development of relational database systems.
Motivation for Data Mining: An Example
Netflix collects user ratings of movies (data) → What types of movies you will
like (knowledge) → Recommend new movies to you (action) → Users stay with
Netflix (goal)

Gene sequences of cancer patients (data) → Which genes lead to cancer?
(knowledge) → Appropriate treatment (action) → Save life (goal)

Road traffic (data) → Which road is likely to be congested? (knowledge) →
Suggest better routes to drivers (action) → Save time and energy (goal)

Summary:
The overall goal of the data mining process is to extract useful information from
large datasets and convert it into actionable knowledge.

Data Mining is about taking big data → learning something useful → taking
action → achieving a goal.
Data Mining Functionalities
● Data mining functionalities can be classified into two categories:
• Descriptive
• Predictive
Descriptive
● These tasks present the general properties of data stored in a database.
● Descriptive tasks are used to find patterns in data.
● E.g.: clusters, trends, etc.
Predictive
● These tasks predict the value of one attribute on the basis of the values of
other attributes.
● E.g.: predicting festival-season customer or product sales at a store.
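The difference between the two categories can be illustrated with a small Python sketch. The sales figures are invented, and a straight-line least-squares fit is only one simple choice of predictive model, assumed here for brevity.

```python
from statistics import mean

# Hypothetical festival-season sales (units sold) for five past years.
years = [1, 2, 3, 4, 5]
sales = [100, 120, 140, 160, 180]

# Descriptive: summarize general properties of the stored data.
summary = {"mean": mean(sales), "min": min(sales), "max": max(sales)}

# Predictive: fit a least-squares line sales = slope * year + intercept,
# then estimate the value for a year that has not been observed yet.
x_bar, y_bar = mean(years), mean(sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, sales)) / sum(
    (x - x_bar) ** 2 for x in years
)
intercept = y_bar - slope * x_bar
predicted_year_6 = slope * 6 + intercept

print(summary)
print(predicted_year_6)
```

The summary describes what is already in the data; the fitted line estimates a value that is not.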
Potential Applications
● Data mining is widely used across industries to uncover hidden patterns, predict trends, and
improve decision-making. Here are some real-world examples of data mining in action:
1. Retail and E-Commerce: Personalized Recommendations
● E-commerce giants like Amazon, Flipkart, and Myntra use data mining to analyze customer
shopping behavior. By studying past purchases, browsing history, and product preferences, they
generate personalized recommendations.
● Example: If a customer frequently buys skincare products, Amazon may suggest related items
like serums or sunscreen. This technique, known as association rule mining, helps increase
sales and improve customer experience.
2. Banking and Finance: Fraud Detection
● Banks and financial institutions use data mining in the KDD process to detect fraudulent
transactions. By analyzing millions of transactions, they can identify unusual patterns that
indicate fraud.
● Example: If a customer who usually makes small purchases suddenly withdraws a large sum
from an unknown location, the bank’s fraud detection system flags the transaction for review.
Machine learning algorithms help in real-time fraud detection and risk assessment.
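One very simple way to flag such a transaction is a z-score check against the customer's own history, sketched below. The amounts, the function name `is_unusual`, and the threshold of 3 standard deviations are assumptions for illustration; production fraud systems use far richer features and models.

```python
from statistics import mean, stdev

# Hypothetical past withdrawal amounts for one customer.
history = [40, 55, 60, 45, 50, 52, 48]

def is_unusual(amount, past, threshold=3.0):
    """Flag an amount lying more than `threshold` sample standard
    deviations away from the mean of the customer's past amounts."""
    mu, sigma = mean(past), stdev(past)
    return abs(amount - mu) / sigma > threshold

print(is_unusual(58, history))    # a typical amount
print(is_unusual(5000, history))  # a sudden large withdrawal
```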
3. Healthcare: Disease Prediction and Diagnosis
● Data mining is transforming the healthcare sector by helping doctors predict
diseases based on patient records and medical history. Hospitals analyze patient
data to identify risk factors for conditions like diabetes, heart disease, and cancer.
● Example: A hospital analyzing patient data might find that individuals with high
blood pressure and obesity have a higher risk of heart disease. By identifying
patterns, doctors can recommend lifestyle changes and early treatments.
4. Digital Marketing: Targeted Advertising
● Companies like Google, Facebook, and Instagram use data mining to serve
personalized ads based on user interests and online behavior.
● Example: If a user frequently searches for fitness equipment, they will start seeing
ads for protein supplements, workout gear, and gym memberships. This is an
example of predictive analytics, where past data is used to anticipate user
preferences.
5. Telecommunications: Customer Churn Prediction
● Telecom companies analyze customer data to identify users who are likely to switch to
a competitor. They use clustering and classification techniques to predict churn and
offer customized retention strategies.
● Example: If a telecom company notices that a customer has reduced data usage and
called customer service multiple times with complaints, it may offer a discount or
special plan to retain them.

6. Manufacturing: Quality Control and Predictive Maintenance


● Manufacturing industries use data mining to monitor equipment performance and
predict failures before they occur.
● Example: A car manufacturing plant uses sensor data to detect anomalies in machine
operations. If a machine starts showing irregular temperature spikes, data mining tools
can predict a possible failure and schedule maintenance before a breakdown occurs.
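The temperature-spike idea can be sketched as a comparison of each new reading with the average of the readings just before it. The readings, the window of 5, and the tolerance of 10 degrees are all assumed values, and `find_spikes` is a hypothetical helper, not part of any monitoring library.

```python
# Hypothetical machine temperature readings in degrees Celsius, one per minute.
readings = [70, 71, 70, 72, 71, 70, 95, 71, 72, 96, 97]

def find_spikes(values, window=5, tolerance=10):
    """Return the indexes whose value exceeds the mean of the previous
    `window` readings by more than `tolerance`."""
    spikes = []
    for i in range(window, len(values)):
        baseline = sum(values[i - window:i]) / window
        if values[i] - baseline > tolerance:
            spikes.append(i)
    return spikes

spike_indexes = find_spikes(readings)
print(spike_indexes)
```

Repeated spikes close together, as at the end of this series, are the kind of signal that could trigger a maintenance check.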
7. Education: Student Performance Analysis
● Educational institutions use data mining to track student performance, identify
learning gaps, and improve teaching strategies.
● Example: Universities analyze student attendance, test scores, and assignment
submissions to predict academic performance. If a student shows declining
grades, early intervention programs can be introduced to help them improve.
8. Sports Analytics: Player Performance and Game Strategies
● Data mining is revolutionizing sports by helping teams analyze player
performance and optimize game strategies.
● Example: Cricket and football teams analyze player movements, past game
records, and opponent strategies to refine tactics. In the Indian Premier League
(IPL), data mining helps teams decide on the best batting order, bowler
selections, and game strategies based on past match data.
KDD Process: Several Key Steps
● KDD refers to the overall process of discovering useful knowledge from data.
It involves the evaluation and possibly interpretation of the patterns to make
the decision of what qualifies as knowledge.
● It also includes the choice of encoding schemes, preprocessing, sampling,
and projections of the data prior to the data mining step.
● Data mining refers to the application of algorithms for extracting patterns from
data without the additional steps of the KDD process.
● The objective of pre-processing is to remove noise and redundant data.
● There are mainly four types of pre-processing activities in the KDD process,
shown in the figure: data cleaning, data integration, data transformation, and
data reduction.
● Knowledge Discovery in Databases (KDD) refers to the complete process of uncovering valuable
knowledge from large datasets.
● It starts with the selection of relevant data, followed by preprocessing to clean and organize it,
transformation to prepare it for analysis, data mining to uncover patterns and relationships, and
concludes with the evaluation and interpretation of results, ultimately producing valuable knowledge or
insights.
● KDD is widely utilized in fields like machine learning, pattern recognition, statistics, artificial intelligence,
and data visualization.
● The KDD process is iterative, involving repeated refinements to ensure the accuracy and reliability of the
knowledge extracted. The whole process consists of the following steps:
1. Data Selection
2. Data Preprocessing (Cleaning and Integration)
3. Data Transformation
4. Data Mining (Pattern Discovery)
5. Pattern Evaluation and Knowledge Representation
1. Data Selection: Identifying Relevant Data Sources
Before beginning data analysis, it is important to select relevant data
from multiple sources.
Sources of Data for KDD
● Databases – Traditional relational databases (e.g., MySQL, Oracle).
● Data Warehouses – Large repositories integrating structured data.
● Web Data – Online browsing history, social media interactions.
● Sensor Data – IoT devices collecting real-time information.
Example: A bank analyzing customer transactions for fraud detection
selects transaction logs, account details, and customer
demographics for analysis.
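With Pandas, selection usually means keeping only the rows and columns relevant to the task. The sketch below assumes a small in-memory transaction log; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical transaction log; column names are assumptions for this sketch.
logs = pd.DataFrame({
    "txn_id": [1, 2, 3, 4],
    "account": ["A1", "A2", "A1", "A3"],
    "amount": [250.0, 40.0, 9000.0, 75.0],
    "channel": ["atm", "web", "atm", "pos"],
    "ui_theme": ["dark", "light", "dark", "light"],  # irrelevant to fraud analysis
})

# Keep only ATM rows, and only the columns needed for the analysis.
selected = logs.loc[logs["channel"] == "atm", ["txn_id", "account", "amount"]]
print(selected)
```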
2. Data Preprocessing: Cleaning and Integration
Raw data is often messy, with missing values, duplicates, and
inconsistencies. This step improves data quality and prepares it for further
processing.
Tasks in Data Preprocessing
• Handling Missing Data – Filling missing values with averages or
removing incomplete records.
• Removing Duplicates – Identifying and merging redundant entries.
• Data Integration – Combining data from different sources into a unified
dataset.
Example: A hospital analyzing patient records merges data from multiple
departments (cardiology, orthopedics, general medicine) to create a
unified patient database.
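The three tasks above map onto standard Pandas calls: `drop_duplicates`, `fillna`, and `merge`. The following is a minimal sketch on invented hospital records; the table and column names are assumptions, and filling with the column mean is just one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical records from two hospital departments.
cardiology = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "systolic_bp": [120.0, 120.0, np.nan, 150.0],
})
general = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age": [54, 61, 47],
})

# Removing duplicates: drop rows that exactly repeat an earlier row.
cardiology = cardiology.drop_duplicates()

# Handling missing data: fill missing readings with the column mean.
bp_mean = cardiology["systolic_bp"].mean()
cardiology["systolic_bp"] = cardiology["systolic_bp"].fillna(bp_mean)

# Data integration: combine both sources on the shared patient identifier.
patients = cardiology.merge(general, on="patient_id", how="inner")
print(patients)
```

Duplicates are removed before computing the mean so that the repeated row does not count twice.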
3. Data Transformation: Converting Data into a Usable Format
Raw data often needs transformation to standardize formats and
enhance analysis.
Common Data Transformation Techniques
• Normalization – Scaling numerical values to a common range.
• Feature Selection – Selecting the most relevant variables for analysis.
• Aggregation – Summarizing data (e.g., weekly sales instead of daily
sales).
Example: In stock market analysis, raw price data is transformed into
moving averages and volatility indexes to better understand trends.
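Normalization and a moving average can each be written in a line or two of Pandas. The prices below are invented, min-max scaling to [0, 1] is one common normalization among several, and the 3-day window is an arbitrary choice for the sketch.

```python
import pandas as pd

# Hypothetical daily closing prices.
prices = pd.Series([10.0, 12.0, 11.0, 13.0, 15.0, 14.0], name="close")

# Normalization: min-max scaling to the range [0, 1].
normalized = (prices - prices.min()) / (prices.max() - prices.min())

# Smoothing: 3-day moving average. The first two values are NaN
# because a full 3-day window is not yet available.
moving_avg = prices.rolling(window=3).mean()

print(normalized.tolist())
print(moving_avg.tolist())
```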
4. Data Mining: Discovering Patterns and Relationships
This is the core step where analytical techniques and machine learning
algorithms are applied to identify patterns.
Common Data Mining Techniques
• Classification – Assigning data to predefined categories (e.g., spam
vs. non-spam emails).
• Clustering – Grouping similar data points (e.g., customer
segmentation).
• Association Rule Learning – Identifying relationships between
items (e.g., “People who buy noodles often buy ketchup”).
Example: A supermarket uses data mining to find that customers
buying bread often buy butter, leading to product placement
strategies that increase sales.
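The bread-and-butter finding can be quantified with the two standard association-rule measures, support and confidence. The sketch below computes them directly on a handful of invented baskets; full algorithms such as Apriori search over all itemsets, which this example does not attempt.

```python
# Hypothetical market-basket transactions.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "noodles"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent):
    support(antecedent and consequent) / support(antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"bread", "butter"}, baskets)
rule_confidence = confidence({"bread"}, {"butter"}, baskets)
print(rule_support, rule_confidence)
```

Here bread and butter appear together in 3 of 5 baskets, and 3 of the 4 baskets containing bread also contain butter.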
5. Pattern Evaluation and Knowledge Representation
Once patterns are identified, the next step is to evaluate their
usefulness and present them in an understandable format.
Methods of Pattern Evaluation
• Visualizations – Graphs, heat maps, and dashboards.
• Reports – Summaries of key insights.
• Decision Rules – Defining actionable strategies based on findings.
Example: A digital marketing agency uses customer purchase
behavior data to create audience segments and optimize ad
targeting.
A Typical DM System Architecture
● Knowledge base: This is the domain knowledge that is
used to guide the search or evaluate the interestingness of
resulting patterns.
● Such knowledge can include concept hierarchies, used to
organize attributes or attribute values into different levels
of abstraction.
● Data warehouses typically provide a simple and concise view
around particular subject issues by excluding data that are not
useful in the decision support process.
● Knowledge such as user beliefs, which can be used to assess a
pattern’s interestingness based on its unexpectedness, may also
be included. Other examples of domain knowledge are additional
interestingness constraints or thresholds, and metadata (e.g.,
describing data from multiple heterogeneous sources).
● Data mining engine: This is essential to the data mining
system and ideally consists of a set of functional modules
for tasks such as characterization, association and
correlation analysis, classification, prediction, cluster
analysis, outlier analysis, and evolution analysis.
● Pattern evaluation module: This component typically
employs interestingness measures and interacts with the
data mining modules so as to focus the search toward
interesting patterns.
● It may use interestingness thresholds to filter out
discovered patterns.
● Alternatively, the pattern evaluation module may be
integrated with the mining module, depending on the
implementation of the data mining method used.
● For efficient data mining, it is highly recommended to push
the evaluation of pattern interestingness as deep as
possible into the mining process so as to confine the
search to only the interesting patterns.
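Filtering by interestingness thresholds can be sketched as a simple pass over mined rules. The rules, their support and confidence values, and both threshold constants are assumptions made for this example; a real system would take them from the mining engine and the knowledge base.

```python
# Hypothetical mined rules with precomputed interestingness measures.
rules = [
    {"rule": "bread -> butter", "support": 0.60, "confidence": 0.75},
    {"rule": "milk -> noodles", "support": 0.20, "confidence": 0.50},
    {"rule": "jam -> bread", "support": 0.40, "confidence": 1.00},
]

MIN_SUPPORT = 0.30     # assumed threshold
MIN_CONFIDENCE = 0.70  # assumed threshold

# Keep only the rules that meet both thresholds.
interesting = [
    r for r in rules
    if r["support"] >= MIN_SUPPORT and r["confidence"] >= MIN_CONFIDENCE
]

for r in interesting:
    print(r["rule"])
```

This version filters after mining. Pushing the same checks into the mining step, as recommended above, avoids generating the uninteresting rules in the first place.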
