Data Mining
• Definition: Data mining is the process of discovering patterns, correlations, and anomalies within large datasets in order to predict outcomes. It draws on statistics, machine learning, and database systems to extract useful information.
• Functionalities (two of these are sketched in code after this list):
o Classification: Assigning data to predefined categories.
o Clustering: Grouping similar data points.
o Regression: Predicting a continuous value.
o Association Rule Mining: Discovering relationships between variables.
o Anomaly Detection: Identifying outliers.
o Sequential Patterns: Finding regular sequences over time.
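A minimal sketch of two of these functionalities, clustering and classification, assuming scikit-learn is available; the toy points and labels are invented for illustration:

```python
# Illustrative sketch of clustering and classification with scikit-learn
# (assumed available); toy data only, not a production pipeline.
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Clustering: group similar points without labels.
points = [[1, 2], [1, 4], [10, 2], [10, 4]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)          # e.g. [0 0 1 1] -- two discovered groups

# Classification: assign new data to predefined categories.
X, y = [[0], [1], [2], [3]], ["low", "low", "high", "high"]
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[1.5]]))    # predicted category for an unseen value
```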
Data Processing
• Steps: Collecting, cleaning, transforming, and integrating data to prepare it for analysis.
• Importance: Ensures the quality and accuracy of the data used in mining processes.
• Data Cleaning: Removing noise and correcting inconsistencies to ensure data quality.
• Data Integration: Combining data from different sources.
• Data Transformation: Converting data into appropriate formats or structures.
• Data Reduction: Reducing the volume of data to make analysis more efficient while maintaining
data integrity.
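A compact sketch of these steps with pandas (assumed available); the two source tables and column names are invented:

```python
# Minimal preprocessing sketch: integrate two sources, clean a missing
# value, and normalize a column (pandas assumed; data is invented).
import pandas as pd

sales = pd.DataFrame({"id": [1, 2, 3], "amount": [100.0, None, 250.0]})
regions = pd.DataFrame({"id": [1, 2, 3], "region": ["N", "S", "N"]})

df = sales.merge(regions, on="id")                       # integration
df["amount"] = df["amount"].fillna(df["amount"].mean())  # cleaning
df["amount_norm"] = (df["amount"] - df["amount"].min()) / (
    df["amount"].max() - df["amount"].min()
)                                                        # transformation (min-max)
print(df)
```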
Data Cleaning
• Missing Values: Techniques to handle missing data include ignoring the tuple, filling in the value manually, using a global constant, or substituting the attribute mean.
• Noisy Data: Can be handled through binning, regression, or clustering.
o Binning: Smoothing data by grouping values into bins and replacing them with bin means or boundaries.
o Clustering: Grouping similar data points; values that fall outside all clusters are treated as outliers.
o Regression: Smoothing data by fitting it to a regression function.
• Inconsistent Data: Detecting and correcting inconsistent data entries.
• Data Integration: Combining multiple datasets from different sources to provide a unified view.
• Data Transformation: Involves normalization (scaling data to a smaller range) and aggregation
(summarizing data).
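A short pandas sketch of two of the cleaning techniques above, mean imputation and equal-width binning; the age column and its values are illustrative:

```python
# Fill missing values with the attribute mean, then smooth noisy data
# by equal-width binning (pandas assumed; data is invented).
import pandas as pd

df = pd.DataFrame({"age": [22, None, 35, 41, None, 29]})

# Missing values: mean imputation (one of several options).
df["age"] = df["age"].fillna(df["age"].mean())

# Binning: 3 equal-width bins, then smooth each value by its bin mean.
df["bin"] = pd.cut(df["age"], bins=3)
df["age_smoothed"] = df.groupby("bin", observed=True)["age"].transform("mean")
print(df)
```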
Data Reduction
• Importance: Reduces the volume of data while maintaining its analytical value.
• Techniques (a dimensionality-reduction sketch follows this list):
o Dimensionality Reduction: Reducing the number of attributes.
o Numerosity Reduction: Reducing the data volume through methods like regression models
and histograms.
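A minimal dimensionality-reduction sketch using PCA from scikit-learn (assumed available); the 3-attribute toy matrix is invented:

```python
# PCA projects 3 attributes down to 2 while keeping most of the
# variance (scikit-learn assumed; data is invented).
from sklearn.decomposition import PCA

X = [[2.0, 8.0, 1.0],
     [3.0, 9.0, 1.5],
     [9.0, 2.0, 8.0],
     [8.0, 1.0, 7.5]]

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (4, 2) -- fewer attributes
print(pca.explained_variance_ratio_)   # variance retained per component
```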
Apriori Algorithm
• Definition: A classic algorithm for mining frequent itemsets and deriving association rules from them.
• Principle: Every non-empty subset of a frequent itemset must itself be frequent, which lets candidates be pruned level by level.
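A compact, illustrative implementation of the level-wise Apriori idea on invented transactions; this is a teaching sketch, not an optimized miner:

```python
# Level-wise Apriori sketch: count candidate itemsets per level and
# keep only those meeting the minimum support (toy data, pure Python).
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
min_support = 2  # absolute support count

def apriori(transactions, min_support):
    items = {i for t in transactions for i in t}
    frequent, k = {}, 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Apriori principle: build k-itemsets only from frequent ones.
        prev = list(level)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    return frequent

print(apriori(transactions, min_support))
```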
Classification Techniques
• Decision Tree
o Definition: A tree-like model used to make decisions based on input features.
o Construction: Nodes represent features, branches represent decision rules, and leaves
represent outcomes.
o Advantages: Easy to understand and interpret.
• Bayesian Classification
o Based on Bayes' Theorem: Uses probability to predict the category of a data point.
o Naive Bayes Classifier: Assumes independence between features.
o Application: Commonly used for text classification and spam detection.
• K-Nearest Neighbour (K-NN) Classifiers
o Definition: A simple, instance-based learning algorithm.
o Function: Classifies a data point based on the majority class of its k-nearest neighbors.
o Advantages: Easy to implement, effective with a small amount of data.
o Disadvantages: Computationally intensive with large datasets.
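A side-by-side sketch of the three classifiers above on an invented height/weight toy set, assuming scikit-learn is available; real use needs train/test splits and tuning:

```python
# Fit a decision tree, a naive Bayes model, and a k-NN classifier on
# the same toy data and compare predictions (scikit-learn assumed).
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X = [[170, 65], [180, 80], [160, 50], [175, 75], [155, 45], [185, 90]]
y = ["M", "M", "F", "M", "F", "M"]  # invented height/weight -> class

for model in (DecisionTreeClassifier(),
              GaussianNB(),
              KNeighborsClassifier(n_neighbors=3)):
    model.fit(X, y)
    print(type(model).__name__, model.predict([[165, 60]]))
```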
Data Warehousing
• Definition: A data warehouse is a centralized repository for storing large volumes of data from multiple sources. It is designed for query and analysis rather than transaction processing.
• Purpose: Enables organizations to consolidate data, perform analytics, and generate insights for
decision-making.
Delivery Process
• Definition: The staged, incremental process of designing, building, and rolling out a data warehouse, from business requirements through deployment.
Multi-Dimensional Data Model
• Concept: Data is modeled as dimensions and facts, allowing complex queries and analysis.
• Components:
o Dimensions: Attributes or perspectives for analyzing data (e.g., time, geography).
o Facts: Quantitative data points (e.g., sales revenue).
Data Cubes
• Definition: Multi-dimensional arrays of data, allowing data to be viewed and analyzed from multiple
perspectives.
• Operations (sketched with pandas after this list):
o Slice: Extracting a subset of data along a specific dimension.
o Dice: Extracting a subcube by selecting specific values from multiple dimensions.
o Roll-up: Aggregating data along a dimension (e.g., daily to monthly sales).
o Drill-down: Breaking down aggregated data into finer details.
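These operations can be sketched on a flat table with pandas (assumed available); the sales data is invented:

```python
# Cube-style operations expressed as pivots and filters on a flat
# table (pandas assumed; data is invented).
import pandas as pd

df = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["N", "S", "N", "S"],
    "product": ["A", "A", "B", "B"],
    "sales":   [100, 150, 120, 180],
})

# Roll-up: aggregate sales along the time and region dimensions.
print(df.pivot_table(values="sales", index="region",
                     columns="year", aggfunc="sum"))

# Slice: fix one dimension (region == "N").
print(df[df["region"] == "N"])

# Dice: select specific values on multiple dimensions.
print(df[(df["region"] == "N") & (df["year"] == 2024)])
```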
Schemas
• Star Schema: A simple schema where a central fact table is connected to dimension tables.
• Snowflake Schema: An extension of the star schema with normalized dimension tables.
• Fact Constellations: Multiple fact tables sharing dimension tables, representing complex
relationships.
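A star schema in miniature with pandas (assumed available): one invented fact table joined to two invented dimension tables:

```python
# Central fact table joined to dimension tables, then aggregated --
# the basic star-schema query pattern (pandas assumed; data invented).
import pandas as pd

dim_time = pd.DataFrame({"time_id": [1, 2], "month": ["Jan", "Feb"]})
dim_store = pd.DataFrame({"store_id": [10, 20], "city": ["Oslo", "Bergen"]})
fact_sales = pd.DataFrame({
    "time_id": [1, 1, 2], "store_id": [10, 20, 10], "revenue": [500, 300, 450],
})

# Query: revenue per city per month, via joins from fact to dimensions.
report = (fact_sales.merge(dim_time, on="time_id")
                    .merge(dim_store, on="store_id")
                    .groupby(["city", "month"])["revenue"].sum())
print(report)
```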
Concept Hierarchy
• Definition: Organizes data into a hierarchical structure, allowing different levels of abstraction.
• Example: Geography dimension with levels such as country, state, and city.
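A minimal sketch of climbing one level of such a hierarchy with pandas (assumed available); the city-to-country mapping is illustrative:

```python
# Concept-hierarchy roll-up: map cities to a higher level (country)
# and re-aggregate (pandas assumed; data and mapping are invented).
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Bergen", "Stockholm"],
                   "sales": [100, 80, 120]})
city_to_country = {"Oslo": "Norway", "Bergen": "Norway",
                   "Stockholm": "Sweden"}

df["country"] = df["city"].map(city_to_country)
print(df.groupby("country")["sales"].sum())  # climb one level up
```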
Process Architecture
• Components: Data sources, ETL process, data warehouse, and front-end tools for querying and
analysis.
3-Tier Architecture
• Layers:
o Bottom Tier: Data warehouse server (RDBMS).
o Middle Tier: OLAP server for multi-dimensional analysis.
o Top Tier: Front-end tools for reporting and data mining.
Data Marting
• Definition: A subset of the data warehouse, focused on a specific business area or department.
• Purpose: Provides more targeted and efficient access to data for specific user groups.
Aggregation
• Purpose: Summarizes detailed data for analysis, improving query performance.
• Types: SUM, AVG, COUNT, MAX, MIN.
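All five aggregate types in one pandas call on an invented table (pandas assumed available):

```python
# SUM, AVG (mean), COUNT, MAX, MIN computed per group in one call.
import pandas as pd

df = pd.DataFrame({"region": ["N", "N", "S"], "sales": [100, 150, 200]})
print(df.groupby("region")["sales"].agg(["sum", "mean", "count", "max", "min"]))
```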
Historical Information
• Role: A warehouse retains historical snapshots of data, often spanning years, enabling trend analysis and period-over-period comparisons that operational systems do not support.
Query Facility
• Capabilities: Allows complex queries for data analysis, supporting multi-dimensional analysis and
ad-hoc queries.
• OLAP (Online Analytical Processing): Tools and techniques for multi-dimensional analysis of
data.
• Functions:
o Roll-up and Drill-down: Aggregating and breaking down data.
o Slice and Dice: Viewing data from different perspectives.
o Pivoting: Rotating data axes for alternative views.
OLAP Servers
• Types:
o ROLAP (Relational OLAP): Uses relational databases to store and manage warehouse data.
o MOLAP (Multidimensional OLAP): Uses multi-dimensional databases for faster
processing.
o HOLAP (Hybrid OLAP): Combines ROLAP and MOLAP, leveraging the strengths of both.
• OLAP Mining (OLAM): Integrates data mining techniques with OLAP for advanced analytics and pattern discovery.
Security, Backup, and Recovery
• Security: Ensures data integrity and protection against unauthorized access, implemented through user authentication, access control, and encryption.
• Backup: Regular data backups to prevent data loss.
• Recovery: Processes to restore data in case of system failure or data corruption.