Module 1 – Introduction: Machine Learning Notes
Module-1
Introduction: Need for Machine Learning, Machine Learning Explained, Machine Learning in Relation
to other Fields, Types of Machine Learning, Challenges of Machine Learning, Machine Learning Process,
Machine Learning Applications.
Understanding Data – 1: Introduction, Big Data Analysis Framework, Descriptive Statistics, Univariate
Data Analysis and Visualization.
Chapter-1, 2 (2.1-2.5)
Need for Machine Learning
Historical Challenges
o Data was scattered across different archive systems, making integration difficult.
o Lack of awareness about software tools to extract useful insights.
Reasons for Machine Learning Popularity
1. High Data Volume: Companies like Facebook, Twitter, and YouTube generate vast amounts
of data, doubling every year.
2. Reduced Storage Cost: Lower hardware costs make data capture, processing, storage, and
transmission easier.
3. Advanced Algorithms: The rise of deep learning has introduced complex and efficient
machine learning algorithms.
Knowledge Pyramid
o Data: Raw facts (numbers, text, etc.).
o Information: Processed data (patterns, associations).
o Knowledge: Condensed information (historical patterns, trends).
o Intelligence: Applied knowledge for actions.
o Wisdom: Human-like decision-making ability.
Data: A list of temperatures recorded hourly (e.g., 30°C, 32°C, 31°C, 29°C).
Information: The average temperature for the day is 30.5°C.
Knowledge: The temperature tends to drop in the evening based on past records.
Intelligence: Carry an umbrella or wear light clothing based on the weather forecast.
Wisdom: Choosing the best time to step out based on weather, personal health, and planned
activities.
Need for Machine Learning
o Helps organizations analyze archival data for better decision-making.
o Aids in designing new products and improving business processes.
o Supports the development of effective decision support systems.
Machine learning is about a system improving its performance at a task with experience. Example: Object detection improves as more labeled images (experience) are provided.
ML Process
1. Data Collection: Gathering relevant data.
2. Abstraction: Identifying key concepts from data.
3. Generalization: Converting abstract concepts into actionable intelligence.
4. Heuristic Formation: Making educated guesses based on patterns.
5. Evaluation & Course Correction: Refining models for better accuracy.
1.3.2 Machine Learning, Data Science, Data Mining, and Data Analytics
Data Science: An umbrella term covering multiple fields, including ML. It focuses on data
collection and analysis.
Big Data: Deals with massive datasets characterized by:
1. Volume – Large amounts of data (e.g., Facebook, YouTube).
2. Variety – Multiple formats (text, images, videos).
3. Velocity – High-speed data generation and processing.
Data Mining: Extracts hidden patterns from data, while ML uses these patterns for prediction.
Data Analytics: Converts raw data into useful insights. Predictive analytics is closely related to
ML.
Pattern Recognition: Uses ML algorithms to classify and analyze patterns in data.
1.3.3 Machine Learning and Statistics
Statistics: A mathematical field that sets hypotheses, validates them, and finds relationships in data.
Machine Learning: Focuses more on automation and requires fewer assumptions than traditional
statistics.
Key Differences:
o Statistics relies on rigorous mathematical models and theoretical foundations.
o ML is more tool-based, focusing on learning from data with minimal manual intervention.
Some view ML as an evolution of "old statistics," recognizing their deep connection.
Key algorithms: Decision Trees, Random Forest, SVM, Naïve Bayes, Neural Networks.
Machine Learning Process (CRISP-DM)
1. Understanding the Business – Identify business objectives, define the problem statement, and
determine whether a single algorithm is sufficient.
2. Understanding the Data – Collect data, analyze its characteristics, and match patterns to
hypotheses.
3. Data Preparation – Clean the raw data, handle missing values, and prepare it for model training.
4. Modeling – Apply machine learning algorithms to detect patterns and generate predictive models.
5. Evaluation – Assess model performance using statistical analysis, accuracy metrics, and domain
expertise.
6. Deployment – Implement the trained model to improve business processes or handle new situations.
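As an illustration of the Data Preparation, Modeling, and Evaluation stages, here is a minimal scikit-learn sketch; the dataset and model choices are assumptions for illustration, not part of the original notes.

```python
# Illustrative sketch of Data Preparation, Modeling, and Evaluation
# using scikit-learn's bundled breast-cancer dataset as stand-in data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Data Preparation: split into training/testing sets and normalize features.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Modeling: fit a decision tree on the prepared training data.
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Evaluation: measure predictive accuracy on unseen test data.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```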
1.7 Applications of Machine Learning
Machine learning is widely used across various domains. Some key applications include:
1. Sentiment Analysis – NLP-based analysis of emotions in text (e.g., product and movie reviews).
2. Recommendation Systems – Personalized suggestions in e-commerce (Amazon), streaming
platforms (Netflix), etc.
3. Voice Assistants – AI-powered assistants like Alexa, Siri, and Google Assistant.
4. Navigation & Transportation – Google Maps, Uber route optimization.
Chapter 2:
Understanding Data – 1: Introduction
Facts and Data in Computer Systems
All facts are considered data.
Bits encode facts in numbers, text, images, audio, and video.
Data can be human-interpretable (numbers, text) or machine-interpretable (images, videos).
Organizations store large volumes of data (Gigabytes, Terabytes, Exabytes).
1 Byte = 8 bits, 1 KB = 1024 bytes, 1 MB ≈ 1000 KB, 1 GB ≈ 1000 MB, 1 TB ≈ 1000 GB, 1 EB ≈ 1,000,000 TB.
Data Sources
Flat Files, Databases, Data Warehouses store data.
Operational Data: Used in daily business processes (e.g., sales records).
Non-Operational Data: Used for decision-making (e.g., historical trends).
Data vs. Information: data refers to raw facts, while information is processed data that reveals patterns and associations (see the knowledge pyramid above).
JSON (JavaScript Object Notation): Common data exchange format in machine learning.
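A minimal Python sketch of writing and reading JSON; the record fields below are made up for illustration.

```python
# A toy record serialized to and parsed from JSON (field names are hypothetical).
import json

record = {"customer_id": 101, "purchases": 12, "churned": False}
text = json.dumps(record)       # Python dict -> JSON string
restored = json.loads(text)     # JSON string -> Python dict
print(text, restored["purchases"])
```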
2. Data Management Layer – Manages storage, indexing, and retrieval of data. This includes
preprocessing tasks like data cleaning, deduplication, and transformation for efficient query
execution.
3. Data Analytics Layer – The core of Big Data processing, where machine learning algorithms,
statistical models, and predictive analytics are applied. The processing is done as shown below:
Cloud Computing
Definition: Pay-per-usage model providing shared processing power, storage, and services over the
Internet.
Service Models:
o SaaS (Software as a Service): Access to software applications via the cloud.
o PaaS (Platform as a Service): Platform to develop and run applications.
o IaaS (Infrastructure as a Service): Access to infrastructure like servers, storage, and OS.
Deployment Models:
o Public Cloud: Open to the public, managed by a third-party vendor.
o Private Cloud: Owned by a single organization, providing secure access.
o Community Cloud: Shared by multiple organizations for common goals.
o Hybrid Cloud: Combination of two or more cloud types.
Characteristics:
1. Shared Infrastructure – Shared physical resources (storage, networking).
2. Dynamic Provisioning – Resource allocation based on demand.
3. Dynamic Scaling – Automatic expansion and contraction of resources.
4. Network Access – Access through the Internet.
5. Utility-based Metering – Billing based on usage.
6. Multitenancy – Supports multiple customers.
7. Reliability – Ensures consistent, reliable service.
Grid Computing
Definition: Parallel and distributed computing model connecting multiple nodes to function as a
virtual supercomputer.
Features:
o Connects thousands of nodes as a cluster using middleware software.
o Tasks are divided and processed in parallel across multiple nodes.
o Suitable for complex applications requiring high computing power.
High-Performance Computing (HPC)
Definition: Uses parallel processing to solve complex scientific, engineering, and business problems
at high speed.
Components:
1. Compute – Networked servers to process tasks.
2. Network – Connects compute nodes for communication.
3. Storage – Stores and retrieves data outputs.
Features:
o Combines thousands of compute nodes working in parallel.
o Suitable for tasks requiring large-scale, fast computations.
4. Presentation Layer – The final stage that involves visualization techniques such as dashboards and
applications to interpret and display results effectively.
1. Open/Public Data
These datasets are freely available and have minimal copyright restrictions. Examples include:
Government Census Data – Demographic, economic, and social statistics collected by
governments.
Digital Libraries – Large repositories containing textual and image-based documents.
Scientific Databases – Collections of genomic, biological, and research data.
Healthcare Databases – Patient records, insurance data, and medical research information.
2. Social Media Data
Generated by platforms like Twitter, Facebook, YouTube, and Instagram, social media data includes:
Text posts, comments, and messages.
Images and videos.
Likes, shares, and interactions.
3. Multimodal Data
Includes diverse formats such as text, images, audio, and video. Examples:
Image Archives – Large repositories of labeled images, combined with metadata.
World Wide Web (WWW) – A vast source of structured and unstructured data distributed across
the internet.
1. Ignore the Tuple – Discard records with missing values (useful if missing data is minimal).
2. Manual Filling – Domain experts manually enter missing values (time-consuming for large
datasets).
3. Global Constant Substitution – Replace missing values with a generic label like "Unknown" or
"Infinity".
4. Attribute Mean Substitution – Fill missing values with the mean of the attribute.
5. Class-Specific Mean – Use the average of similar class records for substitution.
6. Predictive Methods – Use machine learning models like decision trees to estimate missing values.
These methods help reduce bias and improve dataset quality but may introduce estimation errors.
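A minimal pandas sketch of attribute-mean and class-specific mean substitution, assuming hypothetical "age" and "class" columns.

```python
# Sketch of attribute-mean and class-specific mean substitution with pandas
# (the column names and values are hypothetical).
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":   [25, np.nan, 40, np.nan, 35, 50],
    "class": ["A", "A",    "B", "B",    "A", "B"],
})

# Attribute mean substitution: fill with the overall mean of the column.
df["age_mean_filled"] = df["age"].fillna(df["age"].mean())

# Class-specific mean: fill using the mean of records in the same class.
df["age_class_filled"] = df["age"].fillna(
    df.groupby("class")["age"].transform("mean")
)
print(df)
```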
2. z-Score Normalization
Scales values using the mean (μ) and standard deviation (σ): v' = (v − μ) / σ
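A small Python sketch of z-score normalization on an arbitrary sample.

```python
# Minimal z-score normalization of a small sample list (values are arbitrary).
import statistics

values = [10, 20, 30, 40, 50]
mu = statistics.mean(values)          # mean (μ)
sigma = statistics.pstdev(values)     # population standard deviation (σ)
z_scores = [(v - mu) / sigma for v in values]
print(z_scores)   # each value expressed as "standard deviations from the mean"
```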
Data Reduction
Reduces dataset size without losing significant information.
Techniques:
1. Data Aggregation – Summarizing data.
2. Feature Selection – Removing irrelevant attributes.
3. Dimensionality Reduction – Techniques like PCA (Principal Component Analysis) reduce data
dimensions for better processing.
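A short sketch of dimensionality reduction with PCA, using scikit-learn's bundled Iris data as a stand-in.

```python
# Sketch of dimensionality reduction with PCA on the Iris data (4 -> 2 features).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)            # (150, 4) -> (150, 2)
print("explained variance:", pca.explained_variance_ratio_)
```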
Descriptive Statistics
Definition
Summarizes and describes datasets without applying machine learning algorithms.
Helps in Exploratory Data Analysis (EDA) for understanding data before applying ML techniques.
Dataset and Data Types
A dataset consists of multiple data objects (records, vectors, patterns, etc.).
Each data object has multiple attributes (characteristics of an object).
Pie Chart: Represents percentage frequency distribution; useful for showing proportions.
Histogram: Depicts frequency distributions; helps identify data shape, skewness, and mode.
Dot Plot: Similar to bar charts but less cluttered; visually identifies high and low values.
3. Central Tendency
Summarizes data by finding the central point.
I. Mean: Arithmetic average; affected by extreme values.
II. Median
The median is the middle value in an ordered dataset.
If the total number of values is odd, the median is the middle value.
If the total number of values is even, the median is the average of the two middle values.
In grouped data, the median is found using the formula: Median = L1 + ((N/2 − cf) / f) × i
where:
L1 = lower boundary of the median class
N= total number of observations
cf = cumulative frequency before the median class
f = frequency of the median class
i = class width
Example
Consider a dataset: 10, 20, 30, 40, 50
The median is 30 (middle value).
For an even dataset: 10, 20, 30, 40, 50, 60
The median is (30+40)/2 = 35
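These examples can be checked with Python's standard library.

```python
# Verifying the median examples above.
import statistics

print(statistics.median([10, 20, 30, 40, 50]))       # odd count  -> 30
print(statistics.median([10, 20, 30, 40, 50, 60]))   # even count -> 35.0
```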
III. Mode
The mode is the most frequently occurring value in a dataset.
If a dataset has one mode, it is unimodal (e.g., 10, 20, 20, 30 → mode = 20).
If it has two modes, it is bimodal (e.g., 10, 20, 20, 30, 30 → modes = 20, 30).
If it has three or more modes, it is trimodal/multimodal.
Example
Dataset: 5, 7, 8, 8, 10, 12, 12, 12, 15
The mode is 12 (occurs most frequently).
For grouped data, mode is calculated using a specific formula similar to median.
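A quick check of the mode examples in Python; multimode reports every value tied for the highest frequency.

```python
# Mode examples: mode returns the single most frequent value,
# multimode returns all values tied for the highest frequency.
import statistics

print(statistics.mode([5, 7, 8, 8, 10, 12, 12, 12, 15]))   # 12
print(statistics.multimode([10, 20, 20, 30, 30]))           # [20, 30] (bimodal)
```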
4. Dispersion
Measures the spread of data around the central value.
Interquartile Range (IQR): Difference between the third (Q3) and first quartile (Q1); helps detect
outliers.
Five-Point Summary and Box Plot
Definition: A box plot is a graphical method to display data distribution using five key values.
Five-Number Summary: Minimum, First Quartile (Q1), Median (Q2), Third Quartile (Q3), Maximum.
Box Representation:
The box represents the middle 50% of the data (from Q1 to Q3).
A line inside the box represents the median (Q2).
Whiskers: Lines extending from the box toward the minimum and maximum values; points lying beyond 1.5 × IQR from the box are typically plotted as outliers.
Skewness:
If the median is not in the center of the box, the data is skewed.
Example:
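A minimal Python sketch that computes the five-number summary and IQR outlier fences for an assumed sample dataset.

```python
# Five-number summary and IQR outlier fences for an assumed sample dataset.
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41])    # arbitrary example values
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print("min:", data.min(), "Q1:", q1, "median:", q2, "Q3:", q3, "max:", data.max())
print("outlier fences:", q1 - 1.5 * iqr, "to", q3 + 1.5 * iqr)
```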
Q-Q Plot: Assesses if data follows a normal distribution; points align along the 45-degree reference
line in normal cases.
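A short sketch of producing a Q-Q plot with SciPy, using synthetic normally distributed data.

```python
# Q-Q plot sketch: random normal data should fall close to the 45-degree line.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

sample = np.random.default_rng(0).normal(loc=0, scale=1, size=200)
stats.probplot(sample, dist="norm", plot=plt)
plt.show()
```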
Exercise questions:
Long Questions
1. Explain in detail the machine learning process model.
Machine learning follows a structured process to build models that can analyze data and make
predictions. One of the most widely used process models for machine learning is CRISP-DM (Cross-
Industry Standard Process for Data Mining). It consists of six important steps that ensure a
systematic approach to solving business problems using machine learning techniques.
1.1 Business Understanding
The first step is to clearly define the business objective. This involves identifying the problem that needs to
be solved and understanding how machine learning can provide a solution.
Example: A retail company wants to predict customer churn. The goal is to develop a model that identifies
customers likely to leave and implement strategies to retain them.
1.2 Data Understanding
In this phase, data is collected from multiple sources, and an exploratory analysis is conducted. Key
activities include identifying missing values, data distribution, and potential relationships between variables.
Example: The retail company gathers past customer transaction records, demographics, and engagement
levels to understand which factors influence customer retention.
1.3 Data Preparation
The raw data collected is cleaned and transformed into a usable format. This includes handling missing
values, removing duplicates, encoding categorical data, normalizing numerical values, and feature
engineering.
Example: Converting customer transaction logs into structured numerical features such as purchase
frequency, average spending, and time since last purchase.
1.4 Modeling
This step involves selecting an appropriate machine learning algorithm and training it using prepared data.
Various algorithms like Decision Trees, Neural Networks, and Support Vector Machines are tested to
determine the best model.
Example: The company trains multiple models to predict customer churn and compares their accuracy.
1.5 Evaluation
The trained models are tested using validation datasets to measure their accuracy, precision, recall, and other
performance metrics. Cross-validation techniques are also used to check for overfitting.
Example: If a model predicts customer churn with 85% accuracy, it is further analyzed to ensure it
generalizes well to unseen data.
1.6 Deployment
The final step involves deploying the model into production. The model is integrated into the business
process to make real-time predictions and assist in decision-making. Regular monitoring ensures that the
model continues to perform well.
Example: The churn prediction model is integrated into the company’s CRM system to alert managers about
customers at risk of leaving.
2. List out and briefly explain the classification algorithms.
Classification algorithms are supervised learning techniques used to categorize data into predefined classes.
They are widely used in spam detection, medical diagnosis, and customer segmentation. Below are some
important classification algorithms:
2.1 Decision Trees
Decision trees use a tree-like structure where each internal node represents a decision based on a
feature, and each leaf node represents a class label.
Strengths: Easy to interpret, handles both numerical and categorical data.
Weaknesses: Prone to overfitting, especially on small datasets.
2.2 Support Vector Machines (SVM)
SVM finds the optimal hyperplane that separates different classes in a high-dimensional space.
Strengths: Effective in high-dimensional spaces, robust against overfitting.
Weaknesses: Computationally expensive for large datasets.
2.3 Random Forest
An ensemble learning method that builds multiple decision trees and combines their predictions. The
final output is determined by majority voting.
Strengths: Reduces overfitting, performs well on complex datasets.
Weaknesses: Requires more computational power compared to a single decision tree.
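A minimal scikit-learn sketch comparing the three classifiers above; the dataset and parameter choices are illustrative assumptions.

```python
# Side-by-side sketch of the classifiers discussed above on the bundled Iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM":           SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation accuracy
    print(f"{name}: {scores.mean():.3f}")
```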
Chapter 2 : Exercise:
Question 1: What is univariate data?
Answer: Univariate data consists of a single variable and is analyzed to understand its distribution, central
tendency, and dispersion. Examples include students' test scores, height measurements, or weights.
Question 7: Why are central tendency and dispersion measures important for data miners?
Answer: They help understand data distribution, detect outliers, and make predictions for decision-making.
Problems:
Sol: mean x̄ = (Σ xᵢ) / n; variance σ² = Σ (xᵢ − x̄)² / n; standard deviation σ = √σ².
Sol: mean x̄ = (Σ xᵢ) / n; geometric mean GM = (x₁ · x₂ · … · xₙ)^(1/n).
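A small Python sketch of how these statistics are computed, on placeholder values (the original problem data is assumed).

```python
# Mean, variance, standard deviation, and geometric mean on placeholder values.
import statistics

data = [4, 8, 15, 16, 23, 42]                  # placeholder sample
mean = statistics.mean(data)
variance = statistics.pvariance(data)          # population variance σ²
std_dev = statistics.pstdev(data)              # σ = sqrt(σ²)
geo_mean = statistics.geometric_mean(data)     # (x1·x2·…·xn)^(1/n)
print(mean, variance, std_dev, geo_mean)
```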
Sol:
Step 1: Find Minimum and Maximum
Minimum = 5
Maximum = 30
Step 2: Find the Median (Q2)
Since the dataset has 6 values (even count), the median is the average of the two middle values: Q2 = (15 + 20) / 2 = 17.5
Step 3: Find the Quartiles
Q1 = Median(5, 10, 15) = 10
Q3 = Median(20, 25, 30) = 25
Now, the box plot is drawn using this five-number summary: Minimum = 5, Q1 = 10, Median = 17.5, Q3 = 25, Maximum = 30.
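A minimal matplotlib sketch of this box plot.

```python
# Drawing the box plot for the dataset above (5, 10, 15, 20, 25, 30).
# Note: matplotlib interpolates quartiles, so the box edges can differ slightly
# from the hand-computed Q1 = 10 and Q3 = 25 (median-of-halves method).
import matplotlib.pyplot as plt

data = [5, 10, 15, 20, 25, 30]
plt.boxplot(data, vert=False)
plt.title("Box plot of the sample dataset")
plt.xlabel("Value")
plt.show()
```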