
Module 3: Understanding Data

What is data? Explain its importance in machine learning.


Introduction
Data is a collection of raw facts, figures, or symbols that can be processed to extract
meaningful information. In machine learning, data is not just a resource—it is the core that
enables models to learn and make predictions. Without data, no machine learning system
can be trained or evaluated.

Definition: Data is any collection of values, numbers, characters, or images that can be
recorded and used for further analysis. It may be structured (in tables), unstructured (like
images), or semi-structured (like JSON/XML formats).

Types of data relevant in ML

• Structured: Stored in rows and columns (e.g., spreadsheets).

• Unstructured: Includes free-text, images, audio, video.

• Semi-structured: Organized using tags or markup languages.

Role of data in Machine Learning

1. Training Models: Algorithms learn from historical data. The better the quality and
variety of data, the better the model performs.

2. Validation and Testing: Helps evaluate the model’s performance on unseen cases.

3. Feature Engineering: Important patterns and insights are extracted from data,
influencing the final predictions.

Example
In a spam email detection system, thousands of emails (labeled as spam or not) are fed to a
model. From this data, the model learns what patterns or words typically occur in spam
emails.
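
A minimal sketch of this idea in Python with scikit-learn is shown below; the sample emails, the labels, and the choice of a Naive Bayes classifier are illustrative assumptions, not details from the text.

    # Sketch: spam detection from labeled emails (toy, invented data).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    emails = ["win a free prize now", "meeting at 10 am tomorrow",
              "free lottery winner claim now", "project report attached"]
    labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

    vectorizer = CountVectorizer()        # turn words into count features
    X = vectorizer.fit_transform(emails)  # learn the vocabulary

    model = MultinomialNB()               # a common baseline for text
    model.fit(X, labels)                  # learn which words signal spam

    # Classify a new, unseen email
    print(model.predict(vectorizer.transform(["claim your free prize"])))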

Why data is critical in ML

• Machine Learning is fundamentally data-driven.

• Poor quality or biased data results in poor and biased predictions.

• The size and diversity of data affect model generalization.

Explain different types of data with suitable examples.


Introduction
Understanding the types of data is essential in machine learning, as different types require
different preprocessing methods and influence the choice of algorithms. The classification
of data can be based on structure, measurement scale, and its nature.
1. Based on Structure

a) Structured Data

• Highly organized and easily searchable.

• Stored in tabular formats such as spreadsheets or relational databases.

• Machine learning models like decision trees, logistic regression, and SVM work well
with structured data.

• Example: A table with student IDs, names, test scores.

b) Unstructured Data

• No pre-defined format.

• Includes images, videos, emails, audio recordings, and raw text.

• Requires preprocessing (e.g., NLP for text, CNNs for images).

• Example: Product reviews, tweets, recorded customer service calls.

c) Semi-Structured Data

• Falls between structured and unstructured data.

• Has tags or keys to define data hierarchy and structure.

• Common in web APIs and NoSQL databases.

• Example: XML, JSON, HTML documents.
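
As a small illustration, a JSON record mixes fixed keys (the structure) with free-form values; the record below is invented. Python's built-in json module can parse it:

    # Sketch: parsing a semi-structured JSON record (invented data).
    import json

    record = '{"id": 101, "name": "Asha", "review": "Great product, fast delivery!"}'
    data = json.loads(record)   # keys define the structure
    print(data["review"])       # values can be free-form text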

2. Based on Measurement Scales (from Textbook 2)

a) Qualitative Data (Categorical)

Descriptive and not numerically measurable.

• Nominal: No inherent order.

o Example: Gender (Male/Female), Department (HR/Finance).

• Ordinal: Ordered categories.

o Example: Customer satisfaction (Low, Medium, High).

b) Quantitative Data (Numerical)

Measured in numbers and usable for statistical calculations.

• Discrete: Countable whole numbers.

o Example: Number of students in a class.


• Continuous: Can take any value within a range.

o Example: Height, Temperature.

Why this classification is important in ML

• Algorithms treat numeric and categorical data differently.

• Preprocessing techniques like normalization, encoding, and binning are chosen based on the data type (see the sketch after this list).

• Helps determine statistical techniques and visualization tools.
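
A brief sketch of this point in pandas follows; the column names and values are made up for illustration.

    # Sketch: type-dependent preprocessing (invented columns and values).
    import pandas as pd

    df = pd.DataFrame({
        "height_cm": [150.0, 165.0, 180.0],        # continuous numeric
        "satisfaction": ["Low", "High", "Medium"]  # ordinal categorical
    })

    # Numeric column: min-max normalization to [0, 1]
    rng = df["height_cm"].max() - df["height_cm"].min()
    df["height_norm"] = (df["height_cm"] - df["height_cm"].min()) / rng

    # Ordinal column: map ordered categories to integers
    df["satisfaction_enc"] = df["satisfaction"].map({"Low": 0, "Medium": 1, "High": 2})
    print(df)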

Different types of data require different strategies for handling, analysis, and modeling.
Understanding data types is the foundation for applying machine learning correctly and
effectively. Whether structured or unstructured, categorical or continuous, each type plays a
unique role in data-driven decision making.

What is Big Data Analytics? Explain its characteristics with suitable examples.
Introduction
Big Data Analytics refers to the process of examining large and varied data sets—often
called big data—to uncover hidden patterns, unknown correlations, market trends, and
customer preferences. It enables organizations to make better decisions using massive
amounts of data generated in real time.

Definition (Textbook-based)
Big data is defined as datasets that are so large, complex, or fast-changing that traditional
data processing methods are inadequate to handle them. Big Data Analytics applies
statistical and machine learning techniques to this data to extract useful insights.

Characteristics of Big Data – The 5Vs

1. Volume

• Refers to the sheer amount of data generated every second.

• Example: Facebook processes over 4 petabytes of data every day.

2. Velocity

• The speed at which new data is created and needs to be processed.

• Example: Twitter users post hundreds of thousands of tweets per minute.

3. Variety
• Data comes in multiple formats: text, images, audio, logs, emails.

• Example: An e-commerce company collects structured order records, unstructured customer reviews, and product images.

4. Veracity

• Refers to the uncertainty or trustworthiness of data.

• Inaccurate or incomplete data leads to misleading analytics.

• Example: In social media, fake news or bots may distort data reliability.

5. Value

• Extracting actionable insights that provide business advantage.

• Example: Recommending products based on customer purchase patterns.

Applications of Big Data Analytics

1. Healthcare: Analyzing patient records and diagnostic images to predict diseases.

2. Finance: Fraud detection through real-time transaction analysis.

3. Retail: Personalized recommendations using customer browsing behavior.

4. Agriculture: Crop yield prediction using sensor and weather data.

Benefits of Big Data Analytics

• Increases efficiency and automation.

• Improves forecasting and trend analysis.

• Enables personalization at scale.

• Supports real-time decision-making.

Conclusion
Big Data Analytics is not just about handling massive datasets—it’s about transforming that
data into meaningful insights. With the right infrastructure and tools, organizations can
unlock immense value and gain a competitive edge.

Explain the four types of analytics with real-world examples.


Introduction
Analytics is the science of analyzing raw data to make conclusions and support decision-
making. In machine learning and data science, analytics is broadly classified into four
types—descriptive, diagnostic, predictive, and prescriptive—each offering increasing levels
of insight and actionability.

1. Descriptive Analytics

Purpose: Answers “What happened?”

• It summarizes past events and trends using statistical methods.

• Reports, dashboards, and scorecards fall under this category.

• Example: An e-commerce company uses descriptive analytics to display monthly sales and customer acquisition trends.

Tools Used: Excel, Tableau, SQL, business intelligence tools.

2. Diagnostic Analytics

Purpose: Answers “Why did it happen?”

• It digs deeper into the data to understand causes behind a trend or outcome.

• Involves techniques like correlation analysis, drill-down, and data mining.

• Example: If sales dropped in a region, diagnostic analytics might reveal that stockouts or poor weather were the reasons.

Tools Used: SQL queries, statistical analysis, Python (pandas, seaborn).

3. Predictive Analytics

Purpose: Answers “What is likely to happen?”

• Uses machine learning and statistical models to forecast future outcomes based on
historical data.

• Example: A telecom company predicts which customers are at risk of leaving (churn) using past call and complaint records.

Techniques: Regression, decision trees, time-series forecasting.
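
A hedged sketch of such a prediction with logistic regression follows; the features (monthly calls, complaints filed) and the labels are synthetic placeholders, not real telecom data.

    # Sketch: churn prediction with logistic regression (synthetic data).
    from sklearn.linear_model import LogisticRegression

    # [monthly_calls, complaints_filed] for six customers
    X = [[120, 0], [5, 4], [80, 1], [10, 5], [95, 0], [8, 3]]
    y = [0, 1, 0, 1, 0, 1]  # 1 = churned, 0 = stayed

    model = LogisticRegression()
    model.fit(X, y)  # learn how usage and complaints relate to churn

    # Estimated churn probability for a low-usage, high-complaint customer
    print(model.predict_proba([[12, 4]])[0][1])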

4. Prescriptive Analytics

Purpose: Answers “What should we do?”

• Suggests optimal actions using simulation, optimization, and decision rules.


• Combines outputs of predictive models with business goals.

• Example: A logistics company suggests the best delivery route based on real-time
traffic and weather predictions.

Tools Used: Operations research tools, ML frameworks, optimization libraries.

Comparison Table

Type         | Question Answered  | Use Case
Descriptive  | What happened?     | Monthly sales report
Diagnostic   | Why did it happen? | Decline in user engagement analysis
Predictive   | What will happen?  | Forecasting demand
Prescriptive | What should we do? | Route optimization for deliveries

Conclusion
Each type of analytics serves a unique purpose in the data analysis pipeline. While
descriptive and diagnostic analytics help understand the past and present, predictive and
prescriptive analytics empower businesses to take proactive, data-driven decisions for the
future.

Describe the Big Data Analytics Framework. Explain each stage with examples.
Introduction
A Big Data Analytics Framework is a structured approach for collecting, storing, processing,
analyzing, and visualizing massive datasets to generate actionable insights. The framework
provides a step-by-step pipeline to manage big data efficiently and integrate it with machine
learning and business intelligence systems.

Key Stages in Big Data Analytics Framework

1. Data Acquisition (Collection)

This stage involves gathering data from multiple sources such as web logs, social media, IoT
devices, mobile applications, sensors, and transactional systems.

Example: A fitness app collects heart rate and step count from wearable devices every
second.

Tech Used: APIs, web scrapers, data ingestion tools like Apache Flume, Kafka.
2. Data Storage

The collected data must be stored in a scalable and reliable environment. Depending on the
type (structured or unstructured), storage solutions may vary.

Example: Sensor data from a smart city project is stored in a distributed file system for later
analysis.

Tech Used: HDFS (Hadoop Distributed File System), NoSQL databases like MongoDB, cloud
storage like AWS S3.

3. Data Preprocessing (Cleaning & Transformation)

Raw data is often noisy, incomplete, and inconsistent. Preprocessing prepares it for
analysis.

Tasks Include:

• Removing duplicates

• Handling missing values

• Converting formats (e.g., from text to numeric)

Example: Cleaning and formatting millions of social media posts before sentiment analysis.

Tools: Python (Pandas), Apache Spark, OpenRefine
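
A minimal pandas sketch of these tasks, using a toy table with made-up columns:

    # Sketch: removing duplicates, converting formats, filling missing values.
    import pandas as pd

    df = pd.DataFrame({
        "user": ["a", "a", "b", "c"],
        "age": ["25", "25", None, "31"]  # stored as text, one value missing
    })

    df = df.drop_duplicates()                       # remove duplicate rows
    df["age"] = pd.to_numeric(df["age"])            # convert text to numeric
    df["age"] = df["age"].fillna(df["age"].mean())  # fill missing with mean
    print(df)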

4. Data Processing & Analysis

In this stage, the cleaned data is analyzed to discover patterns, trends, or predictions using
algorithms and models.

Example: An online store predicts future sales using past purchase trends and seasonal
data.

Tools: Apache Spark, Hadoop MapReduce, machine learning models (SVM, regression, etc.)

5. Data Visualization & Reporting

After extracting insights, the results must be communicated clearly using charts,
dashboards, and reports.

Example: A business dashboard displays key KPIs like monthly sales, churn rate, and
inventory levels.

Tools: Tableau, Power BI, Python (Matplotlib/Seaborn), Excel


Summary Table

Stage               | Description                               | Tools/Examples
Data Acquisition    | Collecting data from sources              | IoT, APIs, Flume, Kafka
Data Storage        | Storing structured/unstructured data      | HDFS, NoSQL, AWS S3
Data Preprocessing  | Cleaning, transforming raw data           | Pandas, Spark, OpenRefine
Processing/Analysis | Analyzing data for insights               | Spark, ML models, SQL
Visualization       | Presenting results via charts/dashboards  | Tableau, Power BI, Matplotlib

Conclusion
The Big Data Analytics Framework provides a roadmap to systematically handle massive
datasets, ensuring that the data is accurate, processed efficiently, and turned into valuable
business insights. Following this structured pipeline helps organizations make data-driven
decisions at scale.

What are descriptive statistics? Explain key measures of central tendency and
dispersion with suitable examples.
Introduction
Descriptive statistics refer to methods that summarize and describe the essential features of
a dataset. These statistics provide simple quantitative descriptions of a collection of data
and are typically the first step in data analysis before applying machine learning algorithms.

Descriptive statistics are broadly categorized into two types:

1. Measures of Central Tendency – Indicate the average or center of data.

2. Measures of Dispersion (Variability) – Indicate the spread of data around the center.

1. Measures of Central Tendency

These measures show where most values in a dataset lie.

a) Mean (Arithmetic Average)

• Sum of all values divided by the number of values.

• Sensitive to extreme values (outliers).

• Example:
Data: 10, 15, 20, 25
Mean = (10 + 15 + 20 + 25) / 4 = 17.5
b) Median

• The middle value when the data is sorted.

• More robust than mean in skewed datasets.

• Example:
Data: 3, 7, 8, 12, 20 → Median = 8

c) Mode

• Most frequently occurring value in the dataset.

• Useful for categorical data.

• Example:
Data: Red, Blue, Blue, Green → Mode = Blue

2. Measures of Dispersion (Spread)

These measures show how spread out the values are around the mean.

a) Range

• Difference between the maximum and minimum values.

• Example:
Data: 4, 10, 15 → Range = 15 - 4 = 11

b) Variance

• Average of the squared differences from the mean.

• Gives an idea of how far data points are from the mean.

• Example:
Data: 10, 15, 20, 25 (mean = 17.5) → Variance = (7.5² + 2.5² + 2.5² + 7.5²) / 4 = 31.25

c) Standard Deviation (SD)

• Square root of the variance.

• Easier to interpret as it is in the same unit as the data.

• A high SD indicates widely spread values; low SD shows tightly packed values.

• Example:
If the average salary is ₹50,000 with SD = ₹5,000, a typical employee earns within ₹5,000 of the mean, i.e., between ₹45,000 and ₹55,000 (about 68% of employees fall in this range if salaries are roughly normally distributed).
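
All of the measures above can be computed with Python's standard statistics module; the sketch below reuses the example data from this section.

    # Sketch: descriptive statistics with the standard library.
    import statistics

    data = [10, 15, 20, 25]
    print(statistics.mean(data))       # 17.5
    print(statistics.median(data))     # 17.5
    print(statistics.pvariance(data))  # 31.25 (population variance)
    print(statistics.pstdev(data))     # ~5.59 (population standard deviation)
    print(statistics.mode(["Red", "Blue", "Blue", "Green"]))  # Blue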

Importance of Descriptive Statistics in Machine Learning

• Helps in understanding the data before applying models.


• Identifies skewness, outliers, and anomalies.

• Aids in deciding preprocessing techniques like normalization or scaling.

• Provides insights for feature engineering and data cleaning.

Conclusion
Descriptive statistics provide the foundation for analyzing and interpreting data.
Understanding measures like mean, median, variance, and standard deviation helps in
making informed decisions and preparing data effectively for machine learning models.

What is univariate data analysis? Explain its techniques and applications with
examples.
Introduction
Univariate data analysis is the simplest form of data analysis, where only one variable is
analyzed at a time. The purpose is to understand the underlying distribution, detect
outliers, and summarize the variable’s characteristics using statistics and visualizations.

The term “uni-variate” literally means “one variable.” Since only one variable is involved,
there are no relationships or comparisons between variables—unlike bivariate or
multivariate analysis.

Techniques Used in Univariate Analysis

There are two broad types of variables: categorical and numerical, and the technique used
depends on the type of data.

A) For Categorical Data (e.g., gender, product category)

1. Frequency Distribution

Lists each category and the number of times it occurs.

Helps in identifying the most and least common categories.

Example:
Gender Distribution → Male: 60%, Female: 40%

2. Bar Charts / Pie Charts

Visualize proportions of categories.

Bar chart shows each category as a bar; pie chart uses segments of a circle.

B) For Numerical Data (e.g., age, income)


1. Measures of Central Tendency and Dispersion

Mean, median, mode (central values)

Standard deviation, variance, range (spread of data)

2. Histograms

Shows the frequency distribution of continuous data.

Useful for detecting skewness or normality.

Example: Histogram of ages shows a right-skewed distribution.

3. Box Plots

Visualizes median, quartiles, and outliers in data.

Example: A box plot of salaries might reveal one very high outlier.
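
A short sketch of these techniques with pandas and matplotlib; the series values are invented.

    # Sketch: univariate analysis of one variable at a time (toy data).
    import pandas as pd
    import matplotlib.pyplot as plt

    # Categorical variable: frequency distribution
    gender = pd.Series(["M", "F", "M", "M", "F"])
    print(gender.value_counts(normalize=True))  # proportion per category

    # Numerical variable: summary statistics, histogram, box plot
    ages = pd.Series([22, 25, 26, 28, 30, 31, 35, 62])  # 62 is an outlier
    print(ages.describe())  # count, mean, std, quartiles, min/max

    ages.plot(kind="hist", title="Age distribution")  # skewness check
    plt.show()
    ages.plot(kind="box", title="Age box plot")  # median, quartiles, outliers
    plt.show()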

Importance of Univariate Analysis in Machine Learning

• First step in Exploratory Data Analysis (EDA): it shows how a single feature behaves.

• Guides data preprocessing: it helps identify missing values, imbalanced classes, and the need for scaling.

• Improves feature engineering: it detects variables that need transformation (e.g., log transformation for skewed features).

Real-Life Example

An e-commerce company analyzing the “number of items per order” may discover:

Most users buy 1–2 items.

A few high-value users buy 10+ items (outliers).

This helps in segmenting users and running targeted offers.

Conclusion
Univariate analysis helps summarize and understand individual features of a dataset. It is a
fundamental step in the machine learning pipeline, providing insights into the nature of the
data and guiding downstream processes such as cleaning, visualization, and model design.
What is data visualization? Explain its importance and describe common
visualization techniques with suitable examples.
Introduction
Data visualization is the graphical representation of information and data. It allows analysts,
decision-makers, and even non-technical stakeholders to understand data trends, outliers,
and patterns quickly and intuitively. In machine learning, data visualization is a core part of
Exploratory Data Analysis (EDA), which helps in interpreting features and results.

Why Data Visualization is Important

1. Simplifies complex data: converts massive datasets into easy-to-read visuals.

2. Reveals hidden patterns and trends: seasonality, cycles, and anomalies can be spotted visually.

3. Assists in feature selection and preprocessing: visualizations reveal skewed distributions, outliers, and data imbalance.

4. Communicates insights effectively: dashboards and charts help in sharing results with non-technical teams.

5. Improves model transparency: in model interpretation, tools like SHAP and LIME use visualizations to show feature importance.

Types of Data Visualization Techniques

A) For Categorical Data

1. Bar Charts

• Displays frequency of each category.

• Example: Visualizing sales by region.

2. Pie Charts

• Shows proportions as slices of a pie.

• Example: Market share of smartphone brands.

B) For Numerical Data

1. Histograms

• Show frequency distribution of continuous variables.


• Example: Distribution of customer ages in a retail store.

2. Box Plots

• Visualize median, quartiles, and outliers.

• Example: Box plot of salary ranges in an organization.

3. Line Charts

• Track changes over time (time series).

• Example: Website traffic over months.

4. Scatter Plots

• Show relationship between two numeric variables.

• Example: Relationship between study hours and marks.
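
A hedged sketch of two of these chart types with matplotlib, using invented sample data:

    # Sketch: bar chart and scatter plot (invented data).
    import matplotlib.pyplot as plt

    # Bar chart: sales by region
    plt.bar(["North", "South", "East", "West"], [120, 95, 140, 80])
    plt.title("Sales by region")
    plt.show()

    # Scatter plot: study hours vs. marks
    plt.scatter([1, 2, 3, 4, 5, 6], [35, 45, 50, 62, 70, 78])
    plt.xlabel("Study hours")
    plt.ylabel("Marks")
    plt.title("Study hours vs. marks")
    plt.show()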

Tools Commonly Used

• Excel & Google Sheets – Basic business analysis

• Matplotlib / Seaborn (Python) – Scientific plotting

• Power BI / Tableau – Interactive dashboards

• Plotly / D3.js – Web-based visualizations

Real-World Example
A telecom company uses dashboards to monitor customer churn. A histogram reveals that most churn occurs among users with low recharge amounts, and a time-series line chart shows churn peaking during price hikes.

Conclusion
Data visualization is more than just making charts—it’s about storytelling with data. It
makes the complex comprehensible and drives informed decisions. In the ML pipeline, it
plays a key role from data exploration to communicating final results.
