AI Module 3 CH2
Definition: Data is any collection of values, numbers, characters, or images that can be
recorded and used for further analysis. It may be structured (in tables), unstructured (like
images), or semi-structured (like JSON/XML formats).
Data plays a central role in machine learning:
1. Training Models: Algorithms learn from historical data. The better the quality and
variety of the data, the better the model performs.
2. Validation and Testing: Helps evaluate the model’s performance on unseen cases.
3. Feature Engineering: Important patterns and insights are extracted from data,
influencing the final predictions.
Example
In a spam email detection system, thousands of emails (labeled as spam or not) are fed to a
model. From this data, the model learns what patterns or words typically occur in spam
emails.
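The idea above can be sketched in a few lines of Python. The sketch below is illustrative only: the tiny email list, the labels, and the use of scikit-learn's CountVectorizer with a Naive Bayes classifier are assumptions for demonstration, not the actual system described.

```python
# A minimal sketch of the spam-detection idea using scikit-learn;
# the tiny labeled "emails" below are invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Win a free prize now",         # spam
    "Claim your lottery reward",    # spam
    "Meeting rescheduled to 3 pm",  # not spam
    "Project report attached",      # not spam
]
labels = ["spam", "spam", "ham", "ham"]

# The model learns which words tend to appear in spam messages
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)
print(model.predict(["Free reward, claim now"]))  # likely -> ['spam']
```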
Types of Data
a) Structured Data
• Organized into rows and columns, such as relational database tables or spreadsheets.
• Machine learning models like decision trees, logistic regression, and SVM work well
with structured data.
b) Unstructured Data
• No pre-defined format; examples include images, audio, video, and free text.
c) Semi-Structured Data
• Does not fit a rigid table but carries organizing tags or keys, such as JSON or XML files (see the short sketch below).
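A small sketch may help contrast these formats. The records below are invented purely for illustration; Python's standard csv and json modules are used as one possible way to read them.

```python
# Minimal sketch contrasting structured and semi-structured data.
import csv
import io
import json

# Structured data: fixed rows and columns, easy to load into a table.
structured_text = "id,age,city\n1,34,Pune\n2,28,Delhi\n"
rows = list(csv.DictReader(io.StringIO(structured_text)))
print(rows[0]["city"])  # -> Pune

# Semi-structured data: no rigid schema, but keys/tags give it some structure.
semi_structured_text = '{"id": 1, "profile": {"age": 34, "interests": ["cricket", "music"]}}'
record = json.loads(semi_structured_text)
print(record["profile"]["interests"])  # -> ['cricket', 'music']
```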
Different types of data require different strategies for handling, analysis, and modeling.
Understanding data types is the foundation for applying machine learning correctly and
effectively. Whether structured or unstructured, categorical or continuous, each type plays a
unique role in data-driven decision making.
What is Big Data Analytics? Explain its characteristics with suitable examples.
Introduction
Big Data Analytics refers to the process of examining large and varied data sets—often
called big data—to uncover hidden patterns, unknown correlations, market trends, and
customer preferences. It enables organizations to make better decisions using massive
amounts of data generated in real time.
Definition (Textbook-based)
Big data is defined as datasets that are so large, complex, or fast-changing that traditional
data processing methods are inadequate to handle them. Big Data Analytics applies
statistical and machine learning techniques to this data to extract useful insights.
1. Volume
• Refers to the enormous scale of data generated, typically measured in terabytes or petabytes.
• Example: E-commerce platforms log millions of transactions and clicks every day.
2. Velocity
• Refers to the speed at which data is generated and must be processed, often in real time.
• Example: Sensor streams and stock-trading feeds arrive continuously, second by second.
3. Variety
• Data comes in multiple formats: text, images, audio, logs, emails.
4. Veracity
• Refers to the trustworthiness and quality of the data.
• Example: In social media, fake news or bots may distort data reliability.
5. Value
• Refers to the useful insight that can be extracted; raw data has little worth until it is analyzed.
Conclusion
Big Data Analytics is not just about handling massive datasets—it’s about transforming that
data into meaningful insights. With the right infrastructure and tools, organizations can
unlock immense value and gain a competitive edge.
1. Descriptive Analytics
• Summarizes historical data to answer the question “What happened?”
• Example: A monthly sales report showing total revenue per region.
2. Diagnostic Analytics
• Digs deeper into the data to understand the causes behind a trend or outcome (“Why did it happen?”).
3. Predictive Analytics
• Uses machine learning and statistical models to forecast future outcomes based on
historical data (“What will happen?”).
4. Prescriptive Analytics
• Recommends the best course of action based on predictions (“What should be done?”).
• Example: A logistics company suggests the best delivery route based on real-time
traffic and weather predictions.
Comparison Table
Type | Key Question | Example
Descriptive | What happened? | Monthly sales report
Diagnostic | Why did it happen? | Finding the cause of a sudden sales drop
Predictive | What will happen? | Forecasting next quarter’s demand
Prescriptive | What should be done? | Recommending the best delivery route
Conclusion
Each type of analytics serves a unique purpose in the data analysis pipeline. While
descriptive and diagnostic analytics help understand the past and present, predictive and
prescriptive analytics empower businesses to take proactive, data-driven decisions for the
future.
Describe the Big Data Analytics Framework. Explain each stage with examples.
Introduction
A Big Data Analytics Framework is a structured approach for collecting, storing, processing,
analyzing, and visualizing massive datasets to generate actionable insights. The framework
provides a step-by-step pipeline to manage big data efficiently and integrate it with machine
learning and business intelligence systems.
1. Data Acquisition (Collection)
This stage involves gathering data from multiple sources such as web logs, social media, IoT
devices, mobile applications, sensors, and transactional systems.
Example: A fitness app collects heart rate and step count from wearable devices every
second.
Tech Used: APIs, web scrapers, data ingestion tools like Apache Flume, Kafka.
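A minimal sketch of the acquisition idea is given below, assuming a hypothetical REST endpoint for the wearable device; real pipelines would more likely push such readings through ingestion tools like Kafka or Flume.

```python
# Minimal sketch of data acquisition from a device API.
# The endpoint URL and response fields are hypothetical, used only for illustration.
import time
import requests

def collect_readings(device_url, interval_seconds=1, max_readings=5):
    """Poll a (hypothetical) wearable-device endpoint and buffer its readings."""
    readings = []
    for _ in range(max_readings):
        response = requests.get(device_url, timeout=5)
        response.raise_for_status()
        data = response.json()  # e.g. {"heart_rate": 72, "steps": 3400}
        readings.append(data)
        time.sleep(interval_seconds)
    return readings

# Example usage (only works against a real endpoint):
# readings = collect_readings("https://example.com/api/wearable/123")
```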
2. Data Storage
The collected data must be stored in a scalable and reliable environment. Depending on the
type (structured or unstructured), storage solutions may vary.
Example: Sensor data from a smart city project is stored in a distributed file system for later
analysis.
Tech Used: HDFS (Hadoop Distributed File System), NoSQL databases like MongoDB, cloud
storage like AWS S3.
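As one possible illustration, the sketch below stores a single sensor reading in MongoDB using pymongo. The connection string, database name, and fields are assumptions for demonstration, and a running MongoDB server is assumed.

```python
# Minimal sketch of storing a sensor reading in MongoDB via pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["smart_city"]["sensor_readings"]

# One made-up air-quality reading from a smart-city sensor
reading = {"sensor_id": "S-101", "type": "air_quality", "pm2_5": 42.0}
collection.insert_one(reading)  # store one document
print(collection.count_documents({"sensor_id": "S-101"}))
```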
3. Data Preprocessing (Cleaning)
Raw data is often noisy, incomplete, and inconsistent. Preprocessing prepares it for
analysis.
Tasks Include:
• Removing duplicates
• Handling missing values
• Correcting inconsistent formats and noisy entries
Example: Cleaning and formatting millions of social media posts before sentiment analysis.
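A minimal cleaning sketch using pandas is shown below; the column names and post texts are made up for illustration.

```python
# A small preprocessing sketch: deduplicate, drop missing text, normalize formatting.
import pandas as pd

posts = pd.DataFrame({
    "user": ["a1", "a1", "b2", "c3"],
    "text": ["Great product!!", "Great product!!", None, "  terrible SERVICE "],
})

clean = (
    posts.drop_duplicates()           # remove duplicate posts
         .dropna(subset=["text"])     # drop rows with missing text
         .assign(text=lambda df: df["text"].str.strip().str.lower())  # normalize text
)
print(clean)
```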
4. Data Analysis
In this stage, the cleaned data is analyzed to discover patterns, trends, or predictions using
algorithms and models.
Example: An online store predicts future sales using past purchase trends and seasonal
data.
Tools: Apache Spark, Hadoop MapReduce, machine learning models (SVM, regression, etc.)
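As a rough illustration of this stage, the sketch below fits a simple linear regression to invented monthly sales figures; it is not the store's actual model.

```python
# Minimal sales-forecast sketch with scikit-learn linear regression.
# The monthly sales figures are invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.array([[1], [2], [3], [4], [5], [6]])   # past months
sales = np.array([120, 135, 150, 160, 172, 185])    # units sold each month

model = LinearRegression().fit(months, sales)
next_month = model.predict(np.array([[7]]))
print(f"Forecast for month 7: {next_month[0]:.0f} units")
```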
5. Data Visualization and Reporting
After extracting insights, the results must be communicated clearly using charts,
dashboards, and reports.
Example: A business dashboard displays key KPIs like monthly sales, churn rate, and
inventory levels.
Summary Table
Stage | Purpose | Tools/Technologies
Data Acquisition | Collecting data from sources | IoT, APIs, Flume, Kafka
Data Storage | Storing data at scale | HDFS, MongoDB, AWS S3
Data Preprocessing | Cleaning and preparing raw data | Cleaning and formatting scripts
Data Analysis | Discovering patterns and predictions | Spark, MapReduce, ML models
Data Visualization | Communicating insights | Dashboards, charts, reports
Conclusion
The Big Data Analytics Framework provides a roadmap to systematically handle massive
datasets, ensuring that the data is accurate, processed efficiently, and turned into valuable
business insights. Following this structured pipeline helps organizations make data-driven
decisions at scale.
What are descriptive statistics? Explain key measures of central tendency and
dispersion with suitable examples.
Introduction
Descriptive statistics refer to methods that summarize and describe the essential features of
a dataset. These statistics provide simple quantitative descriptions of a collection of data
and are typically the first step in data analysis before applying machine learning algorithms.
Descriptive statistics are broadly divided into two categories:
1. Measures of Central Tendency – Indicate the typical or central value of the data.
2. Measures of Dispersion (Variability) – Indicate the spread of data around the center.
Measures of Central Tendency
a) Mean
• The arithmetic average of all values.
• Example:
Data: 10, 15, 20, 25
Mean = (10 + 15 + 20 + 25) / 4 = 17.5
b) Median
• The middle value when the data is arranged in order.
• Example:
Data: 3, 7, 8, 12, 20 → Median = 8
c) Mode
• The most frequently occurring value.
• Example:
Data: Red, Blue, Blue, Green → Mode = Blue
Measures of Dispersion
These measures show how spread out the values are around the mean.
a) Range
• The difference between the maximum and minimum values.
• Example:
Data: 4, 10, 15 → Range = 15 - 4 = 11
b) Variance
• The average of the squared differences from the mean.
• Gives an idea of how far data points are from the mean.
c) Standard Deviation (SD)
• The square root of the variance, expressed in the same units as the data.
• A high SD indicates widely spread values; a low SD shows tightly packed values.
• Example:
If the average salary = ₹50,000 with SD = ₹5,000, then most employees earn between
₹45,000 and ₹55,000.
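The measures above can be computed directly with Python's standard statistics module, as in the short sketch below, which reuses the example values from this section.

```python
# Computing central tendency and dispersion with the statistics module.
import statistics

data = [10, 15, 20, 25]
print(statistics.mean(data))                               # 17.5
print(statistics.median([3, 7, 8, 12, 20]))                # 8
print(statistics.mode(["Red", "Blue", "Blue", "Green"]))   # Blue

values = [4, 10, 15]
print(max(values) - min(values))      # range = 11
print(statistics.pvariance(values))   # population variance
print(statistics.pstdev(values))      # population standard deviation
```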
Conclusion
Descriptive statistics provide the foundation for analyzing and interpreting data.
Understanding measures like mean, median, variance, and standard deviation helps in
making informed decisions and preparing data effectively for machine learning models.
What is univariate data analysis? Explain its techniques and applications with
examples.
Introduction
Univariate data analysis is the simplest form of data analysis, where only one variable is
analyzed at a time. The purpose is to understand the underlying distribution, detect
outliers, and summarize the variable’s characteristics using statistics and visualizations.
The term “uni-variate” literally means “one variable.” Since only one variable is involved,
there are no relationships or comparisons between variables—unlike bivariate or
multivariate analysis.
There are two broad types of variables: categorical and numerical, and the technique used
depends on the type of data.
1. Frequency Distribution
• Counts how often each category or value occurs (mainly for categorical data).
Example:
Gender Distribution → Male: 60%, Female: 40%
Bar chart shows each category as a bar; pie chart uses segments of a circle.
2. Histograms
• Show how the values of a numerical variable are distributed across bins.
3. Box Plots
• Summarize a numerical variable using its median, quartiles, and outliers.
Example: A box plot of salaries might reveal one very high outlier.
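A short univariate-analysis sketch is given below using pandas and matplotlib on an invented "items per order" variable; the values are synthetic and only meant to show the summary statistics, frequency table, histogram, and box plot in code.

```python
# Univariate analysis of one variable: summary, frequencies, histogram, box plot.
import pandas as pd
import matplotlib.pyplot as plt

items_per_order = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 5, 12])

# Summary statistics for the single variable
print(items_per_order.describe())

# Frequency distribution (how often each value occurs)
print(items_per_order.value_counts().sort_index())

# Histogram and box plot of the same variable
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
items_per_order.plot.hist(bins=6, ax=axes[0], title="Histogram")
items_per_order.plot.box(ax=axes[1], title="Box plot")
plt.tight_layout()
plt.show()
```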
Real-Life Example
An e-commerce company analyzing the “number of items per order” may discover, for example, that most orders contain only one or two items, while a few unusually large orders stand out as outliers.
Conclusion
Univariate analysis helps summarize and understand individual features of a dataset. It is a
fundamental step in the machine learning pipeline, providing insights into the nature of the
data and guiding downstream processes such as cleaning, visualization, and model design.
What is data visualization? Explain its importance and describe common
visualization techniques with suitable examples.
Introduction
Data visualization is the graphical representation of information and data. It allows analysts,
decision-makers, and even non-technical stakeholders to understand data trends, outliers,
and patterns quickly and intuitively. In machine learning, data visualization is a core part of
Exploratory Data Analysis (EDA), which helps in interpreting features and results.
a) For Categorical Data
1. Bar Charts
• Compare the frequency of categories using rectangular bars.
2. Pie Charts
• Show each category as a proportion of a whole circle.
b) For Numerical Data
1. Histograms
• Show how values of a numerical variable are distributed across bins.
2. Box Plots
• Summarize a distribution using the median, quartiles, and outliers.
3. Line Charts
• Show how a value changes over time, highlighting trends.
4. Scatter Plots
• Show the relationship between two numerical variables.
Real-World Example
A telecom company uses dashboards to monitor customer churn. A histogram reveals that
most churns occur in users with low recharge amounts. A time-series line chart shows
peaks in churn during price hikes.
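The two charts described in this example could be produced roughly as in the sketch below; the recharge amounts and monthly churn counts are invented for illustration.

```python
# Rough sketch of the churn dashboard charts: a histogram and a line chart.
import matplotlib.pyplot as plt

# Histogram: recharge amounts of churned users (synthetic values)
recharge_amounts = [49, 79, 99, 49, 129, 79, 49, 199, 99, 79]

# Line chart: churned users per month (synthetic values, spike after a price hike)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
churned_users = [120, 135, 150, 310, 180, 160]

fig, axes = plt.subplots(1, 2, figsize=(9, 3))
axes[0].hist(recharge_amounts, bins=5)
axes[0].set_title("Churn vs recharge amount")
axes[0].set_xlabel("Recharge amount (₹)")

axes[1].plot(months, churned_users, marker="o")
axes[1].set_title("Churn over time")
axes[1].set_xlabel("Month")

plt.tight_layout()
plt.show()
```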
Conclusion
Data visualization is more than just making charts—it’s about storytelling with data. It
makes the complex comprehensible and drives informed decisions. In the ML pipeline, it
plays a key role from data exploration to communicating final results.