
Subject Code: AD3301

Subject Name: Data Exploration and Visualization


Anna University Examination: April / May 2024
PART A
(2 MARKS)
1. Mention the key responsibilities of a data analyst.
Ans:
Data analysts are responsible for collecting, cleaning, interpreting, and presenting data. They use
statistical tools to identify patterns, prepare reports, and assist in decision-making.

2. Name some of the best tools used for data analysis and data visualization.
Ans:
For data analysis: Python, R, SQL.
For data visualization: Tableau, Power BI, Matplotlib, and Seaborn.

3. List the software and hardware components required for data visualization.
Ans:
• Software: Tableau, Excel, Python
• Hardware: High RAM, multi-core processor, GPU for rendering, high-resolution monitor

4. Draw and label a rough contour plot of the joint probability density function when ρ = -0.4.
Ans:
The contour plot should show elliptical curves tilted downward, indicating a moderate negative
correlation. (To be drawn manually on paper.)
5. Difference between normalized scaling and standardized scaling.
Ans:
• Normalized Scaling: Rescales data to [0,1] using (x - min)/(max - min)
• Standardized Scaling: Converts data to have mean = 0 and standard deviation = 1 using (x -
mean)/std

6. Illustrate important steps to be followed in preparing a base map.


Ans:
Steps include:
1. Collect spatial data
2. Georeference the data
3. Choose coordinate system
4. Add map layers
5. Label features
6. Verify for accuracy

7. The diagram represents the sales of Superclene toothpaste over the last few years. Give a reason
why it is misleading.
Ans:
The Y-axis does not start from zero, which visually exaggerates small differences in sales, making the
chart misleading.

8. How do you find the correlation of a scatter plot?


Ans:
Observe the trend of data points.
• Upward trend = positive correlation
• Downward trend = negative correlation
Use Pearson’s correlation coefficient for numerical value.

9. Define least square method in time series.


Ans:
It is a method to fit a trend line by minimizing the sum of squared differences between observed and
estimated values in time series data.

10. List the techniques used in smoothing time series.


Ans:
1. Simple Moving Average
2. Weighted Moving Average
3. Exponential Smoothing
4. LOESS/LOWESS
5. Gaussian Smoothing
PART-B
(13 MARKS)

11 (a) (i) Discuss about Descriptive Statistics in Exploratory Analysis. (7 Marks)


Answer:

Descriptive statistics summarize and organize the characteristics of a dataset, playing a vital role
in Exploratory Data Analysis (EDA) by helping understand the structure and patterns in data.
Measures of Central Tendency
• Mean:
o The arithmetic average of values in a dataset.
o Sensitive to outliers.
• Median:
o The middle value when the data is sorted.
o More robust than the mean when outliers are present.
• Mode:
o The value that appears most frequently in the dataset.
Measures of Dispersion
• Range:
o Difference between the maximum and minimum values.
o Indicates spread but is sensitive to extreme values.
• Variance:
o The average of the squared differences from the mean.
o Represents how spread out the data points are.
• Standard Deviation:
o The square root of the variance.
o Most commonly used to measure the amount of variation in data.
Measures of Shape
• Skewness:
o Describes the asymmetry of the data distribution.
o Positive skew = tail on right, Negative skew = tail on left.
• Kurtosis:
o Describes the “tailedness” or peak of the distribution.
o High kurtosis = heavy tails; Low kurtosis = light tails.
Frequency Distribution
• Uses tables, histograms, and bar charts to show how often values occur.
Five-number Summary
• Consists of: Minimum, Q1, Median, Q3, Maximum
• Used in box plots to visualize data spread and detect outliers.
Data Visualization in Descriptive Statistics
• Histograms: Show frequency distribution
• Box Plots: Show quartiles and outliers
• Bar Charts: Visualize categorical data
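Illustrative Python sketch (the marks below are hypothetical sample values, not from the question):

import pandas as pd

# Hypothetical marks of 10 students
marks = pd.Series([45, 52, 58, 60, 61, 63, 67, 70, 72, 95])

print(marks.describe())                      # count, mean, std, min, Q1, median, Q3, max
print("Mode:", marks.mode().tolist())
print("Skewness:", round(marks.skew(), 2))   # positive here: the value 95 pulls the tail to the right
print("Kurtosis:", round(marks.kurt(), 2))
print("IQR:", marks.quantile(0.75) - marks.quantile(0.25))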
11 (a) (ii) Explain in detail about Data Transformation Techniques. (6 Marks)

Answer

Data transformation converts data into a proper format or structure to improve the performance
of machine learning models and enhance interpretation. Techniques used are

Normalization (Min-Max Scaling)


• Scales data into a range of 0 to 1.
• Formula: (x - min) / (max - min)
• Useful when features have different scales.
Standardization (Z-score Scaling)
• Transforms data to have mean = 0 and standard deviation = 1.
• Formula: (x - μ) / σ
• Suitable for normally distributed data.
Logarithmic Transformation
• Reduces right skewness.
• Makes data more symmetric and helps linearize exponential trends.
Square Root and Cube Root Transformations
• Help reduce skewness in moderate-skewed data.
• Preserve zero and positive values, useful for count data.
Encoding Categorical Variables
• Label Encoding: Converts categories to integers (e.g., Male = 0, Female = 1)
• One-Hot Encoding: Creates binary columns for each category (used in ML models)
Binning (Discretization)
• Converts continuous variables into discrete categories or intervals.
• Example: Age groups (0–18, 19–35, etc.)
Handling Skewness
• Apply Box-Cox or Yeo-Johnson transformations to reduce skewness in non-normal distributions.
Feature Scaling Tools in Python
• Scikit-learn:
o MinMaxScaler, StandardScaler, PowerTransformer used for transformations.
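A minimal scikit-learn sketch of these transformations (the feature values below are made up for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PowerTransformer

# Hypothetical single feature with one large, skew-inducing value
X = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

print(MinMaxScaler().fit_transform(X).ravel())     # normalization: values rescaled to [0, 1]
print(StandardScaler().fit_transform(X).ravel())   # standardization: mean 0, std 1
print(PowerTransformer().fit_transform(X).ravel()) # Yeo-Johnson: reduces skewness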
11 (b) (i) Explain in detail about Comparative Statistics in Exploratory Analysis. (6 Marks)

Answer:
Comparative Statistics involves comparing two or more groups or variables to identify
differences, relationships, or patterns. It is a crucial component of Exploratory Data Analysis (EDA)
and helps in understanding how different variables behave across categories.
Purpose of Comparative Statistics
• Understand variability across groups (e.g., comparing regions, genders, time periods).
• Support decision-making by highlighting significant differences.
• Detect outliers or unusual behavior across subgroups.
• Serve as a precursor to inferential statistics (like hypothesis testing).
Common Comparative Statistical Measures
• Group-wise Mean / Median / Mode:
Helps understand central tendency within each subgroup.
• Standard Deviation & Variance (per group):
Measures the spread of data across different categories.
• Range & Interquartile Range (IQR):
Useful to compare dispersion in different datasets.
• Proportions & Percentages:
Used when comparing categorical variables (e.g., % of male vs female customers).
Visualization Tools for Comparison
• Box Plots by Group:
Show distribution, outliers, and spread for each category.
• Bar Charts / Clustered Bar Graphs:
Useful for visual comparison of frequencies or averages.
• Violin Plots:
Combine box plot and kernel density to visualize distribution by category.
• Side-by-side Histograms:
Useful for comparing frequency distributions.
Tabular Comparison Techniques
• Cross-tabulations (Contingency Tables):
Summarize categorical data for two variables.
• Pivot Tables (in Excel / Python):
Allow quick aggregation and comparison of metrics by row and column groups.
Basic Statistical Tests for Comparison
• T-tests:
Used to compare means between two groups.
• ANOVA (Analysis of Variance):
Used when comparing more than two group means.
Example Use Case
Scenario: A retail company compares average monthly sales across three branches (North, South,
Central).
• Uses box plots and summary tables.
• Finds South branch has higher average sales but also more variability.
• Insights: Better performance in South but less consistency → Need targeted strategy.
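A small Python sketch of such a comparison (the branch sales figures below are invented for illustration; SciPy's f_oneway performs the ANOVA):

import pandas as pd
from scipy import stats

# Hypothetical monthly sales (in lakhs) for three branches
df = pd.DataFrame({
    "branch": ["North"] * 4 + ["South"] * 4 + ["Central"] * 4,
    "sales": [40, 42, 41, 43, 55, 48, 62, 51, 44, 45, 43, 46],
})

# Group-wise central tendency and spread
print(df.groupby("branch")["sales"].agg(["mean", "std"]))

# One-way ANOVA: do the branch means differ significantly?
groups = [g["sales"].values for _, g in df.groupby("branch")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")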
11 (b) (ii) Discuss in detail about the practical use of Pivot Table in data science with suitable
example. (7 Marks).

Answer
A Pivot Table is an interactive tool used to summarize large datasets quickly. It allows users to
aggregate, group, and rearrange data dynamically to gain insights — widely used in Excel, Power BI,
and Python (Pandas).
Core Functions of Pivot Tables
• Summarize data using Sum, Count, Average, Max, Min
• Perform grouping (e.g., by date, category, region)
• Create multi-level views using rows and columns
• Support filters to drill down into subsets of data
• Enable quick insights without formulas or coding

Steps in Creating a Pivot Table (Generalized)


1. Select data range or DataFrame
2. Choose row labels (e.g., Product category)
3. Choose column labels (e.g., Region or Year)
4. Select values to summarize (e.g., Total Sales)
5. Apply filters if needed (e.g., specific product line)

Tools That Support Pivot Tables


• Excel: Built-in pivot table feature
• Power BI / Tableau: Drag-and-drop visual pivot tables
• Python (Pandas): df.pivot_table(index, columns, values, aggfunc)

Practical Use Cases in Data Science


• Sales Analysis:
Compare monthly sales by product and region.
• Customer Segmentation:
Count number of customers by age group and location.
• Performance Monitoring:
Track employee productivity across departments.
• Healthcare Analytics:
Summarize patient count by disease type and ward.

Example: Supermarket Sales Dataset


Product Region Sales
Soap East 100
Soap West 200
Shampoo East 150
Shampoo West 250
Pivot Table Summary (Sum of Sales):
Product East West
Soap 100 200
Shampoo 150 250
Insight: Shampoo performs better across both regions.
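The same summary can be reproduced in Python with Pandas (a small sketch using the example data above):

import pandas as pd

df = pd.DataFrame({
    "Product": ["Soap", "Soap", "Shampoo", "Shampoo"],
    "Region": ["East", "West", "East", "West"],
    "Sales": [100, 200, 150, 250],
})

# Rows = Product, Columns = Region, Values = sum of Sales
summary = df.pivot_table(index="Product", columns="Region", values="Sales", aggfunc="sum")
print(summary)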

Advantages of Using Pivot Tables


• Quick aggregation of large datasets
• No programming required (especially in Excel)
• Interactive and customizable for different views
• Saves time during data cleaning and EDA.
12 (a)(i) Define line plot. With an example, explain how to create a line plot to visualize the
trend. (6 Marks)

Answer

Definition:
A line plot (or line graph) is a type of chart used to display information as a series of data points
connected by straight lines.
It is useful to visualize trends over time or sequential data.

Key Characteristics:
• X-axis: Represents the independent variable (e.g., time).
• Y-axis: Represents the dependent variable (e.g., temperature, sales).
• Data points: Mark actual measurements.
• Lines: Connect data points to show the trend.

Use Cases:
• Tracking sales over months
• Monitoring temperature variation by day
• Observing stock price movement over time
• Measuring sensor output across time intervals

Steps to Create a Line Plot:


1. Prepare the data
o Organize values into a sequence (X, Y pairs).
o Example:
Day Sales
1 200
2 240
3 300
2. Choose software/tool
o Excel, Python (Matplotlib), Google Sheets, R
3. Plot X and Y axes
o X-axis: Time (Day)
o Y-axis: Value (Sales)
4. Mark data points
o Each (x, y) pair is marked with a dot or point.
5. Connect the points
o Use lines to join the points in sequence.
6. Add labels & title
o Label axes with units, add a chart title, and legend if needed.
Example (Python using Matplotlib):
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
sales = [200, 220, 250, 300, 320]

plt.plot(days, sales, marker='o')


plt.title("Daily Sales Trend")
plt.xlabel("Day")
plt.ylabel("Sales")
plt.grid(True)
plt.show()
Output
12 (a)(ii) The following table gives the lifetime of 400 neon lamps. Draw the histogram for the
below data. (7 Marks)

Given Frequency Table:


Lifetime (in hours) Number of Lamps
300–400 14
400–500 56
500–600 60
600–700 86
700–800 74
800–900 62
900–1000 48

Step-by-Step Procedure to Draw Histogram:


1. Identify class intervals
o Equal width = 100 (from 300–1000)
o These become the bins for the histogram.
2. Mark X-axis and Y-axis
o X-axis: Lifetime intervals
o Y-axis: Frequency (number of lamps)
3. Choose a suitable scale
o Y-axis: Use scale like 10 units = 1 cm, up to maximum value 86
4. Draw bars
o Each class interval becomes a bar.
o Height = frequency value (no gap between bars).
5. Label graph
o Add title: “Histogram of Neon Lamp Lifetimes”
o Label axes: Lifetime (hrs) and Number of Lamps

Table for Histogram Plotting:


Interval Frequency Midpoint (for optional line chart)
300–400 14 350
400–500 56 450
500–600 60 550
600–700 86 650
700–800 74 750
800–900 62 850
900–1000 48 950
Sketching Guidelines:
• Use graph sheet if allowed
• No gaps between bars
• Equal bar widths (class width = 100)
• Bar heights based on frequencies:
o Tallest bar: 86 (600–700 hrs class)
o Shortest bar: 14 (300–400 hrs class)

Insights from Histogram:


• The most frequent lamp lifetime is in 600–700 hrs range.
• Distribution is slightly skewed to the left, with higher frequencies around the center.
• Useful for analyzing product lifespan consistency.

Histogram program (Python)
import matplotlib.pyplot as plt

# Class boundaries and the frequency of each class
bins = [300, 400, 500, 600, 700, 800, 900, 1000]
frequencies = [14, 56, 60, 86, 74, 62, 48]

# One representative value per class, weighted by its frequency,
# so each bar spans the full class interval with no gaps
plt.hist(bins[:-1], bins=bins, weights=frequencies, edgecolor='black')

plt.title("Histogram of Neon Lamp Lifetimes")
plt.xlabel("Lifetime (in hours)")
plt.ylabel("Number of Lamps")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Output
12 (b)(i) Explain in detail about 3D Data Visualization, its components and its working flow
with suitable example. (6 Marks)

Answer:

What is 3D Data Visualization?


3D data visualization represents data in three dimensions (X, Y, Z), providing deeper insights into
multi-variable relationships, patterns, and structures that are hard to visualize in 2D.

Components of 3D Visualization
1. Axes (X, Y, Z)
o Define the dimensions of the plot.
o Each axis represents a variable or measurement.
2. 3D Plotting Engine
o Software or libraries that support rendering in 3D (e.g., Matplotlib 3D, Plotly, VTK).
3. Camera / Perspective
o Allows rotating, zooming, and panning the view.
4. Color, Shape, and Size Encodings
o Used to represent additional variables (e.g., intensity, category).
5. Interactivity Tools
o Tools like sliders, tooltips, or selection for real-time data interaction (in dashboards).
Tools Supporting 3D Visualization
• Matplotlib (Python) – Axes3D for basic 3D scatter, surface, and wireframe plots.
• Plotly (Python/JS) – Interactive, web-based 3D graphs.
• Tableau / Power BI – Supports limited 3D in dashboards.
• Unity3D / WebGL – Advanced 3D modeling and immersive visualization.

Working Flow of 3D Visualization


Step 1: Import and clean data
Step 2: Select three variables to plot (X, Y, Z)
Step 3: Choose 3D plot type (scatter, surface, contour, etc.)
Step 4: Apply encoding (color/size) for 4th or 5th variable if needed
Step 5: Render and rotate the plot to observe patterns
Step 6: Add titles, labels, and interaction controls (if supported)
Example (Using Python Matplotlib)
from mpl_toolkits.mplot3d import Axes3D  # enables the 3D projection in older Matplotlib versions
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

x = [1, 2, 3, 4]
y = [10, 15, 20, 25]
z = [5, 6, 2, 3]

ax.scatter(x, y, z)
ax.set_xlabel('X Axis')
ax.set_ylabel('Y Axis')
ax.set_zlabel('Z Axis')
plt.title("3D Scatter Plot")
plt.show()

Output

Applications of 3D Visualization
• Climate or weather modeling (Temp vs. Lat vs. Altitude)
• Geospatial analysis (Longitude, Latitude, Elevation)
• Scientific simulations (Molecular structures, Fluid dynamics)
• Financial markets (Time vs. Price vs. Volume)
12 (b)(ii) Discuss in detail about text and annotation. (7 Marks)

Answer:

Purpose of Text and Annotation in Visualization:


Text and annotation help clarify, highlight, and explain key parts of a plot or chart, improving
readability and storytelling.
Types of Text in Data Visualization
1. Title
o Explains what the chart is about.
2. Axis Labels
o Identify units and variables on X, Y, and Z axes.
3. Legends
o Describe symbols, colors, or line types used in the chart.
4. Tick Labels
o Values shown along axes.
Annotations
Annotations are custom text notes or arrows placed near specific data points to emphasize:
• Outliers
• Peaks / Valleys
• Specific events
• Trends or anomalies
Annotation Components
1. Text: Descriptive message (e.g., “Max Value Here”)
2. Arrow/Box: Optional pointer to exact location
3. Coordinates: Position (x, y) where annotation is placed
4. Style: Font, color, size, angle
Adding Annotation in Python (Matplotlib)
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [100, 150, 130, 170]

plt.plot(x, y, marker='o')
plt.title("Sales Trend")
plt.xlabel("Quarter")
plt.ylabel("Sales")

# Annotate the peak point: text placed inside the axes, arrow pointing to (4, 170)
plt.annotate('Peak Sales', xy=(4, 170), xytext=(3, 160),
             arrowprops=dict(facecolor='red', shrink=0.05))

plt.grid(True)
plt.show()
Output

Best Practices for Text & Annotation


• Keep labels short and meaningful
• Avoid overlapping text
• Use contrasting color for visibility
• Place annotations close to the point of interest
• Use consistent font and size

Applications
• Marking outliers in scatter plots
• Highlighting max/min points in line graphs
• Explaining sudden spikes/dips in trend
• Adding comments or references in dashboards
13. (a)(i) Does universe frequency distribution have variable? Justify in detail. (7 Marks)

Answer

Definition of Universe Frequency Distribution:


• A universe refers to the entire population or complete set of data points under observation.
• A frequency distribution shows how often each value or range of values occurs.
Presence of Variable:
Yes, universe frequency distributions always involve one or more variables.
Justification with Explanation:
• A variable is a measurable attribute (e.g., marks, age, income).
• The frequency distribution groups the values of this variable and counts how often they occur.
• Hence, without a variable, frequency distribution has no basis to organize data.
Example:
Suppose the variable is Age of students in a class:
Age Group Frequency
18–20 15
21–23 20
24–26 10
Here:
• Age is the variable.
• The distribution is constructed by counting how many students fall into each age range.
Conclusion:
The frequency distribution cannot exist without a variable because it’s the basis for classification and
counting. So, a universe frequency distribution always includes variables.
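A small Pandas sketch of building such a frequency distribution (the ages below are hypothetical):

import pandas as pd

# Hypothetical ages of 15 students (the variable is Age)
ages = pd.Series([18, 19, 20, 20, 21, 21, 22, 23, 23, 24, 25, 26, 19, 22, 24])

# Bin the variable into age groups, then count how many students fall in each group
groups = pd.cut(ages, bins=[17, 20, 23, 26], labels=["18-20", "21-23", "24-26"])
print(groups.value_counts().sort_index())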
13. (a)(ii) Explain in detail about scaling and standardizing. (6 Marks)

Answer

Why Transformation is Needed:


• Features with different units/scales can negatively impact model performance.
• Scaling and standardization bring uniformity to the data.

1. Scaling (Normalization):
• Rescales data between a fixed range, typically [0, 1].
• Useful for distance-based algorithms like KNN, SVM.

2. Standardization (Z-score Normalization):


• Converts data to have mean = 0 and standard deviation = 1.
• Useful when data is normally distributed or when comparing features with different units.

Summary Table:
Transformation Output Range Use When
Scaling [0, 1] Features with different scales
Standardizing ~N(0, 1) For normally distributed data
13. (b) Time Series Modeling: Why Time Series Model is Better (13 Marks)

Answer
Problem Context:
Consider training two models on the same dataset using two different techniques.
You trained:
• Model 1: Decision Tree
• Model 2: Time Series Regression Model
Conclusion: Model 2 gave better performance.

Understand Decision Tree Limitations


• Treats each row independently, without order.
• Can't handle autocorrelation or sequential patterns.
• Prone to overfitting in noisy time data.
Advantages of Time Series Regression
• Respects time order of data.
• Uses:
o Trend component
o Seasonal component
o Lag features and moving averages
• Better suited for forecasting.
Example
Assume dataset = daily website visits
• Decision Tree might split based on thresholds (e.g., day = weekend)
• Time Series model detects that weekends always have 30% more visits

Reasons Why Time Series Model Performed Better:


Factor Time Series Model Decision Tree
Time Awareness Yes No
Trend Capture Yes No
Seasonality Yes No
Suitable for Forecasting Yes No

Conclusion:
Model 2 (Time Series Regression) respects temporal structure and provides more accurate,
generalized forecasts, making it more suitable for time-based data than a decision tree.
14. (a)(i) Contingency Table with Example (7 Marks)

Answer
Definition:
• A contingency table (also known as a cross-tabulation or cross table) is a matrix-style
table used to display the frequency distribution of two or more categorical variables.
• It helps analyse the relationship or association between those variables.
• A contingency table is a powerful tool for organizing categorical data and determining
whether relationships exist between variables.
• It forms the foundation for statistical tests of independence.
Structure:
A basic 2×2 contingency table looks like this:
Category A Category B Total
Group 1 a b a+b
Group 2 c d c+d
Total a+c b+d N
• Rows represent one variable.
• Columns represent the other variable.
• Each cell shows the frequency (count) of occurrences.
Use Case:
Suppose we want to study whether gender affects purchase behaviour.
Gender Purchased Not Purchased Total
Male 30 20 50
Female 50 10 60
Total 80 30 110
• Here, the two variables are:
o Gender (Male, Female)
o Purchase Decision (Purchased, Not Purchased)
Interpretation:
• Out of 50 males, 30 purchased the product.
• Out of 60 females, 50 purchased the product.
• This suggests females are more likely to purchase in this dataset.
Applications of Contingency Tables:
• Used in Chi-square tests to test independence between variables.
• Helps in understanding categorical relationships.
• Common in market research, social sciences, healthcare analytics, etc.
Advantages:
• Easy to construct and interpret.
• Provides quick summary of data interaction.
• Facilitates statistical hypothesis testing (like Chi-square test).
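A minimal Python sketch that builds this contingency table and runs a Chi-square test of independence (counts taken from the example above):

import pandas as pd
from scipy.stats import chi2_contingency

table = pd.DataFrame({"Purchased": [30, 50], "Not Purchased": [20, 10]},
                     index=["Male", "Female"])
print(table)

# Chi-square test of independence between Gender and Purchase Decision
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p-value = {p:.4f}, dof = {dof}")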
14. (a)(ii) Percentage Table with Example (6 Marks)
Definition:
• A percentage table is a statistical table that shows the relative frequencies of categories
expressed as percentages rather than raw counts.
• It helps in comparing different groups or categories, especially when totals differ or when
visual clarity is needed.
• A percentage table is a vital tool to simplify raw numerical data into relative comparisons,
making the insights easier to interpret, communicate, and visualize.

Purpose:
• To convert raw frequencies into relative measures.
• To highlight proportions instead of absolute counts.
• To support decision-making, especially in marketing, business analysis, and surveys.

Conversion Formula:
Percentage = (Category Value ÷ Total) × 100

Example:
Let’s say a shop sold 100 items consisting of 3 different products.
Product Units Sold Percentage (%)
Soap 20 (20 / 100) × 100 = 20%
Paste 30 (30 / 100) × 100 = 30%
Shampoo 50 (50 / 100) × 100 = 50%
Total 100 100%

Interpretation:
• 50% of all sales were Shampoo.
• Soap and Paste contributed 20% and 30% respectively.
• This helps the shop identify which product is in higher demand.
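A short Pandas sketch of the conversion (using the example counts above):

import pandas as pd

sales = pd.Series({"Soap": 20, "Paste": 30, "Shampoo": 50})
percentages = sales / sales.sum() * 100   # each category as a % of total units sold
print(percentages)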

Benefits of Percentage Tables:


• Makes it easier to compare across categories.
• Supports graphical representation like pie charts and stacked bar charts.
• Useful in summarizing survey data or market share.

Application Areas:
• Business analytics: Market share comparison
• Education: Exam result summaries by subject
• Healthcare: Disease cases by percentage of population
• Finance: Portfolio asset allocation
14. (b) Scatter Plot, Correlation & Analysis (13 Marks)
Given Data:
No. of games 3 5 2 6 7 1 2 7 1 7
Scores 80 90 75 80 90 50 65 85 40 100
Step-by-Step Instructions:
Step 1: Plot Scatter Plot
• X-axis: No. of games
• Y-axis: Scores
• Plot all 10 points
Step 2: Identify Correlation Pattern
• As no. of games increases, scores also increase → Positive correlation.
Justification:
• Scores improve with more games played
• e.g., 1 game → 40, 2 games → 65, 7 games → 100
• Clear upward trend → Strong Positive Correlation
Scatter Plot correlation and types.
1. Definition of Scatter Plot
A scatter plot (also called a scatter diagram or scatter graph) is a type of plot used to visualize
the relationship between two numerical variables.
• Each point represents one observation.
• X-axis – Independent variable (e.g., time, number of games).
• Y-axis – Dependent variable (e.g., score, height).
2. Purpose of a Scatter Plot
• To identify patterns, trends, or relationships between variables.
• To detect correlation (positive, negative, or none).
• To identify outliers or clusters in data.
3. Sample Dataset Example
Let’s say we are studying the relationship between the number of games played and the scores
obtained:
No. of Games 3 5 2 6 7 1 2 7 1 7
Scores 80 90 75 80 90 50 65 85 40 100
• Plot X = Games, Y = Scores
4. Interpretation of the Scatter Plot
• As the number of games increases, the scores also tend to increase.
• This indicates a positive relationship between the two variables.
5. Types of Correlation in Scatter Plots

Correlation Type Description Visual Pattern


Positive Both variables increase together Upward slope
Negative One increases, the other decreases Downward slope
No correlation No pattern between variables Random scatter
Examples:
1. Positive Correlation:
o Hours Studied ↑ → Marks ↑
o Scatter plot rises left to right
2. Negative Correlation:
o Temperature ↑ → Hot beverage sales ↓
o Scatter plot falls left to right
3. No Correlation:
o Height vs Favorite Color
o Points scattered randomly.
7. Scatter Plot Code (Python with Matplotlib)

import matplotlib.pyplot as plt


import numpy as np
games = np.array([3, 5, 2, 6, 7, 1, 2, 7, 1, 7])
scores = np.array([80, 90, 75, 80, 90, 50, 65, 85, 40, 100])
# Best-fit line
slope, intercept = np.polyfit(games, scores, 1)
trend = slope * games + intercept
plt.scatter(games, scores, color='blue', label='Scores')
plt.plot(games, trend, '--', color='grey', label='Best-fit line')
plt.title("Scatter Plot: Games Played vs Scores")
plt.xlabel("Number of Games Played")
plt.ylabel("Scores")
plt.grid(True)
plt.legend()
plt.show()
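A short numerical check of the correlation (a sketch using NumPy's corrcoef on the same data):

import numpy as np

games = np.array([3, 5, 2, 6, 7, 1, 2, 7, 1, 7])
scores = np.array([80, 90, 75, 80, 90, 50, 65, 85, 40, 100])

# Pearson correlation coefficient between games played and scores
r = np.corrcoef(games, scores)[0, 1]
print(f"Pearson r = {r:.2f}")   # roughly 0.85, confirming a strong positive correlation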

8. Use Cases of Scatter Plots


• Education: Study time vs marks
• Health: Calorie intake vs weight
• Business: Ad spending vs sales
• Sports: Practice sessions vs performance
15 (a)(i) Explain the main components of time series data. Which of these would be most
prevalent in data relating to unemployment? (6 Marks)

Answer
What is Time Series Data?
Time series data is a collection of data points measured sequentially over time at equal
intervals such as daily, monthly, quarterly, or yearly.
Examples:
• Daily stock prices
• Monthly unemployment rates
• Annual temperature data
Main Components of Time Series:
Time series data typically consists of four key components:

1. Trend Component (T):


• Refers to the long-term upward or downward movement in the data.
• Represents the general direction of the data over time.
• Can be linear or non-linear.
Example:
If unemployment steadily rises from 5% to 8% over 5 years, this indicates a positive trend.

2. Seasonal Component (S):


• Short-term recurring patterns that occur at fixed time intervals (e.g., daily, weekly,
monthly).
• Usually driven by calendar-based factors like holidays, seasons, or financial quarters.
Example:
Ice cream sales rise in summer and drop in winter – this is seasonality.

3. Cyclical Component (C):


• Non-fixed, long-term fluctuations that happen due to economic or business cycles.
• Unlike seasonality, cycles do not follow a fixed calendar interval.
• Duration can range from several months to years.
Example:
Unemployment increases during a recession and falls during a boom.
4. Irregular/Residual Component (I):
• Random, unpredictable variations caused by unexpected events such as strikes,
pandemics, or natural disasters.
• Also called noise in the data.

Most Prevalent in Unemployment Data:


In time series data related to unemployment, the most prevalent components are:
Trend:
• Unemployment may increase or decrease over the years due to economic growth,
automation, policy changes, etc.
Cyclic:
• Unemployment is heavily influenced by economic cycles (recessions, booms, financial
crises).
• These cycles are not fixed and can last for years, distinguishing them from seasonality.
Final Answer:
The cyclical component is most prevalent in unemployment data, followed by the trend.
15 (a)(ii) Suppose... views increase in Jan–Mar and decrease in Nov–Dec. Does this
represent seasonality? Justify. (7 Marks)

Given Statement:
• You are a data scientist at Times of India.
• You observed:
o Views increase from January to March
o Views decrease in November and December
o This pattern repeats every year

What is Seasonality?
Seasonality is the component of time series data that shows repetitive patterns or fluctuations
at regular intervals (e.g., daily, monthly, quarterly).

Key characteristics of seasonality:


• Occurs at fixed time intervals
• Is predictable and cyclic
• Related to calendar events or behavioral trends

Analysis of the Situation:


Month Observed Behavior
Jan–Mar Views Increase
Apr–Oct Moderate Views
Nov–Dec Views Decrease
Does it Represent Seasonality?
Yes, the behaviour indicates seasonality.
Justification:
• The changes in viewership occur regularly every year, making them periodic.
• The fluctuation is linked to specific months (calendar effect).
• Likely caused by:
o New Year resolutions
o Exam preparations
o End-of-year holidays
• This shows a stable, predictable pattern tied to the time of year.
Seasonality vs Other Components:
Component Present? Why?
Seasonality Yes Repeats yearly at fixed intervals
Trend No No long-term increase or decrease indicated
Cyclic No Not tied to economic/business cycles
Irregular No Pattern is not random or one-time
15 b) Suppose the following data represent total revenues (in millions of constant 1995 dollars) by a car rental agency over the 11-year period 1990 to 2000:
4.0, 5.0, 7.0, 6.0, 8.0, 9.0, 5.0, 2.0, 3.5, 5.5, 6.5
Compute the 5-year moving averages for this annual time series.

Answer
To solve this problem, we need to compute the 5-year moving averages for the given 11-year
time series data from 1990 to 2000.

Given Data (Revenues in Millions of Constant 1995 Dollars)


Year Revenue
1990 4.0
1991 5.0
1992 7.0
1993 6.0
1994 8.0
1995 9.0
1996 5.0
1997 2.0
1998 3.5
1999 5.5
2000 6.5

5-Year Moving Average Formula:


• A 5-year moving average is calculated by taking the average of 5 consecutive years.
• The centered average is usually placed at the middle year of the 5-year span (i.e., the
3rd year in the 5-year window).

Calculation of 5-Year Moving Averages:


Years Covered Sum Average (Sum ÷ 5) Centered Year
1990–1994 4.0+5.0+7.0+6.0+8.0 = 30.0 30.0 ÷ 5 = 6.0 1992
1991–1995 5.0+7.0+6.0+8.0+9.0 = 35.0 35.0 ÷ 5 = 7.0 1993
1992–1996 7.0+6.0+8.0+9.0+5.0 = 35.0 35.0 ÷ 5 = 7.0 1994
1993–1997 6.0+8.0+9.0+5.0+2.0 = 30.0 30.0 ÷ 5 = 6.0 1995
1994–1998 8.0+9.0+5.0+2.0+3.5 = 27.5 27.5 ÷ 5 = 5.5 1996
1995–1999 9.0+5.0+2.0+3.5+5.5 = 25.0 25.0 ÷ 5 = 5.0 1997
1996–2000 5.0+2.0+3.5+5.5+6.5 = 22.5 22.5 ÷ 5 = 4.5 1998
Final Answer Table:
Year 5-Year Moving Average
1992 6.0
1993 7.0
1994 7.0
1995 6.0
1996 5.5
1997 5.0
1998 4.5

Python code:
# Year and revenue data
years = list(range(1990, 2001))
revenues = [4.0, 5.0, 7.0, 6.0, 8.0, 9.0, 5.0, 2.0, 3.5, 5.5, 6.5]

# Compute 5-year moving averages
moving_averages = []
centered_years = []

for i in range(len(revenues) - 4):
    five_year_avg = sum(revenues[i:i+5]) / 5
    moving_averages.append(five_year_avg)
    centered_years.append(years[i + 2])  # centre of the 5-year window

# Display results
print("Year\t5-Year Moving Average")
for year, avg in zip(centered_years, moving_averages):
    print(f"{year}\t{avg:.1f}")

Output
Year 5-Year Moving Average
1992 6.0
1993 7.0
1994 7.0
1995 6.0
1996 5.5
1997 5.0
1998 4.5
16(a) Time Series Forecasting Methods (13 marks)
Given Actual Data:
Period 1 2 3 4 5 6 7 8 9 10
Actual 974 766 727 849 693 655 854 742 717 852

1. Naïve Forecast
Forecast for period t = actual value at (t-1)
Start from Period 2:
Period Forecast
2 974
3 766
4 727
5 849
6 693
7 655
8 854
9 742
10 717

2. 3-Period Moving Average

Start from Period 4:


Period Calculation Forecast
4 (974 + 766 + 727)/3 = 822.3 822
5 (766 + 727 + 849)/3 = 780.7 781
6 (727 + 849 + 693)/3 = 756.3 756
7 (849 + 693 + 655)/3 = 732.3 732
8 (693 + 655 + 854)/3 = 734.0 734
9 (655 + 854 + 742)/3 = 750.3 750
10 (854 + 742 + 717)/3 = 771.0 771

3. 4-Period Moving Average


Start from Period 5
Period Calculation Forecast
5 (974+766+727+849)/4 = 829.0 829
6 (766+727+849+693)/4 = 758.8 759
7 (727+849+693+655)/4 = 731.0 731
8 (849+693+655+854)/4 = 762.8 763
9 (693+655+854+742)/4 = 736.0 736
10 (655+854+742+717)/4 = 742.0 742
4. Weighted Moving Average (3-2-1)

Weights: 3 (most recent), 2, 1


Start from Period 4:
Period Calculation Forecast
4 (3×727 + 2×766 + 1×974)/6 = 781.2 781
5 (3×849 + 2×727 + 1×766)/6 = 794.5 795
6 (3×693 + 2×849 + 1×727)/6 = 750.7 751
7 (3×655 + 2×693 + 1×849)/6 = 700.0 700
8 (3×854 + 2×655 + 1×693)/6 = 760.8 761
9 (3×742 + 2×854 + 1×655)/6 = 764.8 765
10 (3×717 + 2×742 + 1×854)/6 = 748.2 748

5. Weighted Moving Average (1-4-5)


Weights: 1 (oldest), 4, 5 (most recent)
Start from Period 4:

Period Calculation Forecast


4 (1×974 + 4×766 + 5×727)/10 = 767.3 767
5 (1×766 + 4×727 + 5×849)/10 = 791.9 792
6 (1×727 + 4×849 + 5×693)/10 = 758.8 759
7 (1×849 + 4×693 + 5×655)/10 = 689.6 690
8 (1×693 + 4×655 + 5×854)/10 = 758.3 758
9 (1×655 + 4×854 + 5×742)/10 = 778.1 778
10 (1×854 + 4×742 + 5×717)/10 = 740.7 741

6. Exponential Smoothing α = 0.1

Assume initial forecast F1 = A1 = 974


Period Forecast (α = 0.1) Rounded
2 974 974
3 974 + 0.1(766 - 974) = 953.2 953
4 953.2 + 0.1(727 - 953.2) = 930.58 931
5 930.58 + 0.1(849 - 930.58) = 922.42 922
6 922.42 + 0.1(693 - 922.42) = 899.48 899
7 899.48 + 0.1(655 - 899.48) = 875.03 875
8 875.03 + 0.1(854 - 875.03) = 872.93 873
9 872.93 + 0.1(742 - 872.93) = 859.84 860
10 859.84 + 0.1(717 - 859.84) = 845.56 846
7. Exponential Smoothing α = 0.8
Same formula as above:
Start from F1 = 974
Period Forecast (α = 0.8) Rounded
2 974 974
3 974 + 0.8(766 - 974) = 807.6 808
4 807.6 + 0.8(727 - 807.6) = 743.12 743
5 743.12 + 0.8(849 - 743.12) = 827.82 828
6 827.82 + 0.8(693 - 827.82) = 719.96 720
7 719.96 + 0.8(655 - 719.96) = 667.99 668
8 667.99 + 0.8(854 - 667.99) = 816.80 817
9 816.80 + 0.8(742 - 816.80) = 756.96 757
10 756.96 + 0.8(717 - 756.96) = 724.99 725
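A compact Pandas sketch that reproduces these forecasts (the column names are chosen here only for illustration):

import pandas as pd

actual = pd.Series([974, 766, 727, 849, 693, 655, 854, 742, 717, 852],
                   index=range(1, 11), name="actual")

forecasts = pd.DataFrame({"actual": actual})
forecasts["naive"] = actual.shift(1)                    # previous period's actual value
forecasts["ma3"] = actual.rolling(3).mean().shift(1)    # 3-period moving average
forecasts["ma4"] = actual.rolling(4).mean().shift(1)    # 4-period moving average
forecasts["wma_321"] = actual.rolling(3).apply(
    lambda w: (1*w[0] + 2*w[1] + 3*w[2]) / 6, raw=True).shift(1)  # weights 1-2-3, newest gets 3
# ewm(adjust=False) applies S_t = a*A_t + (1-a)*S_(t-1); shifting by one gives the forecast
forecasts["exp_0.1"] = actual.ewm(alpha=0.1, adjust=False).mean().shift(1)
forecasts["exp_0.8"] = actual.ewm(alpha=0.8, adjust=False).mean().shift(1)

print(forecasts.round(1))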

16(b) Average Seasonal Movement

Step 1: Given Quarterly Production Data


Year Q1 Q2 Q3 Q4
2002 3.5 3.8 3.7 3.5
2003 3.6 4.2 3.4 4.1
2004 3.4 3.9 3.7 4.2
2005 4.2 4.5 3.8 4.4
2006 3.9 4.4 4.2 4.6

Step 2: Compute Annual Averages


We calculate the average production for each year.
Year Total Annual Average
2002 3.5+3.8+3.7+3.5 = 14.5 14.5 / 4 = 3.625
2003 3.6+4.2+3.4+4.1 = 15.3 15.3 / 4 = 3.825
2004 3.4+3.9+3.7+4.2 = 15.2 15.2 / 4 = 3.800
2005 4.2+4.5+3.8+4.4 = 16.9 16.9 / 4 = 4.225
2006 3.9+4.4+4.2+4.6 = 17.1 17.1 / 4 = 4.275

Step 3: Compute Seasonal Indices


We compute the seasonal movement by comparing each quarter to its year’s average:
Seasonal Index = Quarter Value ÷ Annual Average
Year Q1 SI Q2 SI Q3 SI Q4 SI
2002 3.5/3.625 = 0.9655 3.8/3.625 = 1.0483 3.7/3.625 = 1.0207 3.5/3.625 = 0.9655
2003 3.6/3.825 = 0.9412 4.2/3.825 = 1.0974 3.4/3.825 = 0.8894 4.1/3.825 = 1.0714
2004 3.4/3.8 = 0.8947 3.9/3.8 = 1.0263 3.7/3.8 = 0.9737 4.2/3.8 = 1.1053
2005 4.2/4.225 = 0.9941 4.5/4.225 = 1.0655 3.8/4.225 = 0.8994 4.4/4.225 = 1.0414
2006 3.9/4.275 = 0.9123 4.4/4.275 = 1.0292 4.2/4.275 = 0.9824 4.6/4.275 = 1.0760
Step 4: Compute Average Seasonal Index for Each Quarter
Quarter Avg Seasonal Index
Q1 (0.9655 + 0.9412 + 0.8947 + 0.9941 + 0.9123) / 5 = 0.9416
Q2 (1.0483 + 1.0974 + 1.0263 + 1.0655 + 1.0292) / 5 = 1.0534
Q3 (1.0207 + 0.8894 + 0.9737 + 0.8994 + 0.9824) / 5 = 0.9531
Q4 (0.9655 + 1.0714 + 1.1053 + 1.0414 + 1.0760) / 5 = 1.0519

Step 5: Convert Seasonal Indices to Movements


Multiply each index by 100 to get seasonal movements (%):
Quarter Average Seasonal Movement
Q1 0.9416 × 100 = 94.16
Q2 1.0534 × 100 = 105.34
Q3 0.9531 × 100 = 95.31
Q4 1.0519 × 100 = 105.19

Final Answer: Average Seasonal Movements


Quarter Seasonal Movement (%)
Q1 94.16
Q2 105.34
Q3 95.31
Q4 105.19

Justification:
• Q2 and Q4 have above-average production → Seasonal peak periods.
• Q1 and Q3 have below-average production → Off-peak or slower quarters.
• These patterns help in seasonal adjustment and forecasting future trends in quarterly production.
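The same seasonal indices can be computed with a short Pandas sketch (data taken from the table above):

import pandas as pd

df = pd.DataFrame({
    "Q1": [3.5, 3.6, 3.4, 4.2, 3.9],
    "Q2": [3.8, 4.2, 3.9, 4.5, 4.4],
    "Q3": [3.7, 3.4, 3.7, 3.8, 4.2],
    "Q4": [3.5, 4.1, 4.2, 4.4, 4.6],
}, index=[2002, 2003, 2004, 2005, 2006])

annual_avg = df.mean(axis=1)                  # average production per year
seasonal_index = df.div(annual_avg, axis=0)   # each quarter divided by its year's average
print((seasonal_index.mean() * 100).round(2)) # average seasonal movement (%) per quarter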
