
THIRUVALLUVAR COLLEGE OF ENGINEERING AND TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

ANNA UNIVERSITY SOLVED QUESTION PAPERS

CS 3352 - FOUNDATIONS OF DATA SCIENCE
III SEMESTER

PREPARED BY
KALAIVANI.V
AP/CSE

B.E/B.Tech. DEGREE EXAMINATIONS, NOV/DEC-2022

CS 3352 - FOUNDATIONS OF DATA SCIENCE

Answer ALL questions.


PART A - (10 × 2 = 20 marks)

1. Define Data Science and Big Data.


Data Science:
An interdisciplinary field that uses scientific methods, processes, algorithms, and
systems to extract knowledge and insights from structured and unstructured data. It combines
elements from statistics, computer science, and domain expertise to analyze and interpret
complex datasets.
Big Data:
Extremely large datasets that may be analyzed computationally to reveal patterns,
trends, and associations, especially relating to human behavior and interactions. It is
characterized by the "three Vs": Volume (large amount of data), Velocity (high speed of data
generation and processing), and Variety (diverse types of data).

2. List an overview of common errors in retrieving data and the cleansing solutions to be employed.

Error                              Cleansing Solution
Missing values                     Impute using mean/median/mode or drop rows
Duplicate entries                  Use .drop_duplicates()
Inconsistent formats (e.g. dates)  Convert to a uniform format (e.g. pd.to_datetime())
Incorrect data types               Use .astype() to cast to the correct type
Outliers                           Treat with Z-score or IQR filtering
Mislabeling                        Correct using mappings/dictionaries
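
A minimal pandas sketch applying a few of these cleansing solutions to a small made-up DataFrame (the column names and values are purely illustrative):
Python
import pandas as pd

# Hypothetical raw data containing some of the errors listed above
df = pd.DataFrame({
    "name":   ["Asha", "Asha", "Ravi", "Mala"],
    "joined": ["2021-01-05", "2021-01-05", "2021-02-05", "2021-03-10"],
    "age":    ["25", "25", None, "31"],
})

df = df.drop_duplicates()                         # remove duplicate entries
df["joined"] = pd.to_datetime(df["joined"])       # convert dates to a uniform format
df["age"] = pd.to_numeric(df["age"])              # cast to a numeric type (.astype(float) also works here)
df["age"] = df["age"].fillna(df["age"].median())  # impute missing values with the median
print(df)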

3. Classify the below list of data into their types: (a) ethnic group (b) age (c) family size
(d) academic major (e) sexual preference (f) IQ score (g) net worth (dollars) (h) third-
place finish (i) gender (j) temperature and write a brief note on them.

Item Type Note


(a) Ethnic group Categorical (Nominal) No inherent order
(b) Age Quantitative (Continuous) Measured in years, can take any value
(c) Family size Quantitative (Discrete) Countable whole number
(d) Academic major Categorical (Nominal) No order among majors
(e) Sexual preference Categorical (Nominal) Labels without numeric value
(f) IQ score Quantitative (Continuous) Can be decimal, e.g. 101.5
(g) Net worth ($) Quantitative (Continuous) Can be in decimal (money)
(h) Third-place finish Ordinal (Categorical) Implies rank/order
(i) Gender Categorical (Nominal) Male, Female, etc.
(j) Temperature Quantitative (Continuous) Can have decimals; measured on a continuous scale

4. Differentiate discrete and continuous variables.

Discrete Variable Continuous Variable


Takes countable values Takes any value within a range
Example: Number of students Example: Height, Weight, Temperature
Usually integers Can include decimals
Often finite values Infinite possible values

5. What is a percentile rank? Give an example.


Percentile Rank:
The percentage of scores in a distribution that fall at or below a particular score. It indicates the relative standing of an individual within a group.
Example:
If a student scores at the 90th percentile on a test, it means they scored as well as or better than 90% of the students who took the same test.

6. Consider Helen sent 10 greeting cards to her friends and she received back 8 cards,
what is the kind of relationship it is? Brief on it.

This describes a reciprocal relationship or mutual relationship. In this context, it


signifies that the act of sending cards was largely reciprocated by the friends, indicating a
strong and active connection within their social network. The high return rate suggests mutual
engagement and appreciation.

7. List the attributes of a Numpy array. Give an example for it.


Common attributes of a NumPy array:
.shape: Tuple representing the dimensions of the array (rows, columns, etc.).
.ndim: Number of dimensions of the array.
.size: Total number of elements in the array.
.dtype: Data type of the elements in the array.
.itemsize: Size in bytes of each element of the array.
Example:
Python
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Shape: {arr.shape}") # Output: Shape: (2, 3)
print(f"Dimensions: {arr.ndim}") # Output: Dimensions: 2
print(f"Size: {arr.size}") # Output: Size: 6
print(f"Data type: {arr.dtype}") # Output: Data type: int64

8. Create a data frame with key and data pairs as Key-Data pair as A-10, B-20, A-40, C-
5, B-10, C-10. Find the sum of each key and display the result as each key group.

import pandas as pd
data = {'Key': ['A', 'B', 'A', 'C', 'B', 'C'], 'Data': [10, 20, 40, 5, 10, 10]}
df = pd.DataFrame(data)
# Group by 'Key' and sum the 'Data'
result = df.groupby('Key')['Data'].sum()
print("Sum of each key group:")
print(result)

9. What is the purpose of errorbar function in Matplotlib? Give an example.


Purpose:
The errorbar() function in Matplotlib is used to create plots with error bars, which
visually represent the uncertainty or variability in data points. This is crucial for
showing the reliability of measurements or calculations.
Example:
Python
import matplotlib.pyplot as plt
import numpy as np
x = np.array([1, 2, 3, 4])
y = np.array([2, 4, 1, 5])
y_error = np.array([0.5, 0.8, 0.3, 0.7]) # Error in y values
plt.errorbar(x, y, yerr=y_error, fmt='o', capsize=5, label='Data with Error Bars')
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Plot with Error Bars")
plt.legend()
plt.grid(True)
plt.show()

10. Showcase 3-dimensional drawing in Matplotlib with corresponding Python Code.


Python

import matplotlib.pyplot as plt


import numpy as np
from mpl_toolkits.mplot3d import Axes3D
# Create data for a 3D plot (e.g., a helix)
t = np.linspace(-4 * np.pi, 4 * np.pi, 100)
z = np.linspace(-2, 2, 100)
r = z**2 + 1
x = r * np.sin(t)
y = r * np.cos(t)
# Create a 3D figure and axes
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Plot the 3D line
ax.plot(x, y, z, label='3D Helix')
# Set labels for axes
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_zlabel('Z-axis')
ax.set_title('3D Drawing in Matplotlib (Helix)')
ax.legend()
plt.show()
PART B (5 × 13 = 65 marks)

11. (a) Examine the different facets of data with the challenges in their processing.
Structured Data

Structured data is arranged in a rows-and-columns format, which helps applications retrieve and process it easily. A database management system is used to store structured data. The term structured data refers to data that is identifiable because it is organized in a structure. The most common form of structured data or records is a database, where specific information is stored based on a methodology of columns and rows.
An Excel table is an example of structured data.

Unstructured Data
Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data, so it is difficult to retrieve the required information. Unstructured data has no identifiable structure. It can be in the form of text (documents, email messages, customer feedback), audio, video or images. Email is an example of unstructured data.
Even today, in most organizations more than 80% of the data is in unstructured form. It carries a lot of information, but extracting that information from these varied sources is a very big challenge.
Natural Language

Natural language is a special type of unstructured data. Natural language processing


enables machines to recognize characters, words and sentences, then apply meaning and
understanding to that information. This helps machines to understand language as humans do.
Natural language processing is the driving force behind machine intelligence in many modern
real-world applications. The natural language processing community has had success in entity
recognition, topic recognition, summarization, text completion and sentiment analysis.
For natural language processing to help machines understand human language, it must go through speech recognition, natural language understanding and machine translation. It is an iterative process comprising several layers of text analysis.
Machine - Generated Data
Machine-generated data is information that is created without human interaction as a result of a computer process or application activity. This means that data entered manually by an end user is not considered machine-generated. Machine data contains a definitive
record of all activity and behavior of our customers, users, transactions, applications, servers,
networks, factory machinery and so on.
Examples of machine data are web server logs, call detail records, network event logs
and telemetry. Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions
generate machine data. Machine data is generated continuously by every processor-based
system, as well as many consumer-oriented systems. It can be either structured or
unstructured. In recent years, the volume of machine data has surged. The expansion of mobile devices, virtual servers and desktops, as well as cloud-based services and RFID technologies, is making IT infrastructures more complex.
Graph-based or Network Data
Graphs are data structures to describe relationships and interactions between entities
in complex systems. In general, a graph contains a collection of entities called nodes and
another collection of interactions between a pair of nodes called edges. Nodes represent
entities, which can be of any object type that is relevant to our problem domain. By
connecting nodes with edges, we will end up with a graph (network) of nodes.
Graph databases are capable of sophisticated fraud prevention. With graph
databases, we can use relationships to process financial and purchase transactions in near-real
time. With fast graph queries, we are able to detect that, for example, a potential purchaser is
using the same email address and credit card as included in a known fraud case. Graph
databases can also help users easily detect relationship patterns such as multiple people
associated with a personal email address or multiple people sharing the same IP address but
residing in different physical addresses.
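
As a small illustration of the idea above, the following sketch (assuming the networkx library; the node names are hypothetical) builds a tiny graph of people and the identifiers they use, and surfaces identifiers shared by more than one person, the kind of relationship pattern mentioned above for fraud prevention.
Python
import networkx as nx

G = nx.Graph()
# Nodes are entities (people, email addresses, credit cards); edges record
# which person uses which identifier. All names here are made up.
G.add_edge("Person_A", "email_1")
G.add_edge("Person_B", "email_1")
G.add_edge("Person_A", "card_42")
G.add_edge("Person_B", "card_42")
G.add_edge("Person_C", "email_2")

# Identifiers connected to more than one person are candidate shared identifiers
shared = [n for n in G.nodes if n.startswith(("email", "card")) and G.degree(n) > 1]
print("Identifiers shared by multiple people:", shared)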

Audio, Image and Video


Audio, image and video are data types that pose specific challenges to a data scientist. Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers. The terms audio and video commonly refer to time-based media storage formats for sound/music and moving-picture information. Audio and video digital recordings, also referred to as audio and video codecs, can be uncompressed, lossless compressed or lossy compressed depending on the desired quality and use cases.
It is important to remark that multimedia data is one of the most important sources of
information and knowledge; the integration, transformation and indexing of multimedia data
bring significant challenges in data management and analysis. Many challenges have to be
addressed including big data, multidisciplinary nature of Data Science and heterogeneity.
Streaming Data
Streaming data is data that is generated continuously by thousands of data sources,
which typically send in the data records simultaneously and in small sizes (order of
Kilobytes). Streaming data includes a wide variety of data such as log files generated by
customers using your mobile or web applications, ecommerce purchases, in-game player
activity, information from social networks, financial trading floors or geospatial services and
telemetry from connected devices or instrumentation in data centers.
(Or)
11.(b) Explore the various steps associated with data science process and explain any
three steps of it with suitable diagrams and example.

Data Science Process

• Step 1: Discovery or Defining research goal

This step involves understanding the business problem and defining the research goals, i.e. what the project should deliver and which business question it should answer.

• Step 2: Retrieving data

This step involves collecting the data required for the project. It is the process of gaining a business understanding of the data the user has and deciphering what each piece of data means. This could entail determining exactly what data is required and the best methods for obtaining it. It also entails determining what each of the data points means in terms of the company. If we are given a data set by a client, for example, we need to know what each column and row represents.

• Step 3: Data preparation

Data can have many inconsistencies, such as missing values, blank columns and incorrect data formats, which need to be cleaned. We need to process, explore and condition data before modeling. Clean data gives better predictions.

• Step 4: Data exploration

Data exploration is related to developing a deeper understanding of the data. Try to understand how variables interact with each other, the distribution of the data and whether there are outliers. To achieve this, use descriptive statistics, visual techniques and simple modeling. This step is also called Exploratory Data Analysis (EDA).

• Step 5: Data modeling

In this step, the actual model building process starts. Here, Data scientist distributes
datasets for training and testing. Techniques like association, classification and clustering are
applied to the training data set. The model, once prepared, is tested against the "testing"
dataset.

• Step 6: Presentation and automation

Deliver the final base lined model with reports, code and technical documents in this
stage. Model is deployed into a real-time production environment after thorough testing. In
this stage, the key findings are communicated to all stakeholders. This helps to decide if the
project results are a success or a failure based on the inputs from the model.

Defining Research Goals

• To understand the project, three concepts must be understood: what, why and how.
a) What is the expectation of the company or organization?

b) Why does the company's higher authority consider this research valuable?

c) How is it part of a bigger strategic picture?

1. Learning the business domain :

Understanding the domain area of the problem is essential. In many cases, data
scientists will have deep computational and quantitative knowledge that can be broadly
applied across many disciplines. Data scientists have deep knowledge of the methods,
techniques and ways for applying heuristics to a variety of business and conceptual problems.

2. Resources :

As part of the discovery phase, the team needs to assess the resources available to
support the project. In this context, resources include technology, tools, systems, data and
people.

3. Frame the problem :

Framing is the process of stating the analytics problem to be solved. At this point, it is
a best practice to write down the problem statement and share it with the key stakeholders.
Each team member may hear slightly different things related to the needs and the problem
and have somewhat different ideas of possible solutions.

4. Identifying key stakeholders:

The team can identify the success criteria, key risks and stakeholders, which should
include anyone who will benefit from the project or will be significantly impacted by the
project. When interviewing stakeholders, learn about the domain area and any relevant
history from similar analytics projects.

5. Interviewing the analytics sponsor:

The team should plan to collaborate with the stakeholders to clarify and frame the
analytics problem. At the outset, project sponsors may have a predetermined solution that
may not necessarily realize the desired outcome. In these cases, the team must use its
knowledge and expertise to identify the true underlying problem and appropriate solution.
This person understands the problem and usually has an idea of a potential working solution.

6. Developing initial hypotheses:

This step involves forming ideas that the team can test with data. Generally, it is best
to come up with a few primary hypotheses to test and then be creative about developing
several more. These Initial Hypotheses form the basis of the analytical tests the team will use
in later phases and serve as the foundation for the findings in phase.

Retrieving Data
Retrieving the required data is the second phase of a data science project. Sometimes data scientists need to go into the field and design a data collection process. Many companies will have already collected and stored the data, and what they don't have can often be bought from third parties.

Much high-quality data is freely available for public and commercial use. Data can be stored in various formats, for example as plain text files or as tables in a database. Data may be internal or external.

1. Start working on internal data, i.e. data stored within the company

The first step for data scientists is to verify the internal data: assess the relevance and quality of the data that is readily available in the company. Most companies have a program for maintaining
key data, so much of the cleaning work may already be done. This data can be stored in
official data repositories such as databases, data marts, data warehouses and data lakes
maintained by a team of IT professionals.

Data repository is also known as a data library or data archive. This is a general term
to refer to a data set isolated to be mined for data reporting and analysis. The data repository
is a large database infrastructure, several databases that collect, manage and store data sets for
data analysis, sharing and reporting.

• Data repository can be used to describe several ways to collect and store data:

Data warehouse is a large data repository that aggregates data usually from multiple
sources or segments of a business, without the data being necessarily related. Data lake is a
large data repository that stores unstructured data that is classified and tagged with metadata.
Data marts are subsets of the data repository. These data marts are more targeted to what the
data user needs and easier to use.

Build the Models

To build the model, the data should be clean and its content properly understood. The components of model building are as follows:

a) Selection of model and variable

b) Execution of model

c) Model diagnostic and model comparison

• Building a model is an iterative process. Most models consist of the following main steps:

1. Selection of a modeling technique and variables to enter in the model

2. Execution of the model

3. Diagnosis and model comparison


Model and Variable Selection
• For this phase, consider model performance and whether the project meets all the requirements to use the model, as well as other factors:

1. Must the model be moved to a production environment and, if so, would it be easy to
implement?

2. How difficult is the maintenance on the model: how long will it remain relevant if left untouched?

3. Does the model need to be easy to explain?

Model Execution
Various programming languages are used to implement the model. For model
execution, Python provides libraries like StatsModels or Scikit-learn. These packages use
several of the most popular techniques. Coding a model is a nontrivial task in most cases, so
having these libraries available can speed up the process. Following are the remarks on
output:

a) Model fit: R-squared or adjusted R-squared is used.

b) Predictor variables have a coefficient: For a linear model this is easy to interpret.

c) Predictor significance: Coefficients are great, but sometimes not enough evidence exists
to show that the influence is there.

Linear regression works if we want to predict a value, but to classify something, classification models are used. The k-nearest neighbors method is one of the best-known methods.
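
A minimal sketch of model execution in Python, assuming the statsmodels library and made-up data, showing how the three remarks above (model fit via R-squared, predictor coefficients and predictor significance) appear in practice:
Python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.5 * x + rng.normal(scale=0.5, size=100)   # hypothetical linear relationship

X = sm.add_constant(x)        # add an intercept term
model = sm.OLS(y, X).fit()    # execute the linear model

print("R-squared (model fit):", round(model.rsquared, 3))
print("Coefficients (intercept, slope):", model.params.round(3))
print("p-values (predictor significance):", model.pvalues.round(4))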

Following commercial tools are used :

1. SAS enterprise miner: This tool allows users to run predictive and descriptive models
based on large volumes of data from across the enterprise.

2. SPSS modeler: It offers methods to explore and analyze data through a GUI.

3. Matlab: Provides a high-level language for performing a variety of data analytics, algorithms and data exploration.

12. (a) Demonstrate the different types of variables used in data analysis with an
example for each.

A variable is a characteristic or property that can take on different values. A set of recorded weights, for example, can be described not only as quantitative data but also as observations for a quantitative variable, since the various weights take on different numerical values. By the same token, Yes/No replies to a survey question (such as whether a person has a Facebook profile) can be described as observations for a qualitative variable, since the replies take on different values of either Yes or No. Given this perspective, any single observation can be described as a constant, since it takes on only one value.

Discrete and Continuous Variables

Quantitative variables can be further distinguished as discrete or continuous. A


discrete variable consists of isolated numbers separated by gaps. Discrete variables can only
assume specific values that you cannot subdivide. Typically, you count discrete values, and
the results are integers.

Examples

 Counts- such as the number of children in a family. (1, 2, 3, etc., but never 1.5)
 These variables cannot have fractional or decimal values. You can have 20 or 21 cats,
but not 20.5
 The number of heads in a sequence of coin tosses.
 The result of rolling a die.
 The number of patients in a hospital.
 The population of a country.

While discrete variables have no decimal places, the average of these values can be
fractional. For example, families can have only a discrete number of children: 1, 2, 3, etc.
However, the average number of children per family can be 2.2.

A continuous variable consists of numbers whose values, at least in theory, have no


restrictions. Continuous variables can assume any numeric value and can be meaningfully
split into smaller parts. Consequently, they have valid fractional and decimal values. In fact,
continuous variables have an infinite number of potential values between any two points.
Generally, you measure them using a scale.

Examples of continuous variables include weight, height, length, time and temperature. Other examples are durations, such as the reaction times of grade school children to a fire alarm, and standardized test scores, such as those on the Scholastic Aptitude Test (SAT).

Independent and Dependent Variables

Independent Variable

In an experiment, an independent variable is the treatment manipulated by the


investigator.

 Independent variables (IVs) are the ones that you include in the model to explain or
predict changes in the dependent variable.
 Independent indicates that they stand alone and other variables in the model do not
influence them.
 Independent variables are also known as predictors, factors, treatment variables,
explanatory variables, input variables, x-variables, and right-hand variables—because
they appear on the right side of the equals sign in a regression equation.
 It is a variable that stands alone and isn't changed by the other variables you are trying to measure. For example, someone's age might be an independent variable; other factors (such as what they eat, how much they go to school, or how much television they watch) do not change the person's age.

Dependent Variable

When a variable is believed to have been influenced by the independent variable, it is


called a dependent variable. In an experimental setting, the dependent variable is measured,
counted, or recorded by the investigator.

 The dependent variable (DV) is what you want to use the model to explain or
predict. The values of this variable depend on other variables.
 It’s also known as the response variable, outcome variable, and left-hand
variable. Graphs place dependent variables on the vertical, or Y, axis.
 A dependent variable is exactly what it sounds like: it is something that depends on other factors.

For example the blood sugar test depends on what food you ate, at which time you ate
etc. Unlike the independent variable, the dependent variable isn’t manipulated by the
investigator. Instead, it represents an outcome: the data produced by the experiment.

Confounding Variable

An uncontrolled variable that compromises the interpretation of a study is known as a


confounding variable. Sometimes a confounding variable occurs because it’s impossible to
assign subjects randomly to different conditions.

(Or)

(b) The number of friends reported by Facebook users is summarized in the following frequency distribution (FRIENDS):

Interval (Friends)   Frequency (f)
400+                 2
350–399              5
300–349              12
250–299              17
200–249              23
150–199              49
100–149              27
50–99                29
0–49                 36
Total                200

(i) What is the shape of this distribution?


(ii) Find the relative frequencies.
(iii) Find the approximate percentile rank of the interval 300-349.
(iv) Convert to a histogram.
(v) Why would it not be possible to convert to a stem and leaf display?
Answer:

(i) What is the shape of this distribution?

The distribution is right-skewed (positively skewed).

Looking at the frequencies:

 The highest frequency (49) is in the 150-199 interval.


 The frequencies generally start relatively high (36), dip slightly, then peak, and then
consistently decrease towards the higher friend counts (2, 5).
 This pattern indicates that the tail of the distribution extends more to the right (higher
values), which is characteristic of a right-skewed distribution. The histogram below
further illustrates this shape.

(ii) Find the relative frequencies.

The relative frequencies are calculated by dividing each interval's frequency by the
total number of users (200).

FRIENDS Interval Frequency Relative Frequency


0-49 36 0.180
50-99 29 0.145
100-149 27 0.135
150-199 49 0.245
200-249 23 0.115
250-299 17 0.085
300-349 12 0.060
350-399 5 0.025
400-above 2 0.010
Total 200 1.000

(iii) Find the approximate percentile rank of the interval 300-349.

To find the approximate percentile rank of the interval 300-349, we use the formula:

Percentile rank ≈ [(cumulative frequency below the class + 0.5 × frequency of the class) / total frequency] × 100


 Frequencies below 300-349:
o 0-49: 36

o 50-99: 29

o 100-149: 27

o 150-199: 49

o 200-249: 23
o 250-299: 17

 Cumulative Frequency below 300-349 = 36+29+27+49+23+17=181


 Frequency of the 300-349 interval = 12
 Total Frequency = 200

Percentile rank ≈ [(181 + 0.5 × 12) / 200] × 100 = (187 / 200) × 100 = 93.5

The approximate percentile rank of the interval 300-349 is 93.5. This means that roughly 93.5% of Facebook users fall at or below the midpoint of this interval.
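
A short Python sketch verifying this arithmetic from the grouped frequencies above:
Python
freqs = {"0-49": 36, "50-99": 29, "100-149": 27, "150-199": 49, "200-249": 23,
         "250-299": 17, "300-349": 12, "350-399": 5, "400+": 2}

total = sum(freqs.values())                  # 200
below = 36 + 29 + 27 + 49 + 23 + 17          # cumulative frequency below 300-349
rank = (below + 0.5 * freqs["300-349"]) / total * 100
print(f"Approximate percentile rank of 300-349: {rank}")   # 93.5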

(iv) Convert to a histogram.

The histogram below visually represents the frequency distribution of friends among
Facebook users. The x-axis shows the number of friends, and the y-axis shows the frequency
(number of users).
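
Since the plot itself cannot be reproduced here, the following matplotlib sketch draws it from the grouped frequencies above (contiguous bars are used to mimic a histogram of grouped data):
Python
import matplotlib.pyplot as plt

labels = ["0-49", "50-99", "100-149", "150-199", "200-249",
          "250-299", "300-349", "350-399", "400+"]
freqs = [36, 29, 27, 49, 23, 17, 12, 5, 2]

plt.bar(labels, freqs, width=1.0, edgecolor="black")
plt.xlabel("Number of friends")
plt.ylabel("Frequency (number of users)")
plt.title("Reported Facebook friends (n = 200)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()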

(v) Why would it not be possible to convert to a stem and leaf display?

It would not be possible to convert this frequency distribution into a stem and leaf
display because a stem and leaf display requires raw, individual data points or at least
precise numerical values within each bin.

1. Grouped Data: You are provided with grouped data (frequency distribution), where
individual data points are aggregated into intervals. For example, you know 36 users
have between 0-49 friends, but you don't know the exact number of friends for each
of those 36 users (e.g., whether they have 10, 25, or 48 friends).
2. Loss of Individual Detail: A stem and leaf display requires the "leaf" part to
represent the actual trailing digit(s) of individual data points. Since this individual
detail is lost when data is grouped into a frequency distribution, you cannot
reconstruct a stem and leaf plot. You only know the count within each bin, not the
specific values that contribute to that count.

In essence, a stem and leaf display needs more granular information than what a frequency
distribution provides.
13. (a) (i) Categorize the different types of relationships using Scatter plots. (7)

Scatter plots are powerful graphical tools used to visualize the relationship between
two numerical variables. By observing the pattern of points on a scatter plot, we can
categorize different types of relationships. Here are the main types:

1. Positive Linear Relationship

 Description: As the values of one variable increase, the values of the other variable
also tend to increase. The points on the scatter plot generally form an upward-sloping
straight line.
 Example: The relationship between hours studied for an exam and exam score.
Generally, as the hours studied increase, the exam score tends to increase.

2. Negative Linear Relationship

 Description: As the values of one variable increase, the values of the other variable
tend to decrease. The points on the scatter plot generally form a downward-sloping
straight line.
 Example: The relationship between number of hours spent watching TV and
grades (GPA). Often, as the hours spent watching TV increase, grades might tend to
decrease.

3. No Relationship (or Zero Correlation)

 Description: There is no discernible pattern or trend between the two variables. The
points on the scatter plot appear randomly scattered, forming a cloud with no clear
direction. Changes in one variable do not predict changes in the other.
 Example: The relationship between a person's shoe size and their IQ score. There is
no expected correlation between these two variables.

4. Non-linear Relationship (Curvilinear)

 Description: The variables are related, but the relationship does not follow a straight
line. Instead, the points form a curve (e.g., U-shaped, inverted U-shaped, exponential,
logarithmic).
 Sub-types and Examples:
o Curvilinear (e.g., U-shaped):
Example: The relationship between age and reaction time. Reaction time might
initially decrease (improve) with age, then increase (worsen) in older age, forming a U-shape.

Curvilinear (e.g., Inverted U-shaped):

Example: The relationship between stress level and performance. Low stress might
lead to low performance, moderate stress to high performance, and very high stress to low
performance, forming an inverted U-shape.

5. Strong vs. Weak Relationships

In addition to the direction (positive, negative, none) and form (linear, non-linear), we can
also describe the strength of the relationship, which refers to how closely the points cluster
around the trend (line or curve).

• Strong Relationship: Points are tightly clustered around the trend line or curve. This indicates a high degree of correlation, meaning changes in one variable are strongly associated with predictable changes in the other.

o Example (Strong Positive Linear): High correlation between daily ice cream sales and daily temperature.
 Weak Relationship: Points are more scattered and loosely clustered around the trend
line or curve. This indicates a low degree of correlation, meaning changes in one
variable are only weakly associated with changes in the other, or there's more
variability.
o Example (Weak Positive Linear): Low correlation between height and
weekly exercise hours. While generally more exercise is good, individual
variations mean the relationship might be loose.

By observing these patterns on a scatter plot, data analysts can gain valuable insights
into the interdependencies between variables, which is crucial for decision-making, predictive
modeling, and further statistical analysis.
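
A minimal matplotlib sketch that generates synthetic data (values chosen only for illustration) for four of the relationship types described above:
Python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
axes[0].scatter(x, 2 * x + rng.normal(0, 2, 100))
axes[0].set_title("Positive linear")
axes[1].scatter(x, -2 * x + rng.normal(0, 2, 100))
axes[1].set_title("Negative linear")
axes[2].scatter(x, rng.normal(0, 2, 100))
axes[2].set_title("No relationship")
axes[3].scatter(x, (x - 5) ** 2 + rng.normal(0, 2, 100))
axes[3].set_title("Curvilinear (U-shaped)")
plt.tight_layout()
plt.show()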

13.(a).(ii) Each of the following pairs represents the number of licensed drivers (X) and the number of cars (Y) for seven houses in my neighborhood:

Drivers (X)   Cars (Y)
5             4
5             3
2             2
2             2
3             2
1             1
2             2
Total         ΣX = 20
(1) Construct a scatter plot to verify a lack of pronounced curvilinearity. (2)

(2) Determine the least squares equation for these data. (Remember, you will first have to calculate r, SSy and SSx.) (2)

(3) Determine the standard error of estimate, Sy/x, given that n=7. (2)

Answer:

(a) Construct a scatterplot to verify a lack of pronounced curvilinearity.


To construct a scatterplot, plot each pair of (Drivers (X), Cars (Y)) as a point on a
graph. The x-axis represents the number of drivers, and the y-axis represents the number of
cars. The points are: (5,4), (5,3), (2,2), (2,2), (3,2), (1,1), (2,2).
Visually inspecting the scatterplot will show if the points roughly follow a straight line or
exhibit a clear curve. If they appear to follow a relatively straight line, it indicates a lack of
pronounced curvilinearity.
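
A minimal NumPy sketch for parts (2) and (3), using the textbook formulas b = r·sqrt(SSy/SSx), a = mean(Y) − b·mean(X) and s_y/x = sqrt(SSy(1 − r²)/(n − 2)) on the seven (X, Y) pairs above:
Python
import numpy as np

X = np.array([5, 5, 2, 2, 3, 1, 2])   # drivers
Y = np.array([4, 3, 2, 2, 2, 1, 2])   # cars
n = len(X)

SSx = np.sum((X - X.mean()) ** 2)
SSy = np.sum((Y - Y.mean()) ** 2)
SPxy = np.sum((X - X.mean()) * (Y - Y.mean()))
r = SPxy / np.sqrt(SSx * SSy)

b = r * np.sqrt(SSy / SSx)                      # slope of the least squares line
a = Y.mean() - b * X.mean()                     # intercept
s_yx = np.sqrt(SSy * (1 - r ** 2) / (n - 2))    # standard error of estimate

print(f"r = {r:.3f}")
print(f"Least squares equation: Y' = {b:.3f}X + {a:.3f}")
print(f"Standard error of estimate s_y/x = {s_yx:.3f}")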
(Or)

13.(b) (i) In studies dating back over 100 years, it's well established that regression toward the mean occurs between the heights of fathers and the heights of their adult sons.
Indicate whether the following statements are true or false.
(1) Sons of tall fathers will tend to be shorter than their fathers. (1)
(2) Sons of short fathers will tend to be taller than the mean for all sons. (1)
(3) Every son of a tall father will be shorter than his father. (1)
(4) Taken as a group, adult sons are shorter than their fathers. (1)
(5) Fathers of tall sons will tend to be taller than their sons. (1)
(6) Fathers of short sons will tend to be taller than their sons but shorter than the mean
for all fathers. (1)

Answer:
Regression toward the mean is a statistical phenomenon that describes the tendency of
extreme values on one measurement to be closer to the average on a second measurement. In
the context of father-son heights, it means:

 Extremely tall fathers tend to have sons who are tall, but slightly shorter than
themselves (regressing toward the average height of sons).
 Extremely short fathers tend to have sons who are short, but slightly taller than
themselves (regressing toward the average height of sons).

 Importantly, this phenomenon describes the relationship between individual pairs or


groups at the extremes, not a shift in the overall average height of the population
across generations (which can change due to other factors like nutrition).

Let's evaluate each statement:

(1) Sons of tall fathers will tend to be shorter than their fathers.

 True. This is a direct application of regression toward the mean. If a father is


exceptionally tall (an extreme value), his son's height, while likely still above the
population average, will tend to be closer to that average, meaning the son will
typically be shorter than his very tall father.

(2) Sons of short fathers will tend to be taller than the mean for all sons.

 False. Sons of short fathers will tend to be taller than their fathers (regressing up
towards the mean), but their height will likely still be below the overall mean height
for all sons. For example, if the average son is 175 cm, a very short father (e.g., 160
cm) might have a son who is 165 cm. This son is taller than his father, but still shorter
than the overall mean for all sons.

(3) Every son of a tall father will be shorter than his father.

 False. Regression toward the mean describes a tendency or a statistical average effect.
It does not apply to every single individual case. It's possible for some sons of tall
fathers to be even taller than their fathers due to genetic variation or environmental
factors.

(4) Taken as a group, adult sons are shorter than their fathers.

 False. Regression toward the mean does not imply a change in the population mean
over generations. The average height of adult sons is generally similar to, or in many
populations, even slightly taller than, their fathers due to improved nutrition and
health over time (a secular trend). The phenomenon describes the movement of
individual extreme values towards the mean, not a shift in the mean itself.

(5) Fathers of tall sons will tend to be taller than their sons.

 False. This is the reverse application of regression toward the mean. If a son is
exceptionally tall (an extreme value), his father's height, while likely above average,
will tend to be closer to the average height of fathers. Therefore, the father will
typically be shorter than his exceptionally tall son.

(6) Fathers of short sons will tend to be taller than their sons but shorter than the mean
for all fathers.

• True.

"taller than their sons": If a son is exceptionally short, his father's height (if above the
son's height) will regress towards the average, meaning the father will likely be taller than his
very short son.

"shorter than the mean for all fathers": Since the son is short, it's likely the father is
also on the shorter side, and his height (regressing from the son's extreme short height) will
tend to be closer to the mean of all fathers, but still below it, consistent with being the father
of a short son.

13.(b).(ii) Interpret the value of r2 in correlation based analysis. (7)

The value of r2 (read as "r-squared"), also known as the coefficient of determination,


is a crucial metric in correlation-based analysis, particularly in the context of linear
regression. It provides a measure of how well the regression model explains the variability of
the dependent variable.

Here's a detailed interpretation of r2:

1. Range:
o The value of r2 always falls between 0 and 1 (inclusive).

o It's the square of the Pearson correlation coefficient (r), so it's always non-
negative.

2. Core Interpretation: Proportion of Variance Explained


o The most important interpretation of r2 is that it represents the proportion (or
percentage) of the total variance in the dependent variable (Y) that can be
explained by the independent variable(s) (X) through the linear
regression model.

o In simpler terms, it tells you how much of the variability you observe in Y is
accounted for by the variations in X.

3. What Specific r2 Values Imply:

o r2=0: This indicates that the independent variable(s) (X) explains none of the
variance in the dependent variable (Y). There is no linear relationship between
X and Y, and the model does not improve prediction over simply using the
mean of Y.

o r2=1: This indicates that the independent variable(s) (X) explains all of the
variance in the dependent variable (Y). There is a perfect linear relationship,
meaning all data points fall exactly on the regression line. This is rare in real-
world data outside of deterministic relationships.

o 0< r2<1 (Most Common Scenario):

 An r2 value between 0 and 1 indicates that a certain proportion of the


variance in Y is explained by X.

 For example, if r2=0.60, it means that 60% of the variability in Y can


be explained by the linear relationship with X. The remaining 40% of
the variability in Y is unexplained by the model, possibly due to other
factors, measurement error, or inherent randomness.

4. "Goodness of Fit":

o r2 is often used as a measure of the "goodness of fit" of a linear regression


model. A higher r2 generally suggests a better fit, meaning the model's
predictions are closer to the actual observed data points.

5. Limitations and Nuances:

o Causation: A high r2 does not imply causation. It only indicates an


association. There might be confounding variables or the relationship could be
coincidental.

o Context Dependency: What constitutes a "good" r2 value is highly dependent


on the field of study. In some physical sciences, an r2 of 0.90 or higher might
be expected. In social sciences or psychology, an r2 of 0.30 or 0.40 might be
considered significant and valuable due to the complexity of human behavior.

o Linearity Assumption: r2 specifically measures the strength of a linear


relationship. If the true relationship between variables is non-linear, r2 might
be low even if there's a strong non-linear association.
o Doesn't Show Direction: r2 does not tell you the direction of the relationship
(positive or negative). For that, you need to look at the sign of the original
correlation coefficient (r) or the slope of the regression line.
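
A small NumPy sketch (with made-up numbers) showing how r and r2 are read together: the sign of r gives the direction of the relationship, and r2 gives the proportion of variance explained:
Python
import numpy as np

x = np.array([2, 4, 6, 8, 10])
y = np.array([65, 60, 58, 55, 50])     # hypothetical decreasing trend

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f} (negative sign indicates an inverse relationship)")
print(f"r2 = {r ** 2:.3f} (proportion of variance in y explained by x)")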

14. (a) Imagine you have a series of data that represents the amount of precipitation each day for a year in a given city. Load the daily rainfall statistics for the city of Chennai in 2021, given in a csv file Chennairainfall2021.csv, using Pandas; generate a histogram for rainy days, and find out the days that have high rainfall.

Answer:

To perform the requested analysis, the following steps using Python with the Pandas
and Matplotlib libraries are required:
 Load the Data:
Use Pandas to load the Chennai Rainfall 2021.csv file into a DataFrame. Assume the CSV
contains a column named 'Rainfall' representing daily precipitation and a 'Date' column for
the corresponding date.
 Generate a Histogram:
Create a histogram of the 'Rainfall' column to visualize the distribution of daily rainfall
amounts. This will help understand the frequency of different rainfall levels.
 Identify High Rainfall Days:
Define a threshold to determine what constitutes "high rainfall." Filter the DataFrame to
display only the days where the rainfall exceeds this defined threshold.

import pandas as pd

import matplotlib.pyplot as plt

# 1. Load the data

try:
    df = pd.read_csv('Chennai Rainfall 2021.csv')
except FileNotFoundError:
    print("Error: 'Chennai Rainfall 2021.csv' not found. Please ensure the file is in the correct directory.")
    exit()

# Ensure 'Date' column is in datetime format for potential future use

df['Date'] = pd.to_datetime(df['Date'])

# 2. Generate a histogram for rainy days


plt.figure(figsize=(10, 6))

plt.hist(df['Rainfall'], bins=30, color='skyblue', edgecolor='black')

plt.title('Histogram of Daily Rainfall in Chennai (2021)')

plt.xlabel('Rainfall (mm)')

plt.ylabel('Frequency of Days')

plt.grid(axis='y', alpha=0.75)

plt.show()

# 3. Find out the days that have high rainfall

# Define a threshold for high rainfall (e.g., 50 mm)

high_rainfall_threshold = 50

high_rainfall_days = df[df['Rainfall'] > high_rainfall_threshold]

print(f"\nDays with rainfall exceeding {high_rainfall_threshold} mm in Chennai (2021):")

print(high_rainfall_days[['Date', 'Rainfall']].sort_values(by='Rainfall', ascending=False))

(Or)

(b) Consider that, an E-Commerce organization like Amazon, have different regions
sales as NorthSales, SouthSales, WestSales, EastSales.csv files. They want to combine
North and West region sales and South and East sales to find the aggregate sales of
these collaborating regions Help them to do so using Python code.

Answer:

To help the E-Commerce organization combine their regional sales data, I'll provide
Python code using the Pandas library. Since I don't have access to your specific .csv files, I
will demonstrate this by creating sample DataFrames that mimic the structure of your
NorthSales.csv, SouthSales.csv, WestSales.csv, and EastSales.csv files.

The code will perform the following steps:

1. Create Sample Data: Generate sample sales data for North, South, West, and East
regions.
2. Load Data (Simulated): Treat these sample DataFrames as if they were loaded
from .csv files.
3. Combine North and West Sales: Merge the sales data from the North and West
regions.
4. Combine South and East Sales: Merge the sales data from the South and East
regions.
5. Calculate Aggregate Sales: Sum the sales for each of the newly combined regions.
6. Display Results: Print the combined sales dataframes and their total aggregate sales.

Python
import pandas as pd

# --- Step 1: Create Sample DataFrames (simulating loading from CSV files) ---
# In a real scenario, you would replace these with:
# north_sales_df = pd.read_csv('NorthSales.csv')
# south_sales_df = pd.read_csv('SouthSales.csv')
# west_sales_df = pd.read_csv('WestSales.csv')
# east_sales_df = pd.read_csv('EastSales.csv')

# Sample data for North Region


north_data = {
'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']),
'Sales': [1500, 1600, 1450, 1700, 1550]
}
north_sales_df = pd.DataFrame(north_data)
print("--- North Region Sales (Sample Data) ---")
print(north_sales_df)
print("\n")

# Sample data for West Region


west_data = {
'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']),
'Sales': [1200, 1300, 1150, 1400, 1250]
}
west_sales_df = pd.DataFrame(west_data)
print("--- West Region Sales (Sample Data) ---")
print(west_sales_df)
print("\n")

# Sample data for South Region


south_data = {
'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']),
'Sales': [2000, 2100, 1950, 2200, 2050]
}
south_sales_df = pd.DataFrame(south_data)
print("--- South Region Sales (Sample Data) ---")
print(south_sales_df)
print("\n")

# Sample data for East Region


east_data = {
'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']),
'Sales': [1800, 1900, 1750, 2000, 1850]
}
east_sales_df = pd.DataFrame(east_data)
print("--- East Region Sales (Sample Data) ---")
print(east_sales_df)
print("\n")

# --- Step 2: Combine North and West Region Sales ---


# We use pd.concat to stack the DataFrames vertically.
# 'ignore_index=True' resets the index for the combined DataFrame.
north_west_sales_df = pd.concat([north_sales_df, west_sales_df], ignore_index=True)

print("--- Combined North & West Region Sales ---")


print(north_west_sales_df)
print("\n")

# --- Step 3: Combine South and East Region Sales ---


south_east_sales_df = pd.concat([south_sales_df, east_sales_df], ignore_index=True)

print("--- Combined South & East Region Sales ---")


print(south_east_sales_df)
print("\n")

# --- Step 4: Find Aggregate Sales for each combined region ---

aggregate_north_west_sales = north_west_sales_df['Sales'].sum()
aggregate_south_east_sales = south_east_sales_df['Sales'].sum()

print(f"Aggregate Sales for North & West Regions: ${aggregate_north_west_sales:,.2f}")


print(f"Aggregate Sales for South & East Regions: ${aggregate_south_east_sales:,.2f}")

Explanation:

1. import pandas as pd: This line imports the Pandas library, which is essential for
working with DataFrames.
2. Sample Data Creation: I've created four pd.DataFrame objects (north_sales_df,
south_sales_df, west_sales_df, east_sales_df) using dictionaries. Each dictionary
contains a 'Date' column (converted to datetime objects for proper handling) and a
'Sales' column. In your actual scenario, you would replace these sections with
pd.read_csv('YourFileName.csv').
3. pd.concat([df1, df2], ignore_index=True): This is the core function used for
combining the sales data.
o [df1, df2] is a list of the DataFrames you want to combine.

o ignore_index=True ensures that the index of the new combined DataFrame is


reset, rather than keeping the original indices from the individual DataFrames,
which can lead to duplicate indices.
4. df['Sales'].sum(): After combining the DataFrames, we simply select the 'Sales'
column from each combined DataFrame and use the .sum() method to get the total
aggregate sales.

15. (a) How text and image annotations are done using Python? Give an example of
your own with appropriate Python code.

Text and image annotations in Python involve adding information or labels to text
data or visual elements within images. This is commonly done for tasks such as data labeling
for machine learning, creating visual guides, or adding metadata.

Image Annotation with Text using Pillow (PIL):


The Pillow library (PIL Fork) is a popular choice for image manipulation, including adding
text annotations.
from PIL import Image, ImageDraw, ImageFont

def annotate_image_with_text(image_path, text, position, font_path=None, font_size=30,
                             fill_color=(255, 0, 0)):
    """
    Annotates an image with text at a specified position.

    Args:
        image_path (str): Path to the input image.
        text (str): The text to be added.
        position (tuple): (x, y) coordinates for the top-left corner of the text.
        font_path (str, optional): Path to a TrueType font file (.ttf). Defaults to None (uses default Pillow font).
        font_size (int, optional): Size of the font. Defaults to 30.
        fill_color (tuple, optional): RGB tuple for the text color. Defaults to red (255, 0, 0).
    """
    try:
        img = Image.open(image_path).convert("RGB")
        draw = ImageDraw.Draw(img)

        if font_path:
            try:
                font = ImageFont.truetype(font_path, font_size)
            except IOError:
                print(f"Warning: Font file not found at {font_path}. Using default font.")
                font = ImageFont.load_default()
        else:
            font = ImageFont.load_default()

        draw.text(position, text, fill=fill_color, font=font)
        img.save("annotated_image.jpg")
        print("Image annotated successfully and saved as 'annotated_image.jpg'.")
    except FileNotFoundError:
        print(f"Error: Image file not found at {image_path}")
    except Exception as e:
        print(f"An error occurred: {e}")

# Example Usage:

# Create a dummy image for demonstration if you don't have one
try:
    dummy_img = Image.new('RGB', (400, 300), color='white')
    dummy_img.save('dummy_image.jpg')
    print("Dummy image 'dummy_image.jpg' created.")
except Exception as e:
    print(f"Could not create dummy image: {e}")

# Annotate the dummy image
annotate_image_with_text(
    image_path="dummy_image.jpg",
    text="Hello, Annotation!",
    position=(50, 50),
    font_size=40,
    fill_color=(0, 0, 255)  # Blue color
)

# You can also use a custom font:

# annotate_image_with_text(

# image_path="dummy_image.jpg",

# text="Custom Font Example",

# position=(50, 150),

# font_path="/path/to/your/font.ttf", # Replace with your font path

# font_size=30,

# fill_color=(0, 128, 0) # Green color

#)

Explanation:
 Import necessary modules: Image, ImageDraw, and ImageFont from PIL.
 Open the image: Image.open(image_path).convert("RGB") loads the image and
converts it to RGB mode for consistent color handling.
 Create a drawing object: ImageDraw.Draw(img) creates an object that allows
drawing operations on the image.
 Load the font: ImageFont.truetype() loads a custom TrueType font if font_path is
provided, otherwise ImageFont.load_default() is used.
 Add text: draw.text(position, text, fill=fill_color, font=font) draws the
specified text at the position with the given fill_color and font.
 Save the annotated image: img.save("annotated_image.jpg") saves the modified
image.
Text Annotation (e.g., for NLP):
For text annotation in Natural Language Processing (NLP), you typically use libraries like
spaCy or NLTK to identify and label entities, parts of speech, or other linguistic features
within a text. While this doesn't involve visual annotation, it's a crucial form of "annotation."
import spacy

def annotate_text_entities(text):
    """
    Annotates text to identify named entities using spaCy.

    Args:
        text (str): The input text.

    Returns:
        list: A list of dictionaries, each representing an identified entity.
              Each dictionary contains 'text', 'start_char', 'end_char', and 'label'.
    """
    try:
        nlp = spacy.load("en_core_web_sm")  # Load a small English model
        doc = nlp(text)
        entities = []
        for ent in doc.ents:
            entities.append({
                "text": ent.text,
                "start_char": ent.start_char,
                "end_char": ent.end_char,
                "label": ent.label_
            })
        return entities
    except OSError:
        print("SpaCy model 'en_core_web_sm' not found. "
              "Please run: python -m spacy download en_core_web_sm")
        return []

# Example Usage:

sample_text = "Apple Inc. was founded by Steve Jobs and Steve Wozniak in Cupertino,
California."

annotated_entities = annotate_text_entities(sample_text)

print("\nText Annotations (Named Entities):")


for entity in annotated_entities:
    print(f"  Text: '{entity['text']}', Label: '{entity['label']}', "
          f"Start: {entity['start_char']}, End: {entity['end_char']}")

Explanation:
 Import spaCy:
Imports the necessary library.
 Load a spaCy model:
spacy.load("en_core_web_sm") loads a pre-trained English language model for entity
recognition.
 Process the text:
nlp(text) processes the input text, applying various NLP tasks including named entity
recognition.
 Extract entities:
doc.ents provides access to the identified named entities, each with properties
like text, start_char, end_char, and label.
 Format and return:
The function formats the extracted entities into a list of dictionaries for easier use.

(Or)

15. (b) Appraise the following (i) Histograms (ii) Binnings (iii) Density with appropriate
Python code.

(i) Histogram
A histogram is a graphical representation of the distribution of numerical data. It divides the
range of values in a continuous variable into a series of intervals (bins) and displays the count
or frequency of data points falling into each bin as bars. The x-axis represents the bins, and
the y-axis represents the frequency or density.
Python
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data


data = np.random.randn(1000)

# Create a histogram
plt.hist(data, bins=30, edgecolor='black', alpha=0.7)
plt.title('Histogram of Sample Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

(ii) Binning
Binning, also known as bucketing, is the process of grouping continuous data into a set of
discrete intervals or "bins." This is a fundamental step in creating histograms. The choice of
bin size and number significantly impacts the appearance and interpretation of the histogram,
with too few bins potentially over-smoothing the distribution and too many bins introducing
noise.
Python
import numpy as np

# Sample data
data = np.array([1.2, 2.3, 3.3, 3.1, 1.7, 3.4, 2.1, 1.25, 1.3])

# Define custom bins


bins = [1, 2, 3, 4]

# Perform binning using numpy.histogram


counts, bin_edges = np.histogram(data, bins=bins)

print(f"Counts per bin: {counts}")


print(f"Bin edges: {bin_edges}")

(iii) Density
Density, in the context of histograms and data visualization, refers to the probability
density function (PDF) of a continuous variable. When a histogram is normalized to represent
density, the area of each bar corresponds to the proportion of data points within that bin, and
the total area under the histogram sums to 1. Density plots, often generated using Kernel
Density Estimation (KDE), provide a smooth representation of the data's underlying
distribution.
Python
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data


data = np.random.randn(1000)

# Create a histogram with a superimposed density plot (KDE)


sns.histplot(data, bins=30, kde=True, color='skyblue', edgecolor='black')
plt.title('Histogram with Density Plot')
plt.xlabel('Value')
plt.ylabel('Density')
plt.grid(True)
plt.show()
PART C - (1 × 15 = 15 marks)
16. (a) Perform an exploratory data analysis for the following data with different types
of plots:
The dataset contains cases from a study that was conducted between 1958 and 1970 at
the University of Chicago's Billings Hospital on the survival of patients who had
undergone surgery for breast cancer.
Data attributes:-
Age of patient at the time of operation (numerical)
Patient's year of operation (year-1900, numerical)
Number of positive axillary nodes detected (numerical)
Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years

Answer:

An exploratory data analysis (EDA) is a critical first step in any data science project.
It involves using visualization and summary statistics to understand the main characteristics
of a dataset, uncover patterns, detect outliers, and check assumptions.

Below is a comprehensive exploratory data analysis of the breast cancer survival


dataset you described, using a variety of plots to visualize the relationships between the
attributes. Since the raw data was not provided, I have created a simulated dataset that
matches the attributes and description you gave.

The analysis will include:

1. Initial Data Inspection: Loading and summarizing the data to understand its
structure.
2. Univariate Analysis: Plots showing the distribution of each individual numerical
attribute (Age, Year of Operation, Axillary Nodes).

3. Bivariate Analysis: Plots comparing each numerical attribute against the Survival
Status to see how they relate to the outcome.

4. Multivariate Analysis: A pairplot to visualize the relationships between all pairs of


numerical attributes, colored by Survival Status.
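
A minimal sketch of the plotting code for these four steps (the file name haberman.csv and
the column names used below are assumptions, not part of the original question):

Python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file name and column names for the survival dataset
cols = ['age', 'op_year', 'axil_nodes', 'surv_status']
df = pd.read_csv('haberman.csv', names=cols)

# 1. Initial data inspection
df.info()
print(df.describe())

# 2. Univariate analysis: distribution of each numerical attribute
for col in ['age', 'op_year', 'axil_nodes']:
    sns.histplot(df[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

# 3. Bivariate analysis: each attribute against survival status
for col in ['age', 'op_year', 'axil_nodes']:
    sns.boxplot(x='surv_status', y=col, data=df)
    plt.title(f'{col} vs. survival status')
    plt.show()

# 4. Multivariate analysis: pairwise relationships coloured by survival status
sns.pairplot(df, hue='surv_status')
plt.show()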

Exploratory Data Analysis

Based on the generated plots, here is a summary of the key insights into the dataset:

1. Univariate Distributions:

 Age: The age distribution appears to be roughly bell-shaped (normal), with most
patients falling in the 40-60 year range.
 Year of Operation: The operations are spread fairly evenly across the years of the
study (1958-1970).
 Axillary Nodes: The distribution of axillary nodes is heavily skewed to the right. A
large majority of patients have a very low number of positive nodes, with a long tail
extending to higher values.

2. Bivariate Analysis (Feature vs. Survival Status):

 Age vs. Survival: The boxplots for Age show that the median age for patients who
survived is very similar to those who did not. While there's a slight difference in the
median, the distributions largely overlap, suggesting that age, by itself, may not be a
strong predictor of survival.
 Year of Operation vs. Survival: The distributions for Year of Operation are also
very similar for both groups. This suggests that the year the surgery was performed,
within the study's timeframe, is not a major factor in predicting survival.
 Axillary Nodes vs. Survival: The most significant difference is seen in the Axillary
Nodes plot. The boxplot for patients who died shows a slightly higher median number
of positive nodes and a larger interquartile range compared to those who survived.
This indicates that patients with a higher number of positive axillary nodes are more
likely to have a lower survival rate. This feature appears to be the most indicative of
survival status among the three.

3. Multivariate Analysis (Pairplot):

 The pairplot confirms the findings from the bivariate analysis. The scatter plots
comparing Age, Year of Operation, and Axillary Nodes show no strong linear
correlations between the numerical features themselves.
 Crucially, when the points are colored by Survival Status, the axillary_nodes plots
stand out. The non-survivor patients (class 2) tend to be clustered in the areas with a
slightly higher number of positive nodes, particularly when plotted against age.
In conclusion, this exploratory data analysis suggests that the number of positive axillary
nodes is likely the most important feature for predicting survival status in this dataset, while
Age and Year of Operation seem to have less predictive power on their own.

(Or)
16. (b) Assume that an r of -.80 describes the strong negative relationship between years of
heavy smoking (X) and life expectancy (Y).
Assume, furthermore, that the distributions of heavy smoking and life expectancy each
have the following means and sums of squares: X̄ = 5, Ȳ = 60, SSX = 35, SSY = 70
(1) Determine the least squares regression equation for predicting life expectancy from
years of heavy smoking. (3)
(ii) Determine the standard error of estimate, Sy/x, assuming that the correlation of -.80
was based on n = 50 pairs of observations. (3)
(iii) Supply a rough interpretation of Sy/x.(3)
(iv) Predict the life expectancy for John, who has smoked heavily for 8 years. (3)
(v) Predict the life expectancy for Katie, who has never smoked heavily.
Answer:

First, let's clarify the given values. The problem states a "strong negative relationship"
with a value of "80". In standard statistical notation, this implies a correlation coefficient ( r)
of -0.80. The remaining values are:

 Correlation coefficient, r=−0.80


 Mean of years of heavy smoking, Xˉ=5

 Mean of life expectancy, Yˉ=60

 Sum of Squares for X, SSX=35

 Sum of Squares for Y, SSY=70

 Number of observations, n=50

Based on the information provided, let's solve each part of the problem.

(i) Determine the least squares regression equation for predicting life expectancy from
years of heavy smoking.

Slope: b = r × √(SSY / SSX) = (-0.80) × √(70 / 35) = (-0.80)(1.41) ≈ -1.13
Intercept: a = Ȳ - b(X̄) = 60 - (-1.13)(5) = 60 + 5.65 = 65.65
Least squares regression equation: Y' = 65.65 - 1.13X

(ii) Standard error of estimate, Sy/x, with n = 50 pairs of observations:

Sy/x = √[SSY(1 - r²) / (n - 2)] = √[70(1 - 0.64) / 48] = √(25.2 / 48) = √0.525 ≈ 0.72

(iii) Rough interpretation of Sy/x:

On average, predictions of life expectancy made with this regression equation will be off
by roughly 0.72 years (about three-quarters of a year) from the actual life expectancies.

(iv) Predict the life expectancy for John, who has smoked heavily for 8 years.

Y' = 65.65 - 1.13(8) = 65.65 - 9.04 = 56.61 ≈ 56.6 years

(v) Predict the life expectancy for Katie, who has never smoked heavily.

Y' = 65.65 - 1.13(0) = 65.65 ≈ 65.7 years
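
A minimal NumPy check of these hand calculations (the variable names are illustrative):

Python
import numpy as np

r, x_bar, y_bar, ss_x, ss_y, n = -0.80, 5, 60, 35, 70, 50

b = r * np.sqrt(ss_y / ss_x)                  # slope, about -1.13
a = y_bar - b * x_bar                         # intercept, about 65.66 (65.65 if b is rounded first)
s_yx = np.sqrt(ss_y * (1 - r**2) / (n - 2))   # standard error of estimate, about 0.72

print(round(b, 2), round(a, 2), round(s_yx, 2))
print(round(a + b * 8, 1))                    # John (8 years), about 56.6 years
print(round(a + b * 0, 1))                    # Katie (0 years), about 65.7 years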
B.E/B.Tech. DEGREE EXAMINATIONS, APRIL/MAY 2023

CS 3352 – FOUNDATIONS OF DATA SCIENCE

Answer ALL questions.


PART A-(10 x 2 = 20 marks)

1. Outline the difference between structured data and unstructured data.


Structured Data:
Organized in a defined format, typically in tables with rows and columns,
making it easy to store, query, and analyze (e.g., relational databases, spreadsheets).
Unstructured Data:
Lacks a predefined format or organization, making it more challenging to
process and analyze using traditional methods (e.g., text documents, images, audio
files).
2. Define data mining.
Data mining is the process of discovering patterns, insights, and knowledge
from large datasets using various techniques from statistics, artificial intelligence, and
machine learning.
3. Compare and contrast qualitative data and quantitative data with an example.
Qualitative Data:
Descriptive and non-numerical, representing qualities or characteristics (e.g.,
colors, opinions, feelings).
Quantitative Data:
Numerical and measurable, representing quantities or amounts (e.g., age,
height, temperature).
4. List the differences between a discrete variable and a continuous variable with an
example.
Discrete Variable:
Can only take a finite or countably infinite number of values, often integers,
and typically represents counts (e.g., number of students in a class).
Continuous Variable:
Can take any value within a given range, including decimals and fractions, and
typically represents measurements (e.g., height of a person).
5. What is the use of scatter plot?
A scatter plot is used to visualize the relationship between two numerical
variables, helping to identify patterns, correlations, or trends between them.
6. Define correlation coefficient.
The correlation coefficient is a statistical measure that quantifies the strength
and direction of a linear relationship between two variables, ranging from -1 (perfect
negative correlation) to +1 (perfect positive correlation).
7. State the advantages of using Numpy arrays.
Efficiency:
Faster computation compared to Python lists due to optimized C
implementations.
Memory Efficiency:
Uses less memory for storing large datasets.
Broadcasting:
Allows operations on arrays of different shapes.
Rich Functionality:
Provides a wide range of mathematical functions for array operations.
8. Outline the two types of Numpy's UFuncs.
Unary UFuncs:
Operate on a single input array (e.g., np.sin(), np.exp()).
Binary UFuncs:
Operate on two input arrays (e.g., np.add(), np.multiply()).
9. State the two possible options in IPython notebook used to embed graphics directly in
the notebook.
Using the %matplotlib inline magic command (embeds static images of plots in the notebook).
Using the %matplotlib notebook magic command (embeds interactive plots in the notebook).
10. How plt.scatter function differs from plt.plot function?
plt.scatter() is specifically designed to create scatter plots, where individual data
points are plotted as markers and can be customized based on additional variables (e.g., size,
color).
plt.plot() is a more general-purpose function used to draw lines and/or markers
connecting data points, typically for visualizing trends or series.
PART – B (5 × 13 = 65)

11 (a).Elaborate about the steps in the data science process with a diagram. (13 marks)

The data science process is a structured approach for extracting knowledge and
insights from data. It generally involves six key steps: Problem Framing, Data Collection,
Data Preparation, Exploratory Data Analysis, Model Building, and Communication &
Deployment. Each step is crucial for ensuring that the final analysis is accurate, relevant, and
actionable.

1. Problem Framing:
This initial step involves clearly defining the business or research problem that the
data science project aims to address. Understanding the objectives and context is vital for
guiding the entire process. A well-defined problem statement sets the direction for the
project and helps in identifying the right data sources and analytical approaches.

2. Data Collection:
Once the problem is defined, the next step is to gather the necessary data. This
involves identifying and accessing relevant data from various sources, both internal and
external. This could include databases, spreadsheets, APIs, or even external datasets. Data
collection may also involve data wrangling and cleaning.
3. Data Preparation:
This crucial step involves cleaning, transforming, and preparing the data for
analysis. It often includes handling missing values, outliers, and inconsistencies in the
data. Data preparation ensures the data is in a suitable format for modeling and analysis.
4. Exploratory Data Analysis (EDA):
EDA involves exploring the data to gain a deeper understanding of its
characteristics, patterns, and relationships. Techniques like data visualization, statistical
analysis, and summary statistics are used to identify trends, outliers, and potential
insights. EDA helps in identifying the most relevant variables and features for modeling.
5. Model Building:
This step involves selecting and training appropriate machine learning models to
solve the defined problem. Different algorithms can be used depending on the nature of the
problem (e.g., classification, regression, clustering). Model building also includes
evaluating the model's performance and optimizing it for better results.
6. Communication and Deployment:
The final step involves communicating the findings and insights from the analysis,
often through visualizations, reports, or dashboards. If the analysis is to be used in a
business context, the model may be deployed for real-time predictions or automated
decision-making. Continuous monitoring and maintenance of the deployed model are also
crucial.
(Or)
11. (b).What is a data warehouse? Outline the architecture of a data warehouse with a
diagram. (13 marks)

Data warehousing is the process of constructing and using a data warehouse. A data
warehouse is constructed by integrating data from multiple heterogeneous sources that
support analytical reporting, structured and/or ad hoc queries, and decision making. Data
warehousing involves data cleaning, data integration, and data consolidations.

Data Warehouse
Although a data warehouse and a traditional database share some similarities, they are
not the same. The main difference is that in a database, data is collected for
multiple transactional purposes. However, in a data warehouse, data is collected on an
extensive scale to perform analytics. Databases provide real-time data, while warehouses
store data to be accessed for big analytical queries.
Bottom Tier
The bottom tier or data warehouse server usually represents a relational database
system. Back-end tools are used to cleanse, transform and feed data into this layer.
Middle Tier
The middle tier represents an OLAP server that can be implemented in two ways. The
ROLAP or Relational OLAP model is an extended relational database management system
that maps multidimensional data process to standard relational process. The MOLAP or
multidimensional OLAP directly acts on multidimensional data and operations.
Top Tier
This is the front-end client interface that gets data out from the data warehouse. It
holds various tools like query tools, analysis tools, reporting tools, and data mining tools.
Data Warehousing integrates data and information collected from various sources into
one comprehensive database. For example, a data warehouse might combine customer
information from an organization’s point-of-sale systems, its mailing lists, website, and
comment cards. It might also incorporate confidential information about employees, salary
information, etc. Businesses use such components of data warehouse to analyze customers.
Data mining is one of the features of a data warehouse that involves looking for
meaningful data patterns in vast volumes of data and devising innovative strategies for
increased sales and profits.
Types of Data Warehouse
There are three main types of data warehouse.
Enterprise Data Warehouse (EDW)
This type of warehouse serves as a key or central database that facilitates decision-
support services throughout the enterprise. The advantage to this type of warehouse is that it
provides access to cross-organizational information, offers a unified approach to data
representation, and allows running complex queries.
Operational Data Store (ODS)
This type of data warehouse refreshes in real-time. It is often preferred for routine
activities like storing employee records. It is required when data warehouse systems do not
support reporting needs of the business.
Data Mart
A data mart is a subset of a data warehouse built to maintain a particular department,
region, or business unit. Every department of a business has a central repository or data mart
to store data. The data from the data mart is stored in the ODS periodically. The ODS then
sends the data to the EDW, where it is stored and used.
12. (a). (i). What is a frequency distribution? Customers who have purchased a
particular product rated the usability of the product on a 10-pont scale, ranging from
1(poor) to 10(excellent) as follows:
3 7 2 7 8
3 1 4 10 3
2 5 5 8
2 7 3 6 7
8 9 7 3 6
Construct a frequency distribution for the above data.

Solution:

A frequency distribution is a table that displays the number of occurrences


(frequency) of each unique value in a dataset. It helps understand how the data is distributed
across values.

Step 1: Organize the raw data:

All values:
3, 7, 2, 7, 8, 3, 1, 4, 10, 3, 2, 5, 5, 8, 2, 7, 3, 6, 7, 8, 9, 7, 3, 6

Step 2: Count frequency of each value (1 to 10):


Value (Rating) Frequency
1 1
2 3
3 5
4 1
5 2
6 2
7 5
8 3
9 1
10 1
12.(a).(ii). What is a relative frequency distribution? The GRE scores for a group of
graduate school applicants are distributed as follows:
GRE Score Frequency
725–749 1
700–724 3
675–699 14
650–674 30
625–649 34
600–624 42
575–599 30
550–574 27
525–549 13
500–524 4
475–499 2
Total 200
Explain the procedure to convert a frequency distribution into a relative frequency
distribution and convert the data presented in the above table to a relative
frequency distribution. Round numbers to two digits to the right of the decimal point.
Solution:

A relative frequency distribution shows the proportion of total observations that fall
within each class interval. It is calculated by:

Relative Frequency = (Class Frequency) / (Total Frequency)

Procedure to Convert to Relative Frequency Distribution:

1. Identify the total frequency, which is the sum of all class frequencies (already given
as 200).
2. For each class, divide its frequency by the total frequency.
3. Round off the relative frequency to two decimal places.

Relative Frequency Table:

GRE Score Frequency Relative Frequency


725–749 1 1 / 200 = 0.01
700–724 3 3 / 200 = 0.02
675–699 14 14 / 200 = 0.07
650–674 30 30 / 200 = 0.15
625–649 34 34 / 200 = 0.17
600–624 42 42 / 200 = 0.21
575–599 30 30 / 200 = 0.15
550–574 27 27 / 200 = 0.14
525–549 13 13 / 200 = 0.07
500–524 4 4 / 200 = 0.02
475–499 2 2 / 200 = 0.01
Total 200 1.00

(Or)

12. (b)(i) What is Z-score? Outline the steps to obtain a Z-score.

Answer:

A Z-score is a statistical measure that tells you how many standard deviations a data
point is from the mean. Formula:

Z = (X - μ) / σ

Where:

 X = individual score
 μ = mean

 σ = standard deviation

Steps to Calculate Z-score:

1. Obtain the raw score (X) for which Z-score is to be calculated.


2. Determine the mean (μ) of the data.

3. Determine the standard deviation (σ) of the data.

4. Substitute the values into the Z-score formula.


5. Calculate the result to find how far the value is from the mean.
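
A two-line Python illustration of these steps (using the IQ figures from part (ii) below):

Python
X, mu, sigma = 135, 100, 15   # raw score, mean, standard deviation
z = (X - mu) / sigma
print(round(z, 2))            # 2.33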

12.(b).(ii). Express each of the following scores as a z-score: first, Mary's
intelligence quotient is 135, given a mean of 100 and a standard deviation of 15;
second, Mary obtained a score of 470 in the competitive examination conducted in
April 2022, given a mean of 500 and a standard deviation of 100.

Solution:

A z-score indicates how many standard deviations an element is from the mean. It is
calculated using the formula:

z = (X - μ) / σ

Where:

 X = individual score
 μ = mean
 σ = standard deviation

First score (intelligence quotient): z = (135 - 100) / 15 = 35 / 15 ≈ 2.33
Mary's IQ of 135 lies about 2.33 standard deviations above the mean.

Second score (competitive examination): z = (470 - 500) / 100 = -30 / 100 = -0.30
Mary's examination score of 470 lies 0.30 standard deviations below the mean.
13 (a).Calculate the correlation coefficient for the heights (in inches) of fathers (x) and their
sons (y) with the data presented below:

x 66, 68, 68, 70, 71, 72, 72

y 68, 70, 69, 72, 72, 72, 74

Answer:

Use Pearson’s correlation coefficient formula:

Step 1: Prepare a table:

x      y      x²       y²       xy
66     68     4356     4624     4488
68     70     4624     4900     4760
68     69     4624     4761     4692
70     72     4900     5184     5040
71     72     5041     5184     5112
72     72     5184     5184     5184
72     74     5184     5476     5328
487    497    33,913   35,313   34,604   (totals)

Step 2: Apply the formula:

r = [nΣxy - (Σx)(Σy)] / √{[nΣx² - (Σx)²][nΣy² - (Σy)²]}
  = [7(34,604) - (487)(497)] / √{[7(33,913) - 487²][7(35,313) - 497²]}
  = (242,228 - 242,039) / √[(237,391 - 237,169)(247,191 - 247,009)]
  = 189 / √(222 × 182)
  = 189 / √40,404
  = 189 / 201.01
  ≈ 0.94

The correlation coefficient r ≈ 0.94 indicates a strong positive linear relationship
between the heights of fathers and their sons.
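
A quick NumPy cross-check of this result:

Python
import numpy as np

x = np.array([66, 68, 68, 70, 71, 72, 72])
y = np.array([68, 70, 69, 72, 72, 72, 74])

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
print(round(r, 2))            # approximately 0.94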
(Or)

13 (b).The values of xxx and their corresponding values of y are presented


below.
x 0.5 1.5 2.5 3.5 4.5 5.5 6.5

y 2.5 3.5 5.5 4.5 6.5 8.5 10.5

(i) Find the least squares regression line y = ax+b


(ii) Estimate the value of y when x =10

Answer:

Step 1: Compute the required sums:

∑x = 24.5, ∑y = 41.5, ∑xy = 180.25, ∑x² = 113.75, n = 7

Step 2: Compute the slope (a) and intercept (b) of y = ax + b:

a = [n∑xy - (∑x)(∑y)] / [n∑x² - (∑x)²]
  = [7(180.25) - (24.5)(41.5)] / [7(113.75) - (24.5)²]
  = (1261.75 - 1016.75) / (796.25 - 600.25)
  = 245 / 196 = 1.25

b = (∑y - a∑x) / n = (41.5 - 1.25 × 24.5) / 7 = (41.5 - 30.625) / 7 = 10.875 / 7 ≈ 1.55

(i) Least squares regression line: y = 1.25x + 1.55

(ii) Estimated value of y when x = 10: y = 1.25(10) + 1.55 = 12.5 + 1.55 ≈ 14.05
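
A short NumPy check of this fit:

Python
import numpy as np

x = np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5])
y = np.array([2.5, 3.5, 5.5, 4.5, 6.5, 8.5, 10.5])

a, b = np.polyfit(x, y, 1)   # degree-1 fit returns slope, then intercept
print(a, b)                  # approximately 1.25 and 1.55
print(a * 10 + b)            # prediction at x = 10, about 14.05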

14 (a).What is an aggregate function? Elaborate about the aggregate functions in


NumPy. (13 Marks)

Answer:

Definition:

Aggregate functions perform a calculation on a set of values and return a single result.

In NumPy, common aggregate functions include:

 np.sum() – Sum of array elements


 np.mean() – Mean of array elements

 np.median() – Median value

 np.std() – Standard deviation

 np.var() – Variance

 np.min() / np.max() – Minimum / Maximum values

Example:

import numpy as np
arr = np.array([1, 2, 3, 4])
print(np.sum(arr)) # Output: 10
print(np.mean(arr)) # Output: 2.5
(Or)
14 (b). (i) What is broadcasting? Explain the rules of broadcasting with an example. (7
marks)

Answer:

Broadcasting in NumPy is a mechanism that allows NumPy to work with arrays of


different shapes when performing arithmetic operations. It effectively "stretches" the smaller
array to match the shape of the larger array without actually creating copies of the data.

Rules of broadcasting:

1. If the arrays do not have the same number of dimensions, the shape of the smaller
array is padded with ones on its left side.

2. If the sizes of the dimensions do not match, the array with size 1 in that dimension
is stretched to match the other array.

3. If in any dimension the sizes are unequal and neither is 1, an error is raised.

Example:

Python
import numpy as np

a = np.array([1, 2, 3])   # shape (3,)
b = 5                     # scalar
print(a + b)              # [6 7 8]

Here the scalar 5 is conceptually broadcast to [5, 5, 5] to match the array's shape before
the element-wise addition.
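
A further sketch showing rules 1 and 2 with two-dimensional shapes (the arrays are
illustrative):

Python
import numpy as np

M = np.ones((3, 3))             # shape (3, 3)
a = np.arange(3)                # shape (3,) -> padded to (1, 3), stretched to (3, 3)
print(M + a)                    # each row of M gets [0, 1, 2] added

b = np.arange(3).reshape(3, 1)  # shape (3, 1)
print(a + b)                    # (3,) and (3, 1) broadcast to a (3, 3) result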

14.(b).(ii) Elaborate about the mapping between Python operators and Pandas methods.
(6 marks)

Answer:

Mapping between Python operators and Pandas methods


Pandas DataFrames and Series are built on top of NumPy arrays, and they often overload
standard Python operators to perform element-wise operations, similar to how NumPy does.

Pandas provides method equivalents for Python operators:

Operator    Equivalent Method
+           add()
-           sub()
*           mul()
/           div()
**          pow()

Example:

Python
df1.add(df2)   # same as df1 + df2

The + operator maps to the add() method (e.g., df1 + df2 is equivalent to df1.add(df2)).
The - operator maps to the sub() method (e.g., s1 - s2 is equivalent to s1.sub(s2)).
The * operator maps to the mul() method (e.g., df * 2 is equivalent to df.mul(2)).
The / operator maps to the div() method (e.g., s / 3 is equivalent to s.div(3)).
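
One practical reason to use the methods instead of the bare operators is that they accept
extra arguments such as fill_value. A small sketch (the DataFrames here are illustrative):

Python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'A': [10, 20]}, index=[1, 2])

print(df1 + df2)                    # non-overlapping index labels become NaN
print(df1.add(df2, fill_value=0))   # missing entries are treated as 0 instead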

15 (a). Explain various visualization charts like line plots, scatter plots, and histograms
using Matplotlib with examples. (13 marks)

Answer:

Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python.

1. Line Plot:

Used for showing trends over time (or across any ordered variable).

Python
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [2, 4, 1])
plt.show()

2. Scatter Plot:

Used for showing the relationship between two variables.

Python
plt.scatter([1, 2, 3], [4, 5, 6])
plt.show()

3. Histogram:

Used to visualize the distribution of a dataset.

Python
plt.hist([1, 1, 2, 3, 3, 4, 4, 4])
plt.show()

(Or)

15 (b). Outline any two three-dimensional plotting in Matplotlib with an example. (13
marks)

Answer:

Use mpl_toolkits.mplot3d:

3D Scatter Plot:

from mpl_toolkits.mplot3d import Axes3D

import matplotlib.pyplot as plt

import numpy as np

fig = plt.figure()

ax = fig.add_subplot(111, projection='3d')

x = np.random.rand(100)

y = np.random.rand(100)

z = np.random.rand(100)

ax.scatter(x, y, z)

plt.show()

3D Surface Plot:

from mpl_toolkits.mplot3d import Axes3D


import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)
R = np.sqrt(X**2 + Y**2)
Z = np.sin(R)
ax.plot_surface(X, Y, Z)
plt.show()

PART – C (1 × 15 = 15)

16. (a).(i) What is mode? Can there be distributions with no mode or more than
one mode? The owner of a new car conducts six gas mileage tests and obtains the
following results, expressed in miles per gallon: 26.3, 28.7, 27.4, 26.9, 27.4, 26.9.
Find the mode for these data.

Answer:
Concepts:
Mode, Statistics, Data analysis

Explanation:
The mode of a set of data is the value that appears most frequently. A distribution can
have no mode if all values occur with the same frequency, or it can have more than one mode
if multiple values occur with the highest frequency. In this case, we will find the mode of the
given gas mileage test results.

Step by Step Solution:


Step 1
List the gas mileage results: 26.3, 28.7, 27.4, 26.9, 27.4, 26.9.

Step 2
Count the frequency of each result: 26.3 appears 1 time, 28.7 appears 1 time, 27.4
appears 2 times, and 26.9 appears 2 times.

Step 3
Identify the highest frequency: The highest frequency is 2, which corresponds to the
values 27.4 and 26.9.

Step 4
Since both 27.4 and 26.9 appear most frequently, the data set is bimodal with modes
27.4 and 26.9.

Final Answer:
The modes are 27.4 and 26.9.
16.(a).(ii) What is the mode? Can there be a distribution with no mode or more than
one mode? The owner of a new car conducts six gas mileage tests and obtains the
following results, expressed in miles per gallon: 26.3, 28.7, 27.4, 26.6, 27.4, 26.9.
Find the mode for these data.
Answer:

Concepts:
Median, Mode

Explanation:
The median is the middle value in a list of numbers. To find the median, you need to arrange
the numbers in ascending order and then find the middle value. If there is an even number of
observations, the median is the average of the two middle numbers. The mode is the value
that appears most frequently in a data set. A distribution can have no mode, one mode, or
more than one mode.

Step by Step Solution:


Step 1
Arrange the first set of five scores 2,8,2,7,6 in ascending order: 2,2,6,7,8.

Step 2
Since there are 5 scores (an odd number), the median is the middle score: 6.

Step 3
Arrange the second set of six scores 3,8,9,3,1,8 in ascending order: 1,3,3,8,8,9.

Step 4
Since there are 6 scores (an even number), the median is the average of the two
middle scores: (3+8)/2=5.5.

Step 5
The median for the first set of scores is 6.

Step 6
The median for the second set of scores is 5.5.

Step 7
To find the mode of the gas mileage data 26.3,28.7,27.4,26.6,27.4,26.9, identify the
value(s) that appear most frequently.

Step 8
The value 27.4 appears twice, while all other values appear only once.

Step 9
Therefore, the mode for the gas mileage data is 27.4.
Final Answer:
The median for the first set of scores is 6. The median for the second set of scores is
5.5. The mode for the gas mileage data is 27.4.

B.E/B.Tech. DEGREE EXAMINATIONS, APRIL/MAY 2024

CS 3352 – FOUNDATIONS OF DATA SCIENCE

Answer ALL questions.


PART A-(10 x 2 = 20 marks)

1. How missing values present in a dataset are treated during data analysis phase?

Missing values in a dataset during the data analysis phase are typically handled
through various techniques such as imputation (replacing missing values with estimated
values like mean, median, mode, or using more advanced methods like K-Nearest Neighbors
imputation or regression imputation), deletion (removing rows or columns with missing
values), or using models that can inherently handle missing data.

2. Identify and write down various data analytic challenges faced in the conventional
system.
Conventional data analytic systems often face challenges such as data heterogeneity
(dealing with diverse data types and sources), data volume and velocity (handling large and
rapidly changing datasets), data quality issues (inconsistencies, errors, missing data), limited
scalability, and lack of real-time processing capabilities.

3. Will treating categorical variables as continuous variables result in a better predictive


model? Justify your answer.

No, treating categorical variables as continuous variables will not result in a better predictive
model. This approach can lead to several problems:

Treating categorical variables as continuous can negatively impact a predictive


model's performance because it imposes an artificial order and numerical relationship where
none exists, leading to misleading interpretations and inaccurate predictions. For example,
assigning numerical values like 1, 2, 3 to categories like "Red," "Green," "Blue" implies that
"Green" is "more" than "Red," which is not true. Proper handling involves one-hot encoding
or other categorical encoding techniques.

4. Issue: Feeding data which has variables correlated to one another is not a good
statistical practice, since we are providing multiple weightage to the same type of data.

Solution: Correlation Analysis.


Show how such issues are prevented by correlation analysis technique. Justify with a
small instance dataset.

Correlation analysis helps identify and quantify the relationships between variables.
By identifying highly correlated variables, one can choose to remove redundant variables
(e.g., keeping only one from a highly correlated pair) or use dimensionality reduction
techniques like Principal Component Analysis (PCA) to create new, uncorrelated features,
preventing multiple weightage to the same type of data. For instance, in a dataset with
"Height in cm" and "Height in inches," removing one prevents redundancy.

5. State the purpose of adding additional quantitative and/or categorical explanatory


variables to any developed linear regression model. Justify with an example.

Adding additional quantitative and/or categorical explanatory variables to a linear


regression model aims to improve the model's predictive power and provide a more
comprehensive understanding of the relationship between variables. More variables can
explain more variance in the dependent variable, leading to a better fit and more accurate
predictions. For example, predicting house prices using only "area" might be less accurate
than including "number of bedrooms" and "location" as well.

6. Give an example of a data set with a non-Gaussian distribution.

A classic example of a dataset with a non-Gaussian (non-normal) distribution is


household income data.

 Distribution Shape: Income data is typically positively skewed (or right-skewed).


 Reason: A large majority of people have incomes clustered around the lower and
middle-income brackets, while a small number of individuals have extremely high
incomes. This creates a long tail on the right side of the distribution, which is not
characteristic of a symmetrical Gaussian distribution.

7. Under what circumstances, the pivot_table() in pandas is used?

The pivot_table() function in pandas is used to create a spreadsheet-style pivot table


as a DataFrame. It is particularly useful for summarizing and aggregating data, allowing you
to transform data from a "long" format to a "wide" format, performing calculations like sums,
averages, or counts based on multiple categorical variables.

8. Using appropriate data visualization modules develop a python code snippet that
generates a simple sinusoidal wave in an empty gridded axes?

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)
fig, ax = plt.subplots()
ax.plot(x, y)
ax.grid(True)   # gridded axes, as asked in the question
ax.set_title("Simple Sinusoidal Wave")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
plt.show()

9. Write a python code snippet that generates a time-series graph representing COVID-
19 incidence cases for a particular week.

Day 1   Day 2   Day 3   Day 4   Day 5   Day 6   Day 7
7       18      9       44      2       5       89

Python

import matplotlib.pyplot as plt


days = ["Day 1", "Day 2", "Day 3", "Day 4", "Day 5", "Day 6", "Day 7"]
cases = [7, 18, 9, 44, 2, 5, 89]
plt.plot(days, cases, marker='o')
plt.title("COVID-19 Incidence Cases for a Particular Week")
plt.xlabel("Day")
plt.ylabel("Number of Cases")
plt.grid(True)
plt.show()

10. Write a python code snippet that draws a histogram for the following list of positive
numbers. 7 18 9 44 25 89 91 11 6 77 85 91 6 55

Python:
import matplotlib.pyplot as plt
# List of positive numbers
data = [7, 18, 9, 44, 25, 89, 91, 11, 6, 77, 85, 91, 6, 55]
# Create a histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=10, edgecolor='black', alpha=0.7)
# Set labels, title, and grid
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of Positive Numbers')
plt.grid(axis='y', alpha=0.75)
# Display the plot
plt.show()

PART – B (5 × 13 = 65)

11. (a) (i) Suppose there is a dataset having variables with missing values of more than
30%, how will you deal with such dataset? (6)
When dealing with a dataset where variables have more than 30% missing values, it's
crucial to address this issue effectively to avoid biased or inaccurate results. Here's how you
can approach it:
Imputation Techniques:
Mean/Median/Mode Imputation:
Replace missing values with the mean, median, or mode of the respective variable.
This is a simple method but can reduce variance and distort relationships.
Regression Imputation:
Predict missing values using a regression model based on other variables in the
dataset. This can preserve relationships better than simple imputation.
K-Nearest Neighbors (KNN) Imputation:
Impute missing values by finding the K-nearest neighbors to the observation with
missing data and using their values to estimate the missing ones.
Multiple Imputation:
Generate multiple plausible imputed datasets, analyze each, and combine the results.
This accounts for the uncertainty introduced by imputation and provides more robust
estimates.
Deletion Methods:
Row-wise Deletion (Listwise Deletion):
Remove entire rows containing any missing values. This is simple but can lead to
significant data loss if many rows have missing data, especially with over 30% missing
values.
Column-wise Deletion:
Remove entire columns (variables) that have a high percentage of missing values.
This is often necessary when a variable is largely incomplete and provides little information.
Advanced Techniques:
Machine Learning-based Imputation:
Use more sophisticated machine learning algorithms like Random Forest or deep learning
models to predict and impute missing values.
Domain Knowledge:
Consult domain experts to understand the reasons for missing data and potentially fill
in gaps based on their expertise.

11.(a).(ii) List down the various feature selection methods for selecting the right
variables for building efficient predictive models. Explain about any two selection
methods. (7)

Feature selection is the process of choosing a subset of relevant features (variables) to


use in model construction. This helps in building more efficient and robust predictive models.
Various Feature Selection Methods:
Filter Methods:
Select features based on their statistical properties (e.g., correlation, chi-square,
information gain) independently of the chosen machine learning algorithm.
Wrapper Methods:
Use a specific machine learning algorithm to evaluate subsets of features and select
the best performing one (e.g., Recursive Feature Elimination, Forward Selection).
Embedded Methods:
Perform feature selection as part of the model training process (e.g., Lasso regularization,
Tree-based feature importance).
Explanation of Two Selection Methods:
Filter Methods (e.g., Correlation-based Feature Selection):
These methods assess the relevance of features based on their relationship with the
target variable or their inter-correlation, without involving a machine learning model.
Example (Correlation):
You can calculate the Pearson correlation coefficient between each independent
variable and the dependent variable. Features with a high absolute correlation value are
considered more relevant. You can also look at the correlation between independent variables
to identify and remove highly correlated features, reducing multicollinearity.
Advantages:
Computationally efficient, less prone to overfitting, and can be used as a
pre-processing step.
Wrapper Methods (e.g., Recursive Feature Elimination - RFE):
These methods use a specific machine learning algorithm to evaluate different subsets
of features and select the one that yields the best model performance.
Example (RFE):
RFE works by iteratively training the model and removing the least important features
until the desired number of features is reached or a performance criterion is met. For instance,
with a Support Vector Machine (SVM), RFE would repeatedly train the SVM, rank features
by their weights, and eliminate the lowest-ranked ones.
Advantages:
Can find optimal feature subsets for a given model, often leading to better predictive
performance.
Disadvantages:
Computationally intensive, as it involves training the model multiple times.
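
A minimal RFE sketch using scikit-learn (the synthetic data below is purely illustrative):

Python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 100 samples, 5 features, only some of them informative
X, y = make_classification(n_samples=100, n_features=5, n_informative=2, random_state=0)

selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X, y)

print(selector.support_)   # True for the selected features
print(selector.ranking_)   # rank 1 = selected; higher ranks were eliminated earlier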

(Or)

11. (b) (i) Explain Data Analytic life cycle. Brief about Time-Series Analysis.

Data Analytic Life Cycle: The data analytic life cycle is a systematic process for
conducting data analysis projects. While different models exist, a common cycle includes the
following stages:

 Problem Definition: Clearly defining the business problem or research question to be


answered.
 Data Collection: Gathering all the necessary data from various sources, both internal
and external.

 Data Cleaning & Preprocessing: Handling missing values, outliers, and


inconsistencies to ensure data quality.

 Exploratory Data Analysis (EDA): Using statistical and visualization techniques to


understand the data's characteristics and relationships.

 Modeling: Building and training predictive or descriptive models using appropriate


algorithms.

 Validation & Interpretation: Evaluating the model's performance and interpreting


the results to derive insights.
 Deployment & Monitoring: Implementing the model in a production environment
and continuously monitoring its performance.

Time-Series Analysis: This is a specific type of data analysis that involves analyzing
data points collected over a period of time. The main goal is to understand the underlying
structure of the data and to forecast future values. Key components of a time series include:

 Trend: A long-term upward or downward movement in the data.


 Seasonality: A recurring pattern or cycle in the data (e.g., sales increasing every
holiday season).

 Cyclical Component: Fluctuations that are not of a fixed period (e.g., business
cycles).

 Irregular Component: Random, unpredictable variations.
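
A short sketch that separates these components with statsmodels' seasonal_decompose
(assuming a recent statsmodels version; the monthly series below is synthetic):

Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: upward trend + yearly seasonality + noise
idx = pd.date_range('2020-01-01', periods=48, freq='MS')
values = np.arange(48) + 10 * np.sin(2 * np.pi * np.arange(48) / 12) + np.random.normal(0, 1, 48)
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model='additive', period=12)
result.plot()   # observed, trend, seasonal, and residual components
plt.show()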

11.(b).(ii) Outline the purpose of data cleansing. How missing and nullified data
attributes are handled and modified during preprocessing stage? (7)

Purpose of Data Cleansing: Data cleansing, also known as data preprocessing or data
scrubbing, is the process of detecting and correcting (or removing) corrupt or inaccurate
records from a dataset. Its primary purpose is to improve data quality so that analysis and
modeling can be performed effectively and reliably. Without proper data cleansing, a
"garbage in, garbage out" scenario occurs, where flawed data leads to flawed insights and
poor model performance.

Handling Missing and Nullified Data: During the preprocessing stage, missing and
nullified data attributes are handled using several methods:

 Imputation: As mentioned earlier, this involves filling in the missing values. The
choice of imputation method (mean, median, mode, regression, etc.) depends on the
nature of the data and the extent of the missing values.
 Deletion:

1. Row-wise Deletion: Removing an entire row (observation) if a significant


number of its values are missing. This is suitable when the percentage of
missing values is small.

2. Column-wise Deletion: Removing a variable (column) if it has a very high


percentage of missing values (e.g., more than 30%). This is often a necessary
step to avoid an unstable model.

 Flagging: Creating a new binary variable (e.g., is_age_missing) to indicate that the
original age value was missing. This can sometimes be useful if the fact that a value is
missing is itself a piece of information.

12. (a) (i) Indicate whether each of the following distributions is positively or negatively
skewed. The distribution of
(1) Incomes of tax payers have a mean of $48,000 and a median of $43,000. (3)

(2) GPAs for all students at some college have a mean of 3.01 and a median of 3.20. (3)

Answer:

Incomes of tax payers have a mean of $48,000 and a median of $43,000.

 A distribution is positively skewed when the mean is greater than the median. Since
$48,000 > $43,000, this distribution is

positively skewed.

GPAs for all students at some college have a mean of 3.01 and a median of 3.20.

 A distribution is negatively skewed when the mean is less than the median. Since 3.01
< 3.20, this distribution is

negatively skewed.

12.(a).(ii) During their first swim through a water maze, 15 laboratory rats made the
following number of errors (blind alleyway entrances): 2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12,
10, 4, 3.

(1) Find the mode, median, and mean for these data. (3)
(2) Without constructing a frequency distribution or graph, would it be possible
to characterize the shape of this distribution as balanced, positively skewed, or
negatively skewed? (4)

Answer:

1. Find the mode, median, and mean for these data.

 First, sort the data: 2, 2, 3, 3, 4, 5, 5, 5, 6, 7, 8, 10, 12, 17, 28.


 Mode: The mode is the value that appears most frequently. The value 5 appears three
times, more than any other value.

 Median: The median is the middle value of the sorted data. With 15 data points, the
8th value is the median, which is 5.

 Mean: The mean is the sum of all values divided by the number of values. The sum is
117, and there are 15 values, so the mean is 117/15 = 7.8.

2.Without constructing a frequency distribution or graph, would it be possible to


characterize the shape of this distribution as balanced, positively skewed, or negatively
skewed?

 Yes, it is possible. The shape can be characterized by comparing the mean and the
median.
 The mean (7.8) is greater than the median (5), which indicates a

positively skewed distribution. The presence of larger values like 17 and 28 pull the
mean to the right.

(Or)

12.(b) (i) Assume that SAT math scores approximate a normal curve with a mean of 500
and a standard deviation of 100. Sketch a normal curve and shade in the target area(s)
described by each of the following statements:

 More than 570


 Less than 515
 Between 520 and 540
 Convert to z scores and find the target areas specific to the above values.

Answer:

Sketch a normal curve and shade in the target area(s) described by each of the following
statements:

 More than 570: The shaded area should be to the right of the value 570 on the curve.
 Less than 515: The shaded area should be to the left of the value 515 on the curve.

 Between 520 and 540: The shaded area should be between the values 520 and 540 on
the curve.

Convert to z scores and find the target areas specific to the above values.

 The formula for the z-score is Z=(X−µ)/σ


 More than 570:

o Z=(570−500)/100=0.70

o Using a standard normal distribution table, the area to the left of Z=0.70 is
approximately 0.7580.

o The area to the right (more than 570) is 1−0.7580=0.2420.

 Less than 515:

o Z=(515−500)/100=0.15

o Using a standard normal distribution table, the area to the left of Z=0.15 is
approximately 0.5596.

o The area is 0.5596.

 Between 520 and 540:


o Z_1=(520−500)/100=0.20

o Z_2=(540−500)/100=0.40

o Area to the left of Z=0.40 is approximately 0.6554.

o Area to the left of Z=0.20 is approximately 0.5793.

o The area between them is 0.6554−0.5793=0.0761.
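
These table look-ups can be cross-checked with scipy.stats; a quick sketch:

Python
from scipy.stats import norm

print(round(1 - norm.cdf(0.70), 4))               # area above z = 0.70, about 0.2420
print(round(norm.cdf(0.15), 4))                   # area below z = 0.15, about 0.5596
print(round(norm.cdf(0.40) - norm.cdf(0.20), 4))  # area between z = 0.20 and z = 0.40, about 0.076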

12.(b).(ii) Assume that the burning times of electric light bulbs approximate a normal
curve with a mean of 1200 hours and a standard deviation of 120 hours. If a large
number of new lights are installed at the same time (possibly along a newly opened
freeway), at what time will

 1 percent fails?
 50 percent fail?
 95 percent fail?

Answer:

 1 percent fails?
o We need to find the z-score corresponding to the bottom 1% (0.01 area to
the left).

o From a z-table, the z-score is approximately -2.33.

o X=µ+Zσ=1200+(−2.33)(120)=1200−279.6=920.4 hours.

 50 percent fail?

o The 50% mark corresponds to the mean in a normal distribution.

o The z-score is 0.

o X=µ=1200 hours.

 95 percent fail?

o We need to find the z-score corresponding to the bottom 95% (0.95 area to
the left).

o From a z-table, the z-score is approximately 1.645.

o X=µ+Zσ=1200+(1.645)(120)=1200+197.4=1397.4 hours.
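
The same failure times can be obtained with the inverse normal (percent-point) function;
a quick check:

Python
from scipy.stats import norm

mu, sigma = 1200, 120
for p in (0.01, 0.50, 0.95):
    print(p, round(norm.ppf(p, loc=mu, scale=sigma), 1))
# about 920.8, 1200.0 and 1397.4 hours (the hand calculation used the rounded z = -2.33, giving 920.4)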

13. (a) (i)In Statistics, highlight the impact when the goodness of fit test score is low?
A low goodness-of-fit score in a statistical test indicates that the observed data
significantly deviates from the expected data based on the model. This suggests that the
model is a poor fit for the data, and the results of any analysis based on that model may be
unreliable.

Understanding Goodness of Fit:

 Goodness-of-fit tests, like the chi-square test, assess how well observed data matches
an expected distribution.
 A high goodness-of-fit (for example, a high p-value in a chi-square test or a high R² in
regression) indicates a close match, suggesting the model accurately represents the data.
 A low goodness-of-fit (a low p-value or a low fit index) indicates a poor fit, implying
the model is not accurately capturing the observed data.
Impact of a Low Goodness-of-Fit Score:

 Inaccurate Predictions: If the model doesn't fit the data well, predictions made using it
are likely to be inaccurate and unreliable.
 Misleading Conclusions: Any conclusions drawn from the model's results may be flawed
due to the poor fit, potentially leading to incorrect interpretations of the data.
 Need for Model Refinement: A low goodness-of-fit score signals the need to revise or
refine the model, potentially by including additional variables, changing the model's
structure, or choosing a different model altogether.
 Invalid Hypothesis Tests: If the model is used in hypothesis testing, a low goodness-of-fit
score may invalidate the test results, as the assumptions underlying the test are not met.
 Potential for Bias: If the model is used to make predictions or classifications, a poor fit
can introduce bias into the results.

13.(a).(ii) Given the following dataset of employee, Using regression analysis, find the
expected salary of an employee if the age is 45.

Age Salary

54 67000

42 43000
49 55000

57 71000

35 25000
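
Applying the least squares formulas to the five (Age, Salary) pairs: slope
b = [n∑xy - (∑x)(∑y)] / [n∑x² - (∑x)²] = (5 × 13,041,000 - 237 × 261,000) / (5 × 11,555 - 237²)
= 3,348,000 / 1,606 ≈ 2084.68, and intercept a = ȳ - b(x̄) = 52,200 - 2084.68 × 47.4 ≈ -46,614.
The expected salary at age 45 is therefore approximately -46,614 + 2084.68 × 45 ≈ 47,197
(about 47,200). A short NumPy check of this calculation:

Python
import numpy as np

age = np.array([54, 42, 49, 57, 35])
salary = np.array([67000, 43000, 55000, 71000, 25000])

b, a = np.polyfit(age, salary, 1)   # slope, intercept
print(round(b, 2), round(a, 2))     # slope ≈ 2084.68, intercept ≈ -46613.95
print(round(a + b * 45))            # expected salary at age 45, about 47197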

(Or)
13. (b) (i) Define autocorrelation and how is it calculated? What does the negative
correlation convey?

 Autocorrelation is the correlation of a time series with a delayed version of itself. It


measures the similarity between observations of a series as a function of the time lag
between them.
 Calculation: Autocorrelation is calculated using a formula similar to the standard
correlation coefficient, but it compares a series's values at time t with its values at
time t−k, where k is the time lag.
 Negative Correlation: A negative autocorrelation suggests that a high value in the
series at one point in time is likely to be followed by a low value at a later point in
time, and vice versa. It indicates an oscillating or cyclical pattern in the data.
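
A small pandas illustration (the series below is synthetic):

Python
import pandas as pd

# An alternating series: high values tend to be followed by low values
s = pd.Series([10, 2, 9, 1, 11, 3, 10, 2])

print(round(s.autocorr(lag=1), 2))   # strongly negative lag-1 autocorrelation
print(round(s.autocorr(lag=2), 2))   # positive at lag 2, since every other value is similar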

13. (b) (ii) What is the philosophy of Logistic regression? What kind of model it is?
What does logistic Regression predict? Tabulate the cardinal differences of Linear and
Logistic Regression.

 Philosophy: The philosophy of Logistic Regression is to model the probability of a


categorical outcome occurring. It uses a logistic (sigmoid) function to map any real
number to a value between 0 and 1, which represents a probability.
 Model Type: It is a classification model.
 Prediction: Logistic Regression predicts the probability that an observation belongs
to a particular class or category.
 Cardinal Differences between Linear and Logistic Regression:

Feature               Linear Regression                            Logistic Regression
Purpose               Predict a continuous dependent variable      Predict a categorical dependent variable
Output                A continuous value (e.g., salary,            A probability value between 0 and 1
                      temperature)
Underlying Function   Linear function (y = b0 + b1x)               Logistic (sigmoid) function
Relationship          Assumes a linear relationship between the    Does not assume a linear relationship;
                      independent and dependent variables          it models the log-odds
Model Type            Regression model                             Classification model
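
A minimal scikit-learn sketch showing that logistic regression outputs class probabilities
(the hours-studied versus pass/fail data is invented for illustration):

Python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hours studied vs. outcome: fail (0) or pass (1)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

print(model.predict_proba([[4.5]]))   # probabilities of fail and of pass
print(model.predict([[4.5]]))         # predicted class (0 or 1)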

14. (a) Define Dictionary in Python. Do the following operations on dictionaries.

(i) Initialize two dictionaries (D1 and D2) with key and value pairs.

(ii) Compare those two dictionaries with a master key list 'M' and print the missing keys.

(iii) Find keys that are in D1 but NOT in D2.

(iv) Merge D1 and D2 and create D3 using expressions.


A dictionary in Python is an unordered collection of data values, used to store data in
key:value pairs. Dictionaries are optimized for retrieving values when the key is known. They
are written with curly brackets {} and have keys that must be unique and immutable, while
values can be of any data type and can be duplicated.

(i) Initialization of two dictionaries:

Python

D1 = {'apple': 1, 'banana': 2, 'cherry': 3, 'date': 4}

D2 = {'banana': 5, 'grape': 6, 'kiwi': 7, 'date': 8}

(ii) Comparison with a master key list 'M' and printing missing keys:

Python

M = ['apple', 'banana', 'cherry', 'date', 'elderberry', 'fig']

missing_keys_D1 = [key for key in M if key not in D1]

missing_keys_D2 = [key for key in M if key not in D2]

print(f"Missing keys in D1 from master list M: {missing_keys_D1}")

print(f"Missing keys in D2 from master list M: {missing_keys_D2}")

(iii) Finding keys that are in D1 but NOT in D2:

Python

keys_in_D1_not_in_D2 = [key for key in D1 if key not in D2]

print(f"Keys in D1 but not in D2: {keys_in_D1_not_in_D2}")

(iv) Merging D1 and D2 to create a new dictionary D3 using expressions:

Python

D3 = {**D1, **D2}

print(f"Merged dictionary D3: {D3}")

(Or)

14.(b) (i) How to create hierarchical data from the existing data frame?

(ii) How to use group by with 2 columns in data set? Give a python code snippet.

Answer:

(i) create hierarchical data from the existing data frame:


To create hierarchical data (MultiIndex) from an existing DataFrame in Python using
the Pandas library, you can use the set_index method to set multiple columns as the index.

Steps:

1. Import Pandas:

Start by importing the pandas library.

2. Create a DataFrame:

Create a sample DataFrame with the data you want to make hierarchical.

3. Set Multiple Columns as Index:

Use df.set_index(['column1', 'column2']) to set the desired columns as the hierarchical


index.

import pandas as pd

# Sample DataFrame

data = {'City': ['New York', 'New York', 'London', 'London'],

'Year': [2020, 2021, 2020, 2021],

'Population': [8.4, 8.5, 8.9, 9.0]}

df = pd.DataFrame(data)

# Create hierarchical data by setting 'City' and 'Year' as the index

hierarchical_df = df.set_index(['City', 'Year'])

print(hierarchical_df)

(ii). use group by with 2 columns in data set

The groupby() method in pandas is used to group data based on one or more columns.
To group by two columns, you pass a list of the column names to the groupby() method.

For example, to find the average temperature for each city and month:

import pandas as pd

# Sample DataFrame

data = {'City': ['New York', 'New York', 'London', 'London', 'New York'],

'Month': ['Jan', 'Feb', 'Jan', 'Feb', 'Jan'],

'Temperature': [1, 2, 5, 6, 0]}


df = pd.DataFrame(data)

# Group by 'City' and 'Month' and calculate the mean temperature

grouped_data = df.groupby(['City', 'Month'])['Temperature'].mean()

print(grouped_data)

15. (a) Write a code snippet that projects our globe as a 2-D flat surface (using a
cylindrical projection) and conveys information about the location of any three major
Indian cities on the map (using a scatter plot).

A Python code snippet to project a 2D cylindrical map of India and plot the locations
of three major cities using a scatter plot is shown below. The code utilizes
the matplotlib and numpy libraries. It defines the coordinates of Mumbai, Delhi, and
Chennai, converts them to cylindrical projection, and then plots them on a 2D plane.

import matplotlib.pyplot as plt

import numpy as np

# Define city coordinates (latitude, longitude in degrees)

cities = {

"Mumbai": (19.0760, 72.8777),

"Delhi": (28.6139, 77.2090),

"Chennai": (13.0827, 80.2707),

# Convert latitude and longitude to radians

def to_radians(degrees):

return np.radians(degrees)

# Cylindrical projection function

def cylindrical_projection(latitude, longitude):

x = longitude

y = np.log(np.tan(np.pi / 4 + latitude / 2)) # Mercator projection formula

return x, y

# Convert city coordinates to projected coordinates


projected_cities = {}

for city, (lat, lon) in cities.items():

lat_rad = to_radians(lat)

lon_rad = to_radians(lon)

x, y = cylindrical_projection(lat_rad, lon_rad)

projected_cities[city] = (x, y)

# Plotting

plt.figure(figsize=(8, 6))

for city, (x, y) in projected_cities.items():

plt.scatter(x, y, label=city)

plt.xlabel("Longitude")

plt.ylabel("Latitude (projected)")

plt.title("Cylindrical Projection of Indian Cities")

plt.legend()

plt.grid(True)

plt.show()

(Or)

15.(b) (i) Write a working code that performs a simple Gaussian process regression
(GPR), using the Scikit-Learn API.

(ii) Briefly explain about visualization with Seaborn. Give an example working code
segment that represents a 2D kernel density plot for any data.

Answer:

(i) Gaussian Process Regression (GPR) with Scikit-Learn


Code demonstrates a simple Gaussian Process Regression using the scikit-
learn library. It generates synthetic data, fits a GaussianProcessRegressor model, and then
makes predictions.
Python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Generate synthetic data


X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.5, X.shape[0])

# Define the kernel for the GPR


kernel = ConstantKernel(1.0, constant_value_bounds="fixed") * RBF(1.0, length_scale_bounds="fixed")

# Initialize and fit the GPR model


gpr = GaussianProcessRegressor(kernel=kernel, alpha=0.1, random_state=0)
gpr.fit(X, y)

# Make predictions
X_test = np.linspace(0, 10, 200).reshape(-1, 1)
y_pred, sigma = gpr.predict(X_test, return_std=True)

# y_pred contains the mean predictions, sigma contains the standard deviations

(ii) Visualization with Seaborn and 2D Kernel Density Plot


Seaborn is a Python data visualization library built on top of Matplotlib. It provides a
high-level interface for drawing attractive and informative statistical graphics. Seaborn
simplifies the creation of complex visualizations by offering functions tailored for common
statistical plot types, often requiring less code than direct Matplotlib usage. It excels at
visualizing relationships between variables, distributions of data, and comparisons across
categories.
The following code segment demonstrates creating a 2D kernel density plot using
Seaborn for any given data.
Python
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Create sample data


np.random.seed(42)
data = pd.DataFrame({
'x': np.random.normal(0, 1, 500),
'y': np.random.normal(0, 1, 500) + np.random.normal(0, 0.5, 500)
})
# Create a 2D kernel density plot
plt.figure(figsize=(8, 6))
sns.kdeplot(data=data, x='x', y='y', fill=True, cmap='viridis', levels=5)
plt.title('2D Kernel Density Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

PART C-(1 x 15 = 15 marks)

16. (a) Given an unsorted multi-index that represents the distance between two cities,
write a Python code snippet using appropriate libraries to find the shortest distance
between any two given cities. The following matrix representation can be used to create
the data frame that serves as input for the prescribed program.

Distance between any two given cities from an unsorted multi-indexed distance
matrix using Python, the pandas and networkx libraries are suitable. pandas can be used to
represent the distance matrix as a DataFrame, and networkx can be used to construct a graph
from this DataFrame and apply graph algorithms like shortest path.

import pandas as pd

import networkx as nx

# 1. Create the distance matrix as a DataFrame

# Assume the following matrix representation for distances between cities A, B, C, D, E

matrix = [

[0, 30, 24, 6, 13],

[16, 0, 19, 5, 10],

[7, 16, 0, 15, 12],

[9, 17, 22, 0, 18],

[21, 8, 9, 11, 0]
]

cities = ['A', 'B', 'C', 'D', 'E']

distance_df = pd.DataFrame(matrix, index=cities, columns=cities)

# 2. Create a graph from the DataFrame

# networkx can create a graph directly from a pandas adjacency matrix;
# create_using=nx.DiGraph() keeps the asymmetric distances (A->B differs from B->A)

G = nx.from_pandas_adjacency(distance_df, create_using=nx.DiGraph())

# 3. Specify the cities for which to find the shortest distance

city1 = 'A'

city2 = 'E'

# 4. Find the shortest path length (distance) between the two cities

# Pass weight='weight' so Dijkstra's algorithm sums the stored distances;
# without it, shortest_path_length would just count the number of hops

shortest_distance = nx.shortest_path_length(G, source=city1, target=city2, weight='weight')

# 5. Print the result

print(f'The shortest distance between {city1} and {city2} is {shortest_distance}.')

16.(b) A URL Server wants to consolidate a history of websites visited by an user 'U'.
Every visited website information is stored in a 2-tuple format viz., (website_id,
Duration_of_visit) in the URL cache. Using split, apply and combine operations, devise
a code snippet that consolidate the website history and find out the website whose
duration of visit is maximum.

Example:

Input: [(4,2), (5,1), (4,3), (1,4), (7,3), (5,2), (1,1), (7,1)]

Output: [(4,5), (5,3), (1,5), (7,4)].

The website with key_id '1' has the max.duration of visit = 5.

Answer:

Split, Apply, and Combine Logic:

1. Split: The input data, a list of tuples, is split based on the website_id.
2. Apply: A function (in this case, summation) is applied to each group to calculate the
total Duration_of_visit for each website_id.

3. Combine: The results from the "apply" step are combined to create a new list or
dictionary representing the consolidated history.

from collections import defaultdict

# Example Input provided in the question

input_data = [(4,2), (5,1), (4,3), (1,4), (7,3), (5,2), (1,1), (7,1)]

# Step 1 & 2: Split and Apply (using defaultdict for grouping and summation)
consolidated_history = defaultdict(int)

for website_id, duration in input_data:

consolidated_history[website_id] += duration

# Step 3: Combine (converting the dictionary to a list of tuples for the specified output
format)

output_list = list(consolidated_history.items())

print("Output:", output_list)

# Find the website with the maximum duration of visit

max_duration = 0

max_website_id = None

for website_id, total_duration in consolidated_history.items():

if total_duration > max_duration:

max_duration = total_duration

max_website_id = website_id

print(f"The website with key_id '{max_website_id}' has the max.duration of visit =


{max_duration}.")
B.E/B.Tech. DEGREE EXAMINATIONS, NOV/DEC- 2023
CS 3352 – FOUNDATIONS OF DATA SCIENCE
Answer ALL questions.
PART A-(10 x 2 = 20 marks)
1. What is Structured data?
Structured data refers to data that is organized into a fixed format such as rows and
columns, making it easily searchable in relational databases.
Example: A table of student records with fields like Name, ID, Age, and Marks.
2. Give an overview of common errors.
Common errors in data science include:
 Syntax Errors: Mistakes in code syntax.
 Runtime Errors: Errors that occur during execution.
 Logical Errors: Incorrect output due to flawed logic.
 Data Errors: Missing, duplicate, or inconsistent data.
3. Explain the types of data.
 Quantitative Data: Numerical (e.g., height, weight).
o Discrete: Countable (e.g., number of students).
o Continuous: Measurable (e.g., temperature).
 Qualitative Data: Categorical (e.g., colors, gender).
o Nominal: No order (e.g., city names).
o Ordinal: With order (e.g., rankings).
4. Define median with example.
The median is the middle value in a sorted list. Example: For the data [3, 5, 8], the
median is 5. If the list is even: [3, 5, 7, 9] → Median = (5+7)/2 = 6
5. Define multiple regressions.
Multiple regression is a statistical technique used to predict the value of a dependent
variable using two or more independent variables.
Example: Predicting house prices using size, location, and number of rooms.
6. Define regression towards the mean.
Regression towards the mean refers to the phenomenon where extreme values tend to
move closer to the average on subsequent measurements.
7. What are the key properties of Pearson correlation coefficient?
 Ranges from -1 to +1
 +1: Perfect positive linear relationship
 -1: Perfect negative linear relationship
 0: No linear relationship
 Measures the strength and direction of a linear relationship between two variables.
8. Summarize some built-in Pandas aggregations?
Common built-in Pandas aggregation functions include:
 sum() – Total
 mean() – Average
 min() / max() – Minimum / Maximum
 count() – Number of entries
 std() – Standard deviation
 median() – Middle value
9. Explain partial sort.
Partial sort arranges a portion of the data in order rather than the entire dataset.
Example: Finding the top 3 minimum values in an unsorted list.
10. Give a summary about the comparison operators.
Comparison operators are used to compare values:
 == → Equal to
 != → Not equal to
 > → Greater than
 < → Less than
 >= → Greater than or equal to
 <= → Less than or equal to
They return Boolean values: True or False.
PART B (5 × 13 = 65 marks)
11.(a) Explain the different facets of data with example.
Structured Data
• Structured data is arranged in a row and column format, which makes it easy for applications
to retrieve and process the data. A database management system is used for storing structured data.
• The term structured data refers to data that is identifiable because it is organized in a
structure. The most common form of structured data or records is a database where specific
information is stored based on a methodology of columns and rows.
• Structured data is also searchable by data type within content. Structured data is understood
by computers and is also efficiently organized for human readers.
• An Excel table is an example of structured data.
Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and columns are not
used for unstructured data, so it is difficult to retrieve the required information.
Unstructured data has no identifiable structure.
• The unstructured data can be in the form of Text: (Documents, email messages, customer
feedbacks), audio, video, images. Email is an example of unstructured data.
• Even today, in most organizations more than 80% of the data is in unstructured
form. It carries a lot of information, but extracting information from these various sources
is a very big challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize characters, words and
sentences, then apply meaning and understanding to that information. This helps machines to
understand language as humans do.
• Natural language processing is the driving force behind machine intelligence in many
modern real-world applications. The natural language processing community has had success
in entity recognition, topic recognition, summarization, text completion and sentiment
analysis.
• For natural language processing to help machines understand human language, it must go
through speech recognition, natural language understanding and machine translation. It is an
iterative process composed of several layers of text analysis.
Machine - Generated Data
• Machine-generated data is an information that is created without human interaction as a
result of a computer process or application activity. This means that data entered manually by
an end-user is not recognized to be machine-generated.
• Machine data contains a definitive record of all activity and behavior of our customers,
users, transactions, applications, servers, networks, factory machinery and so on.
• Examples of machine data are web server logs, call detail records, network event logs and
telemetry.
• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate
machine data. Machine data is generated continuously by every processor-based system, as
well as many consumer-oriented systems.
Graph-based or Network Data
•Graphs are data structures to describe relationships and interactions between entities in
complex systems. In general, a graph contains a collection of entities called nodes and
another collection of interactions between a pair of nodes called edges.
• Nodes represent entities, which can be of any object type that is relevant to our problem
domain. By connecting nodes with edges, we will end up with a graph (network) of nodes.
• Graph databases are used to store graph-based data and are queried with specialized query
languages such as SPARQL.
• Graph databases are capable of sophisticated fraud prevention. With graph databases, we
can use relationships to process financial and purchase transactions in near-real time. With
fast graph queries, we are able to detect that, for example, a potential purchaser is using the
same email address and credit card as included in a known fraud case.

• Graph theory has proved to be very effective on large-scale datasets such as social network
data. This is because it is capable of by-passing the building of an actual visual representation
of the data to run directly on data matrices.
Audio, Image and Video
• Audio, image and video are data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be
challenging for computers.
• The terms audio and video commonly refer to the time-based media storage formats for
sound/music and moving-picture information. Audio and video digital recordings, also
referred to as audio and video codecs, can be uncompressed, losslessly compressed or lossy
compressed depending on the desired quality and use cases.
• It is important to remark that multimedia data is one of the most important sources of
information and knowledge; the integration, transformation and indexing of multimedia data
bring significant challenges in data management and analysis. Many challenges have to be
addressed including big data, multidisciplinary nature of Data Science and heterogeneity.
Streaming Data
Streaming data is data that is generated continuously by thousands of data sources,
which typically send in the data records simultaneously and in small sizes (order of
Kilobytes).
Streaming data includes a wide variety of data such as log files generated by
customers using your mobile or web applications, ecommerce purchases, in-game player
activity, information from social networks, financial trading floors or geospatial services and
telemetry from connected devices or instrumentation in data centers.
(Or)
11.(b) Explain in detail about cleansing, integrating and transforming data and building
a model.
Data Cleaning
• Data is cleansed through processes such as filling in missing values, smoothing the noisy
data or resolving the inconsistencies in the data.
• Data cleaning tasks are as follows:
1. Data acquisition and metadata
2. Fill in missing values
3. Unified date format
4. Converting nominal to numeric
5. Identify outliers and smooth out noisy data
6. Correct inconsistent data
• Data cleaning is the first step in data pre-processing; it is used to find missing values,
smooth noisy data, recognize outliers and correct inconsistencies.
• Missing values: Such dirty data affects the mining procedure and leads to unreliable and
poor output, so data cleaning routines are important. For example, suppose
that the average salary of staff is Rs. 65000/-; this value can be used to replace a missing value
for salary.
• Data entry errors: Data collection and data entry are error-prone processes. They often
require human intervention and because humans are only human, they make typos or lose
their concentration for a second and introduce an error into the chain. But data collected by
machines or computers isn't free from errors either. Some errors arise from human sloppiness,
whereas others are due to machine or hardware failure. Examples of errors originating from
machines are transmission errors or bugs in the extract, transform and load phase (ETL).
• Whitespace error: Whitespaces tend to be hard to detect but cause errors like other
redundant characters would. To remove the spaces present at start and end of the string, we
can use strip() function on the string in Python.
• Fixing capital letter mismatches: Capital letter mismatches are a common problem. Most
programming languages make a distinction between "Chennai" and "chennai".
• Python provides string conversion like to convert a string to lowercase, uppercase using
lower(), upper().
• The lower() Function in python converts the input string to lowercase. The upper() Function
in python converts the input string to uppercase.
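
A minimal sketch of these cleaning steps in pandas, using a small illustrative DataFrame:

Python
import pandas as pd
import numpy as np

# Illustrative data with stray whitespace, a capitalisation mismatch and a missing salary
df = pd.DataFrame({'city': ['  Chennai ', 'chennai', 'Madurai'],
                   'salary': [65000, np.nan, 70000]})

df['city'] = df['city'].str.strip()                       # remove leading/trailing whitespace
df['city'] = df['city'].str.lower()                       # fix capital letter mismatches
df['salary'] = df['salary'].fillna(df['salary'].mean())   # fill the missing salary with the mean
print(df)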
Outlier
• Outlier detection is the process of detecting and subsequently excluding outliers from a
given set of data. The easiest way to find outliers is to use a plot or a table with the minimum
and maximum values.
• An outlier may be defined as a piece of data or observation that deviates drastically from the
given norm or average of the data set. An outlier may be caused simply by chance, but it may
also indicate measurement error or that the given data set has a heavy-tailed distribution.
Combining Data from Different Data Sources
1. Joining table
• Joining tables allows user to combine the information of one observation found in one table
with the information that we find in another table. The focus is on enriching a single
observation.
• A primary key is a value that cannot be duplicated within a table. This means that one value
can only be seen once within the primary key column. That same key can exist as a foreign
key in another table which creates the relationship. A foreign key can have duplicate
instances within a table.

2. Appending tables
• Appending tables is also called stacking tables. It effectively adds observations from one table to
another table. The result of appending these tables is a larger one with the observations from
Table 1 as well as Table 2. The equivalent operation in set theory would be the union and this
is also the command in SQL, the common language of relational databases. Other set
operators are also used in data science, such as set difference and intersection.
3. Using views to simulate data joins and appends
• Duplication of data is avoided by using views. An appended table requires more
storage space; if the table size is in terabytes of data, it becomes problematic to duplicate
the data. For this reason, the concept of a view was invented.

Transforming Data
• In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Relationships between an input variable and an output variable aren't always linear.
• Reducing the number of variables: Having too many variables in the model makes the
model difficult to handle, and certain techniques don't perform well when you overload them
with too many input variables.
• All the techniques based on a Euclidean distance perform well only up to 10 variables. Data
scientists use special methods to reduce the number of variables but retain the maximum
amount of data.
Euclidean distance:
• Euclidean distance is used to measure the similarity between observations. It is calculated as
the square root of the sum of the squared differences between the coordinates of the two points.
Euclidean distance = √((X₁ − X₂)² + (Y₁ − Y₂)²)
Turning variables into dummies:
• Variables can be turned into dummy variables. Dummy variables can only take two values:
true (1) or false (0). They're used to indicate the presence or absence of a categorical effect that may
explain the observation.
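
A quick illustrative sketch of creating dummy variables with pandas (a hypothetical 'city' column):

Python
import pandas as pd

df = pd.DataFrame({'city': ['Chennai', 'Madurai', 'Chennai', 'Trichy']})
dummies = pd.get_dummies(df['city'])   # one 0/1 column per category
print(dummies)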

Build the Models


• To build the model, the data should be clean and its content should be properly understood. The
components of model building are as follows:
a) Selection of model and variable
b) Execution of model
c) Model diagnostic and model comparison
• Building a model is an iterative process. Most models consist of the following main steps:
1. Selection of a modelling technique and variables to enter in the model
2. Execution of the model
3. Diagnosis and model comparison
Model Execution
Various programming languages are used for implementing the model. For model
execution, Python provides libraries like StatsModels or scikit-learn. These packages use
several of the most popular techniques. Coding a model is a nontrivial task in most cases, so
having these libraries available can speed up the process. Following are the remarks on
output:
a) Model fit: R-squared or adjusted R-squared is used.
b) Predictor variables have a coefficient: For a linear model this is easy to interpret.
c) Predictor significance: Coefficients are great, but sometimes not enough evidence exists
to show that the influence is there.
Following commercial tools are used :
 SAS enterprise miner: This tool allows users to run predictive and descriptive
models based on large volumes of data from across the enterprise.
 SPSS modeler: It offers methods to explore and analyse data through a GUI.
 Matlab: Provides a high-level language for performing a variety of data analytics,
algorithms and data exploration.
 Alpine miner: This tool provides a GUI front end for users to develop analytic
workflows and interact with Big Data tools and platforms on the back end.
Model Diagnostics and Model Comparison
Try to build multiple models and then select the best one based on multiple criteria. Working with
a holdout sample helps the user pick the best-performing model.
• In Holdout Method, the data is split into two different datasets labeled as a training and a
testing dataset. This can be a 60/40 or 70/30 or 80/20 split. This technique is called the hold-
out validation technique.
Suppose we have a database with house prices as the dependent variable and two
independent variables showing the square footage of the house and the number of rooms.
Now, imagine this dataset has 30 rows. The whole idea is that you build a model that can
predict house prices accurately.
To 'train' our model or see how well it performs, we randomly subset 20 of those rows
and fit the model. The second step is to predict the values of those 10 rows that we excluded
and measure how accurate our predictions were.
As a rule of thumb, experts suggest to randomly sample 80% of the data into the
training set and 20% into the test set.
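
A minimal sketch of such an 80/20 holdout split (assuming scikit-learn is available; the housing numbers are made up):

Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hypothetical housing data: square footage, number of rooms and price
df = pd.DataFrame({'sqft':  [800, 950, 1100, 1300, 1500, 1700, 2000, 2200, 2500, 2800],
                   'rooms': [2, 2, 3, 3, 3, 4, 4, 4, 5, 5],
                   'price': [60, 70, 85, 95, 110, 125, 150, 165, 190, 210]})

X, y = df[['sqft', 'rooms']], df['price']

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))   # R-squared on the held-out data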

12. (a) (i) Explain Normal Curve and Z-Score.


 Normal Curve: A symmetrical, bell-shaped curve where mean = median = mode.
 Z-score: Indicates how many standard deviations a value lies from the mean: z = (x − μ) / σ.

12.(a).(ii) Using standard normal curve table, find the proportion of the total area
identified with the following statements.
1. Above a Z score of 1.80
2. Between the mean and a z score of 1.65
3. Between z scores of 0 and -1.96
Answer:
1. Area above z = 1.80: 1 − 0.9641 = 0.0359
2. Area between the mean and z = 1.65: 0.9505 − 0.5000 = 0.4505
3. Area between z = 0 and z = −1.96: 0.4750
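
These proportions can be verified in Python with scipy (a quick sketch, assuming scipy is installed):

Python
from scipy.stats import norm

print(1 - norm.cdf(1.80))      # area above z = 1.80, about 0.0359
print(norm.cdf(1.65) - 0.5)    # area between the mean and z = 1.65, about 0.4505
print(0.5 - norm.cdf(-1.96))   # area between z = 0 and z = -1.96, about 0.4750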
(Or)
12.(b) (i) Describe the types of variables.
(ii). Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with
the following results:

Age 23 27 39 49 50 52 54 56 57 58 60

% fat 9.5 17.8 31.4 27.2 31.2 34.6 42.5 33.4 30.2 34.1 41

Draw the boxplots for age .


Types of variables.
Variables in research and statistics can be classified into several types based on their
nature and role. Key classifications include quantitative vs. qualitative, discrete vs.
continuous, and independent vs. dependent. Additionally, there are control, confounding, and
intervening variables that play specific roles in research.
Quantitative vs. Qualitative Variables:
 Quantitative variables
represent numerical measurements, indicating magnitude or amount. Examples include age,
height, weight, or temperature.
 Discrete variables: are quantitative variables that can only take specific,
separate values, usually whole numbers. Examples include the number of
students in a class or the number of cars passing a point on a road.
 Continuous variables: are quantitative variables that can take any value
within a given range. Examples include height, weight, or temperature.
 Qualitative variables
represent characteristics or categories that cannot be measured numerically. Examples
include color, gender, or type of car.
 Nominal variables: are qualitative variables that have no inherent order or
ranking. Examples include colors (red, blue, green) or types of fruit (apple,
banana, orange).
 Ordinal variables: are qualitative variables with a meaningful order or
ranking. Examples include levels of satisfaction (unsatisfied, neutral, satisfied)
or education level (high school, bachelor's, master's).
 Binary variables: are a special case of qualitative variables with only two
categories (e.g., yes/no, true/false).
Independent vs. Dependent Variables:
 Independent variables
are manipulated or changed in an experiment to observe their effect on another variable. They
are considered the "cause" in a cause-and-effect relationship.
 Dependent variables
are the variables being measured or observed in response to changes in the independent
variable. They are considered the "effect" in a cause-and-effect relationship.
Control Variables:
 Control variables: are kept constant or controlled during an experiment to minimize
their influence on the dependent variable. This helps isolate the effect of the
independent variable.
Confounding Variables:
 Confounding variables: are variables that are related to both the independent and
dependent variables, potentially obscuring the true relationship between them.
Intervening Variables:
 Intervening variables: are theoretical variables that are used to explain the
relationship between an independent and dependent variable.
Understanding these different types of variables is crucial for designing and interpreting
research studies effectively.
12.(b).(ii). Suppose that a hospital tested the age and body fat data for 18 randomly selected adults
with the following results:

Age 23 27 39 49 50 52 54 56 57 58 60

% fat 9.5 17.8 31.4 27.2 31.2 34.6 42.5 33.4 30.2 34.1 41
Draw the boxplots for age .

Boxplot for the given Age data.


 Median (central line): 52
 Quartiles: Q1 ≈ 44 and Q3 ≈ 56.5 (matplotlib's default linear-interpolation percentiles), so IQR ≈ 12.5
 Whiskers: extend to 27 and 60, the most extreme values within 1.5 × IQR of the quartiles
 Outlier: age 23 lies below Q1 − 1.5 × IQR ≈ 25.25, so it is plotted as a separate circle
This visualization helps in understanding the data distribution, its central tendency, and any
outliers present.
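
A minimal matplotlib sketch that produces this boxplot from the given ages:

Python
import matplotlib.pyplot as plt

ages = [23, 27, 39, 49, 50, 52, 54, 56, 57, 58, 60]

plt.figure(figsize=(4, 6))
plt.boxplot(ages)              # default whiskers at 1.5 x IQR; age 23 is shown as an outlier
plt.title('Boxplot of Age')
plt.ylabel('Age (years)')
plt.show()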
13. (a) (i) Explain Scatter Plot.
A scatter plot (or scatter diagram) is a type of graph used to display the relationship
between two continuous variables. It consists of points plotted on a two-dimensional
graph, where:
 The x-axis represents one variable
 The y-axis represents the second variable
Key Features:
 Each point represents one observation in the dataset.
 Helps visualize correlation or patterns between variables.
 Ideal for identifying:
o Positive correlation (as one variable increases, so does the other)
o Negative correlation (as one increases, the other decreases)
o No correlation (random scatter of points)
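
A short illustrative sketch (with made-up height/weight data) that draws such a scatter plot:

Python
import matplotlib.pyplot as plt

# Hypothetical data: height (cm) on the x-axis, weight (kg) on the y-axis
height = [150, 155, 160, 165, 170, 175, 180, 185]
weight = [50, 54, 58, 61, 66, 70, 75, 80]

plt.scatter(height, weight)
plt.title('Height vs Weight')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.show()                     # the upward trend indicates a positive correlation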
13.(a).(ii) Describe Range and Variance
Range
 Definition:
The range is the difference between the maximum and minimum values in a dataset.
It gives a basic measure of how spread out the data is.

Formula:

 Range=Maximum Value−Minimum Value

Example:
For the ages: 23, 27, 39, 49, 50, 52, 54, 56, 57, 58, 60

Range=60−23=37

Interpretation:
A larger range indicates greater variability in the data.

Variance

Variance measures the average squared deviation of each data point from the mean.
It gives a more accurate picture of data spread than range.

Formula (Population):

σ² = Σ(xᵢ − x̄)² / n

Where:

 xi = individual data point


 xˉ = mean of data
 n = number of data points

Example (Sample Variance):


For the sample data: 10, 12, 14

 Mean: x̄ = (10 + 12 + 14) / 3 = 12
 Variance: s² = [(10 − 12)² + (12 − 12)² + (14 − 12)²] / (3 − 1) = (4 + 0 + 4) / 2 = 4
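
These values can be checked quickly with NumPy:

Python
import numpy as np

data = [10, 12, 14]
print(np.ptp(data))            # range = max - min = 4
print(np.var(data, ddof=1))    # sample variance = 4.0
print(np.var(data))            # population variance = 8/3, about 2.67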

(Or)

13.(b) (i) Explain the Correlation Coefficient.


The correlation coefficient, r, is a summary measure that describes the extent of the statistical
relationship between two interval or ratio level variables.

Properties of r

 The correlation coefficient is scaled so that it is always between -1 and +1.

 When r is close to 0 this means that there is little relationship between the variables and the
farther away from 0 r is, in either the positive or negative direction, the greater the
relationship between the two variables.

 The sign of r indicates the type of linear relationship, whether positive or negative.

 The numerical value of r, without regard to sign, indicates the strength of the linear
relationship.

 A number with a plus sign (or no sign) indicates a positive relationship, and a number with
a minus sign indicates a negative relationship
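
A small sketch computing r with NumPy for two made-up variables:

Python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

r = np.corrcoef(x, y)[0, 1]    # Pearson correlation coefficient
print(round(r, 3))             # about 0.775, a fairly strong positive linear relationship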

13.(b).(ii) Explain Least Squares Equation with Example.

The least squares equation, also known as the least squares regression, is a method for finding
the best-fitting line (or curve) to a set of data points by minimizing the sum of the squares of
the differences between the observed values and the values predicted by the line. In simpler
terms, it finds the line that's closest to all the data points, minimizing the overall "error".

Explanation:

1. The Goal:

The primary goal is to find a line (or curve) that best represents the relationship between two
or more variables in a dataset. This line is often represented by an equation like y = a + bx,
where 'a' is the y-intercept and 'b' is the slope.

2. Minimizing Squared Errors:


The least squares method achieves this by minimizing the sum of the squares of the vertical
distances (residuals) between each data point and the line. These vertical distances represent
the errors in prediction.

3. Why Squares?

Squaring the errors ensures that all differences are positive, preventing negative errors from
cancelling out positive errors and leading to a more accurate representation of the overall
error magnitude.

4. Finding the Best Fit:

The method calculates the values for 'a' and 'b' that result in the smallest possible sum of
squared errors, thus determining the line of best fit.
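
A brief worked sketch of the least squares estimates, reusing the same made-up x/y data as in the correlation sketch above:

Python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2),  a = mean_y - b * mean_x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(a, b)                    # intercept a = 2.2, slope b = 0.6, i.e. y = 2.2 + 0.6x

# np.polyfit returns the same line of best fit (slope first, then intercept)
print(np.polyfit(x, y, 1))     # [0.6, 2.2]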

14. (a) Explain Grouping in Python with Example.

Grouping in Python with Example

In Python, the concept of "grouping" is most commonly and effectively implemented


using the groupby() method from the pandas library. The groupby() operation is a
powerful and essential tool for data analysis, allowing you to split a dataset into groups based
on some criteria, apply a function to each group, and then combine the results.

This process is often described as the "Split-Apply-Combine" strategy:

1. Split: The data is divided into groups based on a specific key or set of keys (e.g., a
column or multiple columns in a DataFrame). Each unique value in the key column
becomes a separate group.
2. Apply: A function is applied to each individual group. This function can be an
aggregation (like sum(), mean(), or count()), a transformation (which returns a
result with the same size as the group), or a filtration (which discards certain groups).

3. Combine: The results of the applied function from each group are combined into a
new data structure (usually a DataFrame or Series).

Example using pandas

Let's illustrate grouping with a practical example using a pandas DataFrame. First, you need
to have the pandas library installed. You can install it using pip:

pip install pandas

python

import pandas as pd

# Create a sample DataFrame

data = {'Product': ['A', 'B', 'A', 'C', 'B', 'A'],


'Region': ['East', 'West', 'East', 'North', 'West', 'South'],

'Sales': [100, 150, 120, 80, 200, 90]}

df = pd.DataFrame(data)

# Group by 'Product' and calculate the sum of 'Sales' for each product

grouped_sales = df.groupby('Product')['Sales'].sum()

print(grouped_sales)

Explanation:
 df.groupby('Product'): This splits the DataFrame df into groups based on the unique
values in the 'Product' column ('A', 'B', 'C').
 ['Sales']: This selects the 'Sales' column within each of these groups for the
subsequent operation.
 .sum(): This applies the sum aggregation function to the 'Sales' column of each
product group.
The output grouped_sales will be a Series showing the total sales for each product:

Product

A 310

B 350

C 80

Name: Sales, dtype: int64

Common Aggregation Functions with groupby():

Function Description
.sum() Sum of values
.mean() Average
.count() Count of entries
.max() Maximum value
.min() Minimum value
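
The apply step is not limited to simple aggregations; a short sketch of a transformation, a filtration and multiple aggregations on the same kind of DataFrame:

Python
import pandas as pd

df = pd.DataFrame({'Product': ['A', 'B', 'A', 'C', 'B', 'A'],
                   'Region':  ['East', 'West', 'East', 'North', 'West', 'South'],
                   'Sales':   [100, 150, 120, 80, 200, 90]})

# Transformation: returns a result with the same length as the original DataFrame
df['Sales_share'] = df['Sales'] / df.groupby('Product')['Sales'].transform('sum')

# Filtration: keep only the rows of products whose total sales exceed 300
high_volume = df.groupby('Product').filter(lambda g: g['Sales'].sum() > 300)
print(high_volume)

# Multiple aggregations at once
print(df.groupby('Product')['Sales'].agg(['sum', 'mean', 'count']))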
(Or)
14. (b) Explain the following in Python
(i) Data Indexing
(ii) Operation on missing data
Answer:
(i) Data Indexing in Python

Data indexing is a fundamental concept in Python, especially when working with data
structures like lists, tuples, dictionaries, and more complex structures from libraries like
pandas. It refers to the process of selecting and accessing specific data elements or subsets of
data.

Lists and Tuples: These are sequence types, and you access elements by their integer
position (index), starting from 0.

Python

my_list = ['apple', 'banana', 'cherry']


print(my_list[0]) # Output: 'apple'
print(my_list[1:3]) # Slicing: Output: ['banana', 'cherry']
print(my_list[-1]) # Negative indexing: Output: 'cherry' (last element)

Dictionaries: Dictionaries are key-value pairs, and you access values by their
associated key.

Python

my_dict = {'name': 'Alice', 'age': 30, 'city': 'New York'}


print(my_dict['name']) # Output: 'Alice'
Data Indexing in pandas

In the context of data analysis, indexing becomes more sophisticated with the pandas
library. pandas DataFrames and Series have powerful indexing capabilities, allowing for
selection based on position, labels, or boolean conditions.

Label-based indexing (.loc): Used to select data by row and column labels.

Python

import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])
print(df.loc['y', 'B']) # Output: 5
print(df.loc[['x', 'z'], 'A']) # Output: Series with values for 'x' and 'z' in column 'A'

Position-based indexing (.iloc): Used to select data by integer position, similar to lists.

Python

print(df.iloc[1, 1]) # Output: 5 (Row at index 1, Column at index 1)


print(df.iloc[0:2, :]) # Output: first two rows, all columns

Boolean indexing: Used to select data based on a condition.

Python
df = pd.DataFrame({'A': [10, 20, 30], 'B': [1, 2, 3]})
print(df[df['A'] > 15]) # Output: Rows where column 'A' is greater than 15

(ii) Operation on Missing Data

Missing data, often represented as NaN (Not a Number) in pandas, is a common issue in real-
world datasets. Handling it properly is crucial for accurate analysis. pandas provides a
comprehensive set of functions to detect, remove, and fill missing values.

1.Detecting Missing Data

The first step is to identify where the missing values are.

 isnull(): Returns a boolean DataFrame or Series with True where the value is
missing.
 isna(): An alias for isnull().

 notnull(): Returns a boolean DataFrame or Series with True where the value is not
missing.

Python
import numpy as np
data = {'A': [1, 2, np.nan], 'B': [4, np.nan, 6]}
df_missing = pd.DataFrame(data)

print(df_missing.isnull())
Output:
A B
0 False False
1 False True
2 True False

2. Dropping Missing Data


If the amount of missing data is small and won't significantly impact the analysis,
you can simply remove the rows or columns with missing values.

 dropna(): Drops rows or columns containing missing values.


o df.dropna(): Drops any row with at least one NaN.

o df.dropna(how='all'): Drops a row only if all its values are NaN.

o df.dropna(axis=1): Drops columns with NaN values.

Python
# Drops the row with NaN (row 1 and 2)
df_dropped = df_missing.dropna()
print(df_dropped)
Output:
A B
0 1.0 4.0
3. Filling Missing Data

Often, you don't want to discard data. Instead, you can fill the missing values with a
substitute.

 fillna(): Fills NaN values with a specified value.


o df.fillna(0): Fills all NaN with 0.

o df['A'].fillna(df['A'].mean()): Fills NaN in column 'A' with the mean


of that column.

o df.fillna(method='ffill'): Forward-fill, fills NaN with the previous non-


missing value.

o df.fillna(method='bfill'): Backward-fill, fills NaN with the next non-


missing value.

Python
# Fills NaN values with the mean of each column
df_filled = df_missing.fillna(df_missing.mean())
print(df_filled)
Output:
A B
0 1.0 4.0
1 2.0 5.0
2 1.5 6.0

By using these operations, you can effectively manage and prepare your data for
analysis, ensuring that missing values do not lead to errors or biased results.

15. (a) Explain the different types of joins in Python.


In Python, particularly within the Pandas library for data manipulation, "joins" refer to
methods of combining two or more DataFrames based on common columns or indices. These
operations are analogous to SQL joins. The primary types of joins are:
Inner Join:
Returns only the rows where there is a match in the specified key column(s) in both
DataFrames. Non-matching rows from either DataFrame are excluded from the result.
Left Join (Left Outer Join):
Returns all rows from the "left" DataFrame and the matching rows from the "right"
DataFrame. If a row in the left DataFrame has no match in the right DataFrame, NaN (or
equivalent null values) will be filled for the columns from the right DataFrame.
Right Join (Right Outer Join):
Returns all rows from the "right" DataFrame and the matching rows from the "left"
DataFrame. If a row in the right DataFrame has no match in the left DataFrame, NaN will
be filled for the columns from the left DataFrame.
Full Outer Join:
Returns all rows when there is a match in either the left or the right DataFrame. This
includes all matching rows, plus all non-matching rows from both DataFrames,
with NaN for non-matching columns.
Cross Join:
Returns the Cartesian product of the two DataFrames. Every row from the left
DataFrame is combined with every row from the right DataFrame, resulting in a DataFrame
with len(left_df) * len(right_df) rows.
Implementation in Pandas:
These joins are typically performed using the pd.merge() function in Pandas, where
the how parameter specifies the type of join:
Python
import pandas as pd

# Example DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value2': [5, 6, 7, 8]})

# Inner Join
inner_join_df = pd.merge(df1, df2, on='key', how='inner')

# Left Join
left_join_df = pd.merge(df1, df2, on='key', how='left')

# Right Join
right_join_df = pd.merge(df1, df2, on='key', how='right')

# Full Outer Join


outer_join_df = pd.merge(df1, df2, on='key', how='outer')

# Cross Join
cross_join_df = pd.merge(df1, df2, how='cross')
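
Continuing the snippet above, the inner join keeps only the keys common to both DataFrames, while the full outer join keeps every key (expected output shown as comments, formatting approximate):

Python
print(inner_join_df)
#   key  value1  value2
# 0   B       2       5
# 1   D       4       6

print(outer_join_df)
#   key  value1  value2
# 0   A     1.0     NaN
# 1   B     2.0     5.0
# 2   C     3.0     NaN
# 3   D     4.0     6.0
# 4   E     NaN     7.0
# 5   F     NaN     8.0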
(Or)
15.(b).Explain the various features of Matplotlib platform used for data visualization
and illustrate its challenges.

Matplotlib is a foundational and widely used library in Python for creating static,
animated, and interactive visualizations. Its power lies in its extensive features and its role as
the building block for many other high-level plotting libraries.

Various Features of Matplotlib


 Versatility and Customization: Matplotlib's most significant feature is the granular
control it offers over every aspect of a plot. From figures and subplots to lines, fonts,
colors, and markers, you can precisely tailor your visualizations to meet specific
requirements. This is invaluable for creating publication-quality graphics.
 Wide Range of Plot Types: Matplotlib can generate a vast array of plots, from basic
to highly complex. This includes:

o 2D Plots: Line plots, scatter plots, bar charts, histograms, pie charts, box plots.

o 3D Plots: Surface plots, wireframe plots, and 3D scatter plots using the
mpl_toolkits.mplot3d toolkit.

o Advanced Visualizations: Heatmaps, contour plots, stream plots, and more.

 Two Interfaces: Matplotlib offers two main interfaces for plotting (a short sketch
contrasting them appears after this feature list):

o Pyplot API: A state-based, MATLAB-style interface that is quick and easy
for simple plots. It automatically handles the creation of figures and axes. For
example, plt.plot(x, y) is a common Pyplot function.

o Object-Oriented API: This is a more powerful and flexible approach. You
explicitly create Figure and Axes objects and then call methods on those
objects. This is the recommended approach for creating complex plots or
working with multiple subplots.

 Integration with the Python Ecosystem: Matplotlib seamlessly integrates with other
key libraries in the scientific Python stack, such as NumPy and pandas. This allows
for efficient data manipulation and direct visualization within the same environment.
Many pandas plotting functions, for instance, use Matplotlib under the hood.

 Subplots and Layout Management: You can create complex layouts with multiple
plots in a single figure using plt.subplots(). This is essential for comparing
different datasets or different aspects of the same data side-by-side. You have fine-
grained control over the placement, size, and spacing of these subplots.

 Export to Various Formats: Matplotlib can save figures in a wide range of formats,
including raster formats like PNG and JPG, and vector formats like PDF, SVG, and
EPS. This is crucial for reports, presentations, and publications, as vector formats are
scalable without loss of quality.

 Interactive Features: In interactive environments like Jupyter notebooks, Matplotlib


plots can be interactive, allowing users to zoom, pan, and save figures. This is useful
for exploring data dynamically.
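
A minimal sketch contrasting the two interfaces mentioned above (drawing the same line plot both ways):

Python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# Pyplot (state-based) interface
plt.plot(x, y)
plt.title('Pyplot interface')
plt.show()

# Object-oriented interface: explicit Figure and Axes objects
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Object-oriented interface')
plt.show()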

Challenges of Matplotlib

Despite its powerful features, Matplotlib has some challenges that can be a hurdle for new
users and experienced developers alike.
 Steep Learning Curve: Matplotlib's vast functionality and two different APIs can be
confusing for beginners. The distinction between Figure and Axes objects and when
to use the Pyplot vs. the object-oriented approach can be a source of frustration.
Simple tasks are easy, but complex customizations require a deep understanding of
the API.
 Verbosity: To achieve a polished, publication-quality plot, a significant amount of
code is often required. Customizing every element—such as labels, titles, tick marks,
and colors—can make the code long and sometimes difficult to read, especially when
compared to higher-level libraries like Seaborn or Plotly.

 Outdated Default Aesthetics: By default, Matplotlib's plots can sometimes look


dated or less aesthetically pleasing than those generated by more modern libraries.
While you can customize every aspect, achieving a modern look often requires extra
effort and a good understanding of styling. The introduction of stylesheets has helped
with this, but it's still a common criticism.

 Complexity of Advanced Plots: Creating advanced plots, such as complex multi-


panel figures, shared axes subplots, or custom layouts, can be cumbersome. Users
must meticulously manage figure and axes objects, handle spacing, and coordinate
transformations, which can be a difficult and time-consuming process.

 Not Ideal for Interactive Web Visualizations: While Matplotlib has some
interactive features, it is primarily a library for static plots. For highly interactive,
web-based dashboards and visualizations, libraries like Plotly or Bokeh are often a
better choice, as they are designed from the ground up for this purpose.

16. (a) Describe in detail about pivot table.

A pivot table is a data summarization tool used in data analysis to quickly and
efficiently reorganize, aggregate, and present data. Its primary purpose is to help users gain
insights from large datasets by summarizing them in a new, more understandable format.

In Python, the most common and powerful way to create a pivot table is by using the
pivot_table() function from the pandas library. This function is a highly versatile tool that
goes beyond the basic functionality of a spreadsheet pivot table, offering extensive control
over aggregation and data structure.

A pivot table typically involves four key components that you use to structure your
summary:

 Index (Rows): The columns that you want to use as the rows of your new summary
table. These columns define the "groups" you'll be analyzing. The unique values from
these columns will become the row labels of the pivot table.
 Columns: The columns that you want to use as the columns of your new summary
table. These are also used for grouping, but the groups are arranged horizontally.

 Values: The numerical data (or other data types that can be aggregated) that you want
to summarize. These are the values that will be aggregated in the cells of the pivot
table.
 Aggregation Function (aggfunc): The function that is applied to the values for each
group. Common aggregation functions include sum, mean, count, min, max, median,
etc. You can also pass a list of functions to apply multiple aggregations at once.

The creation of a pivot table follows a three-step process, similar to a groupby operation
but with a more structured output:

 Splitting: The original data is split into groups based on the unique values in the
index and columns you specify.
 Aggregating: A function (aggfunc) is applied to the values of each group. For
example, if you are analyzing sales data by region and product, the aggregation
function would calculate the total sales for each region-product combination.

 Reshaping/Combining: The aggregated results are then combined into a new, two-
dimensional table where the unique values of the index columns form the rows and
the unique values of the columns columns form the columns.

Example with Python and pandas

Let's use a sample dataset to illustrate the power of pd.pivot_table().

Sample Data: Imagine you have sales data for a company with multiple salespersons,
regions, and product categories.

Python
import pandas as pd
import numpy as np

# Create a sample DataFrame


data = {
'Region': ['North', 'South', 'North', 'South', 'East', 'East', 'North'],
'Salesperson': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie', 'Charlie', 'Alice'],
'Category': ['Electronics', 'Furniture', 'Furniture', 'Electronics', 'Electronics',
'Furniture', 'Electronics'],
'Sales': [1000, 1500, 800, 1200, 2000, 900, 1100]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
print("-" * 50)

Now, let's create a pivot table to answer a question like: "What are the total sales for
each product category, broken down by region?"

Python
# Create a pivot table to get total sales by region and category
pivot_sales = df.pivot_table(
index='Region', # Use 'Region' as the rows
columns='Category', # Use 'Category' as the columns
values='Sales', # Aggregate the 'Sales' data
aggfunc='sum' # The aggregation function is 'sum'
)
print("Pivot Table: Total Sales by Region and Category")
print(pivot_sales)

Output of the pivot table:

Category Electronics Furniture


Region
East 2000.0 900.0
North 2100.0 800.0
South 1200.0 1500.0

Explanation of the Output:

 Rows (index): The rows are the unique regions (East, North, South).
 Columns (columns): The columns are the unique categories (Electronics,
Furniture).

 Values (values): The cells contain the sum of sales for each combination. For
example, the value at Region='North' and Category='Electronics' is 2100,
which is the sum of Alice's two sales in that category and region (1000 + 1100).

 NaN Values: If a combination of index and columns has no corresponding data in the
original DataFrame (e.g., no sales of furniture in the 'East' region), a NaN value would
appear. This can be handled with the fill_value parameter.

Advanced Features of pivot_table()

Multiple Aggregations: You can apply multiple functions to the values.

Python

pivot_multiple_agg = df.pivot_table(
index='Region',
columns='Category',
values='Sales',
aggfunc=['sum', 'mean']
)
print("\nPivot Table with Multiple Aggregations:")
print(pivot_multiple_agg)

Multiple Indices/Columns: You can use multiple columns for rows and/or columns to create
a hierarchical index.

Python

pivot_multi_index = df.pivot_table(
index=['Region', 'Salesperson'],
values='Sales',
aggfunc='sum'
)
print("\nPivot Table with Multiple Indices:")
print(pivot_multi_index)

fill_value: You can replace NaN values with a specified value (e.g., 0) to make the table
cleaner.

Python

pivot_filled = df.pivot_table(
index='Region',
columns='Category',
values='Sales',
aggfunc='sum',
fill_value=0
)
print("\nPivot Table with fill_value=0:")
print(pivot_filled)

pivot table is an indispensable tool for summarizing and analyzing data, and pandas'
pivot_table() function provides a flexible and powerful way to create these summaries in
Python. It's a go-to method for transforming long-format data into a wide-format, a process
that is essential for many data analysis tasks.

(Or)
16. (b) Find the following for the given data set:
Mean ,Median, Mode, Variance, Standard deviation and skewness

Marks (Class Interval) Frequency


0–10 10
10–20 40
20–30 20
30–40 28
40–50 50
50–60 40
60–70 16
70–80 14
