CS3352 FDS QP Solved (Anna University)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
PREPARED BY
KALAIVANI.V
AP/CSE
B.E/B.Tech. DEGREE EXAMINATIONS, NOV/DEC-2022
2. List the common errors encountered when retrieving data and the cleansing solutions to be employed.
3. Classify the below list of data into their types: (a) ethnic group (b) age (c) family size
(d) academic major (e) sexual preference (f) IQ score (g) net worth (dollars) (h) third-
place finish (i) gender (j) temperature and write a brief note on them.
6. Consider that Helen sent 10 greeting cards to her friends and received 8 cards back. What kind of relationship is this? Write a brief note on it.
8. Create a data frame with the Key–Data pairs A-10, B-20, A-40, C-5, B-10, C-10. Find the sum of the data for each key and display the result grouped by key.
import pandas as pd
data = {'Key': ['A', 'B', 'A', 'C', 'B', 'C'], 'Data': [10, 20, 40, 5, 10, 10]}
df = pd.DataFrame(data)
# Group by 'Key' and sum the 'Data'
result = df.groupby('Key')['Data'].sum()
print("Sum of each key group:")
print(result)
11. (a) Examine the different facets of data with the challenges in their processing.
Structured Data
Structured data is arranged in a row and column format. This helps applications retrieve and process the data easily. A database management system is used for storing structured data. The term structured data refers to data that is identifiable because it is organized in a structure. The most common form of structured data or records is a database, where specific information is stored based on a methodology of columns and rows.
An Excel table is an example of structured data.
Unstructured Data
Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data, so it is difficult to retrieve the required information. Unstructured data has no identifiable structure. Unstructured data can be in the form of text (documents, email messages, customer feedback), audio, video or images. Email is an example of unstructured data.
Even today, more than 80% of the data in most organizations is in unstructured form. It carries a great deal of information, but extracting that information from such varied sources is a major challenge.
Natural Language
Natural language is a special type of unstructured data (for example, emails, reviews and other human-written text); it is challenging to process because it requires knowledge of linguistics and of specific natural language processing techniques.
This step involves acquiring data from all the identified internal and external sources, which
helps to answer the business question.
It is the collection of the data required for the project. This is the process of gaining a business understanding of the data the user has and deciphering what each piece of data means. This could entail determining exactly what data is required and the best methods for obtaining it. It also entails determining what each of the data points means in terms of the company. If we are given a data set from a client, for example, we need to know what each column and row represents.
Data can have many inconsistencies, such as missing values, blank columns and incorrect data formats, which need to be cleaned. We need to process, explore and condition the data before modeling. Clean data gives better predictions.
In this step, the actual model building process starts. Here, the data scientist splits the dataset into training and testing sets. Techniques like association, classification and clustering are
applied to the training data set. The model, once prepared, is tested against the "testing"
dataset.
Deliver the final baselined model with reports, code and technical documents in this stage. The model is deployed into a real-time production environment after thorough testing. In
this stage, the key findings are communicated to all stakeholders. This helps to decide if the
project results are a success or a failure based on the inputs from the model.
• To understand the project, three concepts must be understood: what, why and how.
a) What is the expectation of the company or organization?
Understanding the domain area of the problem is essential. In many cases, data
scientists will have deep computational and quantitative knowledge that can be broadly
applied across many disciplines. Data scientists have deep knowledge of the methods,
techniques and ways for applying heuristics to a variety of business and conceptual problems.
2. Resources :
As part of the discovery phase, the team needs to assess the resources available to
support the project. In this context, resources include technology, tools, systems, data and
people.
Framing is the process of stating the analytics problem to be solved. At this point, it is
a best practice to write down the problem statement and share it with the key stakeholders.
Each team member may hear slightly different things related to the needs and the problem
and have somewhat different ideas of possible solutions.
The team can identify the success criteria, key risks and stakeholders, which should
include anyone who will benefit from the project or will be significantly impacted by the
project. When interviewing stakeholders, learn about the domain area and any relevant
history from similar analytics projects.
The team should plan to collaborate with the stakeholders to clarify and frame the
analytics problem. At the outset, project sponsors may have a predetermined solution that
may not necessarily realize the desired outcome. In these cases, the team must use its
knowledge and expertise to identify the true underlying problem and appropriate solution.
This person understands the problem and usually has an idea of a potential working solution.
This step involves forming ideas that the team can test with data. Generally, it is best
to come up with a few primary hypotheses to test and then be creative about developing
several more. These Initial Hypotheses form the basis of the analytical tests the team will use
in later phases and serve as the foundation for the findings in phase.
Retrieving Data
Retrieving required data is second phase of data science project. Sometimes Data
scientists need to go into the field and design a data collection process. Many companies will
have already collected and stored the data and what they don't have can often be bought from
third parties.
Much high-quality data is freely available for public and commercial use. Data can be stored in various formats, such as text files and tables in a database. Data may be internal or external.
1. Start working on internal data, i.e. data stored within the company
The first step for data scientists is to verify the internal data. Assess the relevance and quality of the data that is readily available within the company. Most companies have a program for maintaining key data, so much of the cleaning work may already be done. This data can be stored in
official data repositories such as databases, data marts, data warehouses and data lakes
maintained by a team of IT professionals.
Data repository is also known as a data library or data archive. This is a general term
to refer to a data set isolated to be mined for data reporting and analysis. The data repository
is a large database infrastructure, consisting of several databases, that collects, manages and stores data sets for data analysis, sharing and reporting.
• Data repository can be used to describe several ways to collect and store data:
Data warehouse is a large data repository that aggregates data usually from multiple
sources or segments of a business, without the data being necessarily related. Data lake is a
large data repository that stores unstructured data that is classified and tagged with metadata.
Data marts are subsets of the data repository. These data marts are more targeted to what the
data user needs and easier to use.
To build the model, the data should be clean and its content properly understood. The components of model building are as follows:
b) Execution of model
• Building a model is an iterative process. Most models consist of the following main steps:
1. Must the model be moved to a production environment and, if so, would it be easy to
implement?
2. How difficult is the maintenance on the model: how long will it remain relevant if left untouched?
Model Execution
Various programming languages can be used to implement the model. For model execution, Python provides libraries like StatsModels or Scikit-learn. These packages implement several of the most popular techniques. Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the process. Following are the remarks on
output:
b) Predictor variables have a coefficient: For a linear model this is easy to interpret.
c) Predictor significance: Coefficients are great, but sometimes not enough evidence exists
to show that the influence is there.
Linear regression works when we want to predict a value; to classify something, classification models are used. The k-nearest neighbors (k-NN) method is one of the best-known classification methods.
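A minimal scikit-learn sketch of both kinds of model execution mentioned above, using toy data (the data values and parameters are illustrative assumptions):

Python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data: one predictor X, a numeric target y and a class label
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # numeric target for regression
labels = np.array([0, 0, 0, 1, 1])           # class labels for classification

# Linear regression: predict a value and inspect the coefficient
reg = LinearRegression().fit(X, y)
print("Coefficient:", reg.coef_, "Prediction for X=6:", reg.predict([[6]]))

# k-nearest neighbors: classify a new observation
knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print("Predicted class for X=4.5:", knn.predict([[4.5]]))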
1. SAS enterprise miner: This tool allows users to run predictive and descriptive models
based on large volumes of data from across the enterprise.
2. SPSS modeler: It offers methods to explore and analyze data through a GUI.
12. (a) Demonstrate the different types of variables used in data analysis with an
example for each.
Discrete Variable
A discrete variable consists of isolated values separated by gaps; these variables cannot have fractional or decimal values.
Examples:
• Counts, such as the number of children in a family (1, 2, 3, etc., but never 1.5). You can have 20 or 21 cats, but not 20.5.
• The number of heads in a sequence of coin tosses.
• The result of rolling a die.
• The number of patients in a hospital.
• The population of a country.
While discrete variables have no decimal places, the average of these values can be
fractional. For example, families can have only a discrete number of children: 1, 2, 3, etc.
However, the average number of children per family can be 2.2.
Independent Variable
Independent variables (IVs) are the ones that you include in the model to explain or
predict changes in the dependent variable.
Independent indicates that they stand alone and other variables in the model do not
influence them.
Independent variables are also known as predictors, factors, treatment variables,
explanatory variables, input variables, x-variables, and right-hand variables—because
they appear on the right side of the equals sign in a regression equation.
It is a variable that stands alone and isn't changed by the other variables you are trying to measure. For example, someone's age might be an independent variable: other factors (such as what they eat, how much they go to school, how much television they watch) are not going to change the person's age.
Dependent Variable
The dependent variable (DV) is what you want to use the model to explain or
predict. The values of this variable depend on other variables.
It’s also known as the response variable, outcome variable, and left-hand
variable. Graphs place dependent variables on the vertical, or Y, axis.
A dependent variable is exactly what it sounds like: it is something that depends on other factors. For example, a blood sugar reading depends on what food you ate, when you ate it, and so on. Unlike the independent variable, the dependent variable isn't manipulated by the
investigator. Instead, it represents an outcome: the data produced by the experiment.
Confounding Variable
A confounding variable is an extra variable that influences both the independent and the dependent variable, which can make it appear that a relationship exists between them or distort its strength. For example, in a study of exercise and weight loss, diet is a confounding variable because it affects both how much people exercise and how much weight they lose.
(Or)
(b) The number of friends reported by Facebook users is summarized in the following frequency
distribution. FRIENDS
Interval (Friends) Frequency (f)
400+ 2
350–399 5
300–349 12
250–299 17
200–249 23
150–199 49
100–149 27
50–99 29
0–49 36
Total 200
The relative frequencies are calculated by dividing each interval's frequency by the total number of users (200); for example, the relative frequency for the 150–199 interval is 49/200 = .245.
To find the approximate percentile rank of the interval 300–349, divide the cumulative frequency at or below that interval by the total frequency and multiply by 100:
Percentile rank ≈ (cumulative f up to and including 300–349 / total f) × 100
= (36 + 29 + 27 + 49 + 23 + 17 + 12) / 200 × 100
= (193 / 200) × 100 = 96.5
The approximate percentile rank of the interval 300–349 is 96.5. This means that approximately 96.5% of Facebook users reported having 349 or fewer friends.
The histogram below visually represents the frequency distribution of friends among
Facebook users. The x-axis shows the number of friends, and the y-axis shows the frequency
(number of users).
(v) Why would it not be possible to convert to a stem and leaf display?
It would not be possible to convert this frequency distribution into a stem and leaf
display because a stem and leaf display requires raw, individual data points or at least
precise numerical values within each bin.
1. Grouped Data: You are provided with grouped data (frequency distribution), where
individual data points are aggregated into intervals. For example, you know 36 users
have between 0-49 friends, but you don't know the exact number of friends for each
of those 36 users (e.g., whether they have 10, 25, or 48 friends).
2. Loss of Individual Detail: A stem and leaf display requires the "leaf" part to
represent the actual trailing digit(s) of individual data points. Since this individual
detail is lost when data is grouped into a frequency distribution, you cannot
reconstruct a stem and leaf plot. You only know the count within each bin, not the
specific values that contribute to that count.
In essence, a stem and leaf display needs more granular information than what a frequency
distribution provides.
13. (a) (i) Categorize the different types of relationships using Scatter plots. (7)
Scatter plots are powerful graphical tools used to visualize the relationship between
two numerical variables. By observing the pattern of points on a scatter plot, we can
categorize different types of relationships. Here are the main types:
Positive Relationship
Description: As the values of one variable increase, the values of the other variable
also tend to increase. The points on the scatter plot generally form an upward-sloping
straight line.
Example: The relationship between hours studied for an exam and exam score.
Generally, as the hours studied increase, the exam score tends to increase.
Negative Relationship
Description: As the values of one variable increase, the values of the other variable
tend to decrease. The points on the scatter plot generally form a downward-sloping
straight line.
Example: The relationship between number of hours spent watching TV and
grades (GPA). Often, as the hours spent watching TV increase, grades might tend to
decrease.
No Relationship
Description: There is no discernible pattern or trend between the two variables. The
points on the scatter plot appear randomly scattered, forming a cloud with no clear
direction. Changes in one variable do not predict changes in the other.
Example: The relationship between a person's shoe size and their IQ score. There is
no expected correlation between these two variables.
Non-Linear (Curvilinear) Relationship
Description: The variables are related, but the relationship does not follow a straight
line. Instead, the points form a curve (e.g., U-shaped, inverted U-shaped, exponential,
logarithmic).
Sub-types and Examples:
o Curvilinear (e.g., U-shaped):
Example: The relationship between age and reaction time. Reaction time might
initially decrease (improve) with age, then increase (worsen) in older age, forming a U-shape.
o Inverted U-shaped:
Example: The relationship between stress level and performance. Low stress might
lead to low performance, moderate stress to high performance, and very high stress to low
performance, forming an inverted U-shape.
In addition to the direction (positive, negative, none) and form (linear, non-linear), we can
also describe the strength of the relationship, which refers to how closely the points cluster
around the trend (line or curve).
• Strong Relationship: Points are tightly clustered around the trend line or curve. This indicates a high degree of correlation, meaning changes in one variable are strongly associated with predictable changes in the other.
• Weak Relationship: Points are loosely scattered around the trend line or curve, indicating a low degree of correlation; changes in one variable are only weakly associated with changes in the other.
By observing these patterns on a scatter plot, data analysts can gain valuable insights
into the interdependencies between variables, which is crucial for decision-making, predictive
modeling, and further statistical analysis.
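A minimal Matplotlib sketch of these patterns (the data below are randomly generated purely for illustration):

Python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
x = rng.rand(50)
y_pos = 2 * x + 0.1 * rng.randn(50)    # positive linear relationship
y_none = rng.rand(50)                  # no relationship

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(x, y_pos)
axes[0].set_title('Positive relationship')
axes[1].scatter(x, y_none)
axes[1].set_title('No relationship')
plt.show()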
(2) Determine the least squares equation for these data. (Remember, you will first have to calculate r, SSy and SSx.) (2)
(3) Determine the standard error of estimate, Sy/x, given that n=7. (2)
Answer:
13.(b) (i) In studies dating back over 100 years, it's well established that regression
toward the mean occurs between the heights of fathers and the heights of their adult sons.
Indicate whether the following statements are true or false.
(1) Sons of tall fathers will tend to be shorter than their fathers. (1)
(2) Sons of short fathers will tend to be taller than the mean for all sons. (1)
(3) Every son of a tall father will be shorter than his father. (1)
(4) Taken as a group, adult sons are shorter than their fathers. (1)
(5) Fathers of tall sons will tend to be taller than their sons. (1)
(6) Fathers of short sons will tend to be taller than their sons but shorter than the mean
for all fathers. (1)
Answer:
Regression toward the mean is a statistical phenomenon that describes the tendency of
extreme values on one measurement to be closer to the average on a second measurement. In
the context of father-son heights, it means:
Extremely tall fathers tend to have sons who are tall, but slightly shorter than
themselves (regressing toward the average height of sons).
Extremely short fathers tend to have sons who are short, but slightly taller than
themselves (regressing toward the average height of sons).
(1) Sons of tall fathers will tend to be shorter than their fathers.
True. Tall fathers are at the extreme of the height distribution, and their sons' heights tend to regress toward the mean; so, on average, the sons are shorter than their fathers (although usually still taller than the average son).
(2) Sons of short fathers will tend to be taller than the mean for all sons.
False. Sons of short fathers will tend to be taller than their fathers (regressing up
towards the mean), but their height will likely still be below the overall mean height
for all sons. For example, if the average son is 175 cm, a very short father (e.g., 160
cm) might have a son who is 165 cm. This son is taller than his father, but still shorter
than the overall mean for all sons.
(3) Every son of a tall father will be shorter than his father.
False. Regression toward the mean describes a tendency or a statistical average effect.
It does not apply to every single individual case. It's possible for some sons of tall
fathers to be even taller than their fathers due to genetic variation or environmental
factors.
(4) Taken as a group, adult sons are shorter than their fathers.
False. Regression toward the mean does not imply a change in the population mean
over generations. The average height of adult sons is generally similar to, or in many
populations, even slightly taller than, their fathers due to improved nutrition and
health over time (a secular trend). The phenomenon describes the movement of
individual extreme values towards the mean, not a shift in the mean itself.
(5) Fathers of tall sons will tend to be taller than their sons.
False. This is the reverse application of regression toward the mean. If a son is
exceptionally tall (an extreme value), his father's height, while likely above average,
will tend to be closer to the average height of fathers. Therefore, the father will
typically be shorter than his exceptionally tall son.
(6) Fathers of short sons will tend to be taller than their sons but shorter than the mean
for all fathers.
• True.
"taller than their sons": If a son is exceptionally short, his father's height (if above the
son's height) will regress towards the average, meaning the father will likely be taller than his
very short son.
"shorter than the mean for all fathers": Since the son is short, it's likely the father is
also on the shorter side, and his height (regressing from the son's extreme short height) will
tend to be closer to the mean of all fathers, but still below it, consistent with being the father
of a short son.
1. Range:
o The value of r2 always falls between 0 and 1 (inclusive).
o It's the square of the Pearson correlation coefficient (r), so it's always non-
negative.
o In simpler terms, it tells you how much of the variability you observe in Y is
accounted for by the variations in X.
o r2=0: This indicates that the independent variable(s) (X) explains none of the
variance in the dependent variable (Y). There is no linear relationship between
X and Y, and the model does not improve prediction over simply using the
mean of Y.
o r2=1: This indicates that the independent variable(s) (X) explains all of the
variance in the dependent variable (Y). There is a perfect linear relationship,
meaning all data points fall exactly on the regression line. This is rare in real-
world data outside of deterministic relationships.
4. "Goodness of Fit":
14. (a) Imagine you have a series of data that represents the amount of precipitation each day for a year in a given city. Load the daily rainfall statistics for the city of Chennai in 2021, given in a csv file Chennairainfall2021.csv, using Pandas; generate a histogram for rainy days, and find out the days that have high rainfall.
Answer:
To perform the requested analysis, the following steps using Python with the Pandas
and Matplotlib libraries are required:
Load the Data:
Use Pandas to load the Chennai Rainfall 2021.csv file into a DataFrame. Assume the CSV
contains a column named 'Rainfall' representing daily precipitation and a 'Date' column for
the corresponding date.
Generate a Histogram:
Create a histogram of the 'Rainfall' column to visualize the distribution of daily rainfall
amounts. This will help understand the frequency of different rainfall levels.
Identify High Rainfall Days:
Define a threshold to determine what constitutes "high rainfall." Filter the DataFrame to
display only the days where the rainfall exceeds this defined threshold.
import pandas as pd
import matplotlib.pyplot as plt

# Load the data (the CSV is assumed to have 'Date' and 'Rainfall' columns)
try:
    df = pd.read_csv('Chennai Rainfall 2021.csv')
except FileNotFoundError:
    print("Error: 'Chennai Rainfall 2021.csv' not found. Please ensure the file is in the correct directory.")
    exit()

df['Date'] = pd.to_datetime(df['Date'])

# Histogram of daily rainfall amounts
plt.hist(df['Rainfall'], bins=30, edgecolor='black', alpha=0.7)
plt.title('Distribution of Daily Rainfall in Chennai (2021)')
plt.xlabel('Rainfall (mm)')
plt.ylabel('Frequency of Days')
plt.grid(axis='y', alpha=0.75)
plt.show()

# Days with high rainfall (threshold of 50 mm assumed)
high_rainfall_threshold = 50
high_rainfall_days = df[df['Rainfall'] > high_rainfall_threshold]
print(high_rainfall_days[['Date', 'Rainfall']])
(Or)
(b) Consider that an E-Commerce organization like Amazon has sales for different regions in NorthSales, SouthSales, WestSales and EastSales .csv files. They want to combine the North and West region sales and the South and East region sales to find the aggregate sales of these collaborating regions. Help them to do so using Python code.
Answer:
To help the E-Commerce organization combine their regional sales data, I'll provide
Python code using the Pandas library. Since I don't have access to your specific .csv files, I
will demonstrate this by creating sample DataFrames that mimic the structure of your
NorthSales.csv, SouthSales.csv, WestSales.csv, and EastSales.csv files.
1. Create Sample Data: Generate sample sales data for North, South, West, and East
regions.
2. Load Data (Simulated): Treat these sample DataFrames as if they were loaded
from .csv files.
3. Combine North and West Sales: Merge the sales data from the North and West
regions.
4. Combine South and East Sales: Merge the sales data from the South and East
regions.
5. Calculate Aggregate Sales: Sum the sales for each of the newly combined regions.
6. Display Results: Print the combined sales dataframes and their total aggregate sales.
Python
import pandas as pd

# --- Step 1: Create Sample DataFrames (simulating loading from CSV files) ---
# In a real scenario, you would replace these with:
# north_sales_df = pd.read_csv('NorthSales.csv')
# south_sales_df = pd.read_csv('SouthSales.csv')
# west_sales_df = pd.read_csv('WestSales.csv')
# east_sales_df = pd.read_csv('EastSales.csv')
north_sales_df = pd.DataFrame({'Date': pd.to_datetime(['2023-01-01', '2023-01-02']), 'Sales': [1200, 1500]})
south_sales_df = pd.DataFrame({'Date': pd.to_datetime(['2023-01-01', '2023-01-02']), 'Sales': [900, 1100]})
west_sales_df = pd.DataFrame({'Date': pd.to_datetime(['2023-01-01', '2023-01-02']), 'Sales': [800, 950]})
east_sales_df = pd.DataFrame({'Date': pd.to_datetime(['2023-01-01', '2023-01-02']), 'Sales': [1000, 1300]})

# --- Steps 2 & 3: Combine North + West and South + East sales ---
north_west_sales_df = pd.concat([north_sales_df, west_sales_df], ignore_index=True)
south_east_sales_df = pd.concat([south_sales_df, east_sales_df], ignore_index=True)

# --- Step 4: Find Aggregate Sales for each combined region ---
aggregate_north_west_sales = north_west_sales_df['Sales'].sum()
aggregate_south_east_sales = south_east_sales_df['Sales'].sum()
print("Aggregate North + West sales:", aggregate_north_west_sales)
print("Aggregate South + East sales:", aggregate_south_east_sales)
Explanation:
1. import pandas as pd: This line imports the Pandas library, which is essential for
working with DataFrames.
2. Sample Data Creation: I've created four pd.DataFrame objects (north_sales_df,
south_sales_df, west_sales_df, east_sales_df) using dictionaries. Each dictionary
contains a 'Date' column (converted to datetime objects for proper handling) and a
'Sales' column. In your actual scenario, you would replace these sections with
pd.read_csv('YourFileName.csv').
3. pd.concat([df1, df2], ignore_index=True): This is the core function used for
combining the sales data.
o [df1, df2] is a list of the DataFrames you want to combine.
15. (a) How text and image annotations are done using Python? Give an example of
your own with appropriate Python code.
Text and image annotations in Python involve adding information or labels to text
data or visual elements within images. This is commonly done for tasks such as data labeling
for machine learning, creating visual guides, or adding metadata.
"""
Args:
position (tuple): (x, y) coordinates for the top-left corner of the text.
font_path (str, optional): Path to a TrueType font file (.ttf). Defaults to None (uses
default Pillow font).
fill_color (tuple, optional): RGB tuple for the text color. Defaults to red (255, 0, 0).
"""
try:
img = Image.open(image_path).convert("RGB")
draw = ImageDraw.Draw(img)
if font_path:
try:
font = ImageFont.truetype(font_path, font_size)
except IOError:
font = ImageFont.load_default()
else:
font = ImageFont.load_default()
img.save("annotated_image.jpg")
except FileNotFoundError:
except Exception as e:
# Example Usage:
try:
dummy_img.save('dummy_image.jpg')
except Exception as e:
annotate_image_with_text(
image_path="dummy_image.jpg",
text="Hello, Annotation!",
position=(50, 50),
font_size=40,
fill_color=(0, 0, 255) # Blue color
# annotate_image_with_text(
# image_path="dummy_image.jpg",
# position=(50, 150),
# font_size=30,
#)
Explanation:
Import necessary modules: Image, ImageDraw, and ImageFont from PIL.
Open the image: Image.open(image_path).convert("RGB") loads the image and
converts it to RGB mode for consistent color handling.
Create a drawing object: ImageDraw.Draw(img) creates an object that allows
drawing operations on the image.
Load the font: ImageFont.truetype() loads a custom TrueType font if font_path is
provided, otherwise ImageFont.load_default() is used.
Add text: draw.text(position, text, fill=fill_color, font=font) draws the
specified text at the position with the given fill_color and font.
Save the annotated image: img.save("annotated_image.jpg") saves the modified
image.
Text Annotation (e.g., for NLP):
For text annotation in Natural Language Processing (NLP), you typically use libraries like
spaCy or NLTK to identify and label entities, parts of speech, or other linguistic features
within a text. While this doesn't involve visual annotation, it's a crucial form of "annotation."
import spacy

def annotate_text_entities(text):
    """
    Annotate named entities in the given text using spaCy.

    Returns:
        list: A list of dictionaries, one per entity found.
    """
    try:
        # Load a small pre-trained English model
        nlp = spacy.load("en_core_web_sm")
        doc = nlp(text)
        entities = []
        for ent in doc.ents:
            entities.append({
                "text": ent.text,
                "start_char": ent.start_char,
                "end_char": ent.end_char,
                "label": ent.label_
            })
        return entities
    except OSError:
        print("Error: the spaCy model 'en_core_web_sm' is not installed.")
        return []

# Example Usage:
sample_text = ("Apple Inc. was founded by Steve Jobs and Steve Wozniak in Cupertino, "
               "California.")
annotated_entities = annotate_text_entities(sample_text)
print(annotated_entities)
Explanation:
Import spaCy:
Imports the necessary library.
Load a spaCy model:
spacy.load("en_core_web_sm") loads a pre-trained English language model for entity
recognition.
Process the text:
nlp(text) processes the input text, applying various NLP tasks including named entity
recognition.
Extract entities:
doc.ents provides access to the identified named entities, each with properties
like text, start_char, end_char, and label.
Format and return:
The function formats the extracted entities into a list of dictionaries for easier use.
(Or)
15. (b) Appraise the following (i) Histograms (ii) Binnings (iii) Density with appropriate
Python code.
(i) Histogram
A histogram is a graphical representation of the distribution of numerical data. It divides the
range of values in a continuous variable into a series of intervals (bins) and displays the count
or frequency of data points falling into each bin as bars. The x-axis represents the bins, and
the y-axis represents the frequency or density.
Python
import matplotlib.pyplot as plt
import numpy as np

# Sample data: 1000 normally distributed values (illustrative)
data = np.random.randn(1000)

# Create a histogram
plt.hist(data, bins=30, edgecolor='black', alpha=0.7)
plt.title('Histogram of Sample Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
(ii) Binning
Binning, also known as bucketing, is the process of grouping continuous data into a set of
discrete intervals or "bins." This is a fundamental step in creating histograms. The choice of
bin size and number significantly impacts the appearance and interpretation of the histogram,
with too few bins potentially over-smoothing the distribution and too many bins introducing
noise.
Python
import numpy as np
# Sample data
data = np.array([1.2, 2.3, 3.3, 3.1, 1.7, 3.4, 2.1, 1.25, 1.3])
# Group the values into 3 equal-width bins
counts, bin_edges = np.histogram(data, bins=3)
print("Bin edges:", bin_edges, "Counts per bin:", counts)
(iii) Density
Density, in the context of histograms and data visualization, refers to the probability
density function (PDF) of a continuous variable. When a histogram is normalized to represent
density, the area of each bar corresponds to the proportion of data points within that bin, and
the total area under the histogram sums to 1. Density plots, often generated using Kernel
Density Estimation (KDE), provide a smooth representation of the data's underlying
distribution.
Python
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Sample data and a smoothed KDE density curve
data = np.random.randn(1000)
sns.kdeplot(data, fill=True)
plt.show()
Answer:
An exploratory data analysis (EDA) is a critical first step in any data science project.
It involves using visualization and summary statistics to understand the main characteristics
of a dataset, uncover patterns, detect outliers, and check assumptions.
1. Initial Data Inspection: Loading and summarizing the data to understand its
structure.
2. Univariate Analysis: Plots showing the distribution of each individual numerical
attribute (Age, Year of Operation, Axillary Nodes).
3. Bivariate Analysis: Plots comparing each numerical attribute against the Survival
Status to see how they relate to the outcome.
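A minimal sketch of the plots described above (the file name and the column names Age, Year_of_Operation, Axillary_Nodes and Survival_Status are assumptions based on the dataset description):

Python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file and column names for the survival dataset
cols = ['Age', 'Year_of_Operation', 'Axillary_Nodes', 'Survival_Status']
df = pd.read_csv('haberman.csv', names=cols)
print(df.describe())

# Univariate distributions
df[['Age', 'Year_of_Operation', 'Axillary_Nodes']].hist(bins=20)
plt.show()

# Bivariate analysis: each attribute against survival status
for col in ['Age', 'Year_of_Operation', 'Axillary_Nodes']:
    sns.boxplot(x='Survival_Status', y=col, data=df)
    plt.show()

# Pairwise relationships coloured by survival status
sns.pairplot(df, hue='Survival_Status')
plt.show()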
Based on the generated plots, here is a summary of the key insights into the dataset:
1. Univariate Distributions:
Age: The age distribution appears to be roughly bell-shaped (normal), with most
patients falling in the 40-60 year range.
Year of Operation: The operations are spread fairly evenly across the years of the
study (1958-1970).
Axillary Nodes: The distribution of axillary nodes is heavily skewed to the right. A
large majority of patients have a very low number of positive nodes, with a long tail
extending to higher values.
2. Bivariate Analysis:
Age vs. Survival: The boxplots for Age show that the median age for patients who
survived is very similar to those who did not. While there's a slight difference in the
median, the distributions largely overlap, suggesting that age, by itself, may not be a
strong predictor of survival.
Year of Operation vs. Survival: The distributions for Year of Operation are also
very similar for both groups. This suggests that the year the surgery was performed,
within the study's timeframe, is not a major factor in predicting survival.
Axillary Nodes vs. Survival: The most significant difference is seen in the Axillary
Nodes plot. The boxplot for patients who died shows a slightly higher median number
of positive nodes and a larger interquartile range compared to those who survived.
This indicates that patients with a higher number of positive axillary nodes are more
likely to have a lower survival rate. This feature appears to be the most indicative of
survival status among the three.
The pairplot confirms the findings from the bivariate analysis. The scatter plots
comparing Age, Year of Operation, and Axillary Nodes show no strong linear
correlations between the numerical features themselves.
Crucially, when the points are colored by Survival Status, the axillary_nodes plots
stand out. The non-survivor patients (class 2) tend to be clustered in the areas with a
slightly higher number of positive nodes, particularly when plotted against age.
In conclusion, this exploratory data analysis suggests that the number of positive axillary
nodes is likely the most important feature for predicting survival status in this dataset, while
Age and Year of Operation seem to have less predictive power on their own.
(Or)
16. (b) Assume that an r of −.80 describes the strong negative relationship between years of heavy smoking (X) and life expectancy (Y).
Assume, furthermore, that the distributions of heavy smoking and life expectancy each have the following means and sums of squares: X̄ = 5, Ȳ = 60, SSx = 35, SSy = 70.
(i) Determine the least squares regression equation for predicting life expectancy from years of heavy smoking. (3)
(ii) Determine the standard error of estimate, Sy/x, assuming that the correlation of −.80
(iii) Supply a rough interpretation of Sy/x.(3)
(iv) Predict the life expectancy for John, who has smoked heavily for 8 years. (3)
(v) Predict the life expectancy for Katie, who has never smoked heavily.
Answer:
First, let's clarify the given values. The problem states a "strong negative relationship"
with a value of "80". In standard statistical notation, this implies a correlation coefficient ( r)
of -0.80. The remaining values are:
Based on the information provided, let's solve each part of the problem.
(i) Determine the least squares regression equation for predicting life expectancy from years of heavy smoking.
Slope: b = r √(SSy/SSx) = (−.80) √(70/35) = (−.80)(1.41) ≈ −1.13
Intercept: a = Ȳ − bX̄ = 60 − (−1.13)(5) ≈ 65.66
Least squares equation: Y′ = 65.66 − 1.13X
(ii) Standard error of estimate, with n = 50:
Sy/x = √[SSy(1 − r²)/(n − 2)] = √[70(1 − .64)/48] = √0.525 ≈ 0.72
(iii) Rough interpretation: predictions of life expectancy from years of heavy smoking are typically off by about 0.72 years; roughly 68% of actual life expectancies fall within about ±0.72 years of the values predicted by the equation.
(iv) For John, who has smoked heavily for 8 years: Y′ = 65.66 − 1.13(8) ≈ 56.6 years.
(v) For Katie, who has never smoked heavily (X = 0): Y′ = 65.66 − 1.13(0) ≈ 65.7 years.
B.E/B.Tech. DEGREE EXAMINATIONS, APRIL/MAY 2023
11 (a).Elaborate about the steps in the data science process with a diagram. (13 marks)
The data science process is a structured approach for extracting knowledge and
insights from data. It generally involves six key steps: Problem Framing, Data Collection,
Data Preparation, Exploratory Data Analysis, Model Building, and Communication &
Deployment. Each step is crucial for ensuring that the final analysis is accurate, relevant, and
actionable.
1. Problem Framing:
This initial step involves clearly defining the business or research problem that the
data science project aims to address. Understanding the objectives and context is vital for
guiding the entire process. A well-defined problem statement sets the direction for the
project and helps in identifying the right data sources and analytical approaches.
2. Data Collection:
Once the problem is defined, the next step is to gather the necessary data. This
involves identifying and accessing relevant data from various sources, both internal and
external. This could include databases, spreadsheets, APIs, or even external datasets. Data
collection may also involve data wrangling and cleaning.
3. Data Preparation:
This crucial step involves cleaning, transforming, and preparing the data for
analysis. It often includes handling missing values, outliers, and inconsistencies in the
data. Data preparation ensures the data is in a suitable format for modeling and analysis.
4. Exploratory Data Analysis (EDA):
EDA involves exploring the data to gain a deeper understanding of its
characteristics, patterns, and relationships. Techniques like data visualization, statistical
analysis, and summary statistics are used to identify trends, outliers, and potential
insights. EDA helps in identifying the most relevant variables and features for modeling.
5. Model Building:
This step involves selecting and training appropriate machine learning models to
solve the defined problem. Different algorithms can be used depending on the nature of the
problem (e.g., classification, regression, clustering). Model building also includes
evaluating the model's performance and optimizing it for better results.
6. Communication and Deployment:
The final step involves communicating the findings and insights from the analysis,
often through visualizations, reports, or dashboards. If the analysis is to be used in a
business context, the model may be deployed for real-time predictions or automated
decision-making. Continuous monitoring and maintenance of the deployed model are also
crucial.
(Or)
11. (b).What is a data warehouse? Outline the architecture of a data warehouse with a
diagram. (13 marks)
Data warehousing is the process of constructing and using a data warehouse. A data
warehouse is constructed by integrating data from multiple heterogeneous sources that
support analytical reporting, structured and/or ad hoc queries, and decision making. Data
warehousing involves data cleaning, data integration, and data consolidations.
Data Warehouse
Although a data warehouse and a traditional database share some similarities, they
need not be the same idea. The main difference is that in a database, data is collected for
multiple transactional purposes. However, in a data warehouse, data is collected on an
extensive scale to perform analytics. Databases provide real-time data, while warehouses
store data to be accessed for big analytical queries.
A data warehouse typically follows a three-tier architecture.
Bottom Tier
The bottom tier or data warehouse server usually represents a relational database
system. Back-end tools are used to cleanse, transform and feed data into this layer.
Middle Tier
The middle tier represents an OLAP server that can be implemented in two ways. The
ROLAP or Relational OLAP model is an extended relational database management system
that maps multidimensional data process to standard relational process. The MOLAP or
multidimensional OLAP directly acts on multidimensional data and operations.
Top Tier
This is the front-end client interface that gets data out from the data warehouse. It
holds various tools like query tools, analysis tools, reporting tools, and data mining tools.
Data Warehousing integrates data and information collected from various sources into
one comprehensive database. For example, a data warehouse might combine customer information from an organization's point-of-sale systems, its mailing lists, website, and comment cards. It might also incorporate confidential information about employees, salary information, etc. Businesses use such components of a data warehouse to analyze customers.
Data mining is one of the features of a data warehouse that involves looking for
meaningful data patterns in vast volumes of data and devising innovative strategies for
increased sales and profits.
Types of Data Warehouse
There are three main types of data warehouse.
Enterprise Data Warehouse (EDW)
This type of warehouse serves as a key or central database that facilitates decision-
support services throughout the enterprise. The advantage to this type of warehouse is that it
provides access to cross-organizational information, offers a unified approach to data
representation, and allows running complex queries.
Operational Data Store (ODS)
This type of data warehouse refreshes in real-time. It is often preferred for routine
activities like storing employee records. It is required when data warehouse systems do not
support reporting needs of the business.
Data Mart
A data mart is a subset of a data warehouse built to maintain a particular department,
region, or business unit. Every department of a business has a central repository or data mart
to store data. The data from the data mart is stored in the ODS periodically. The ODS then
sends the data to the EDW, where it is stored and used.
12. (a). (i). What is a frequency distribution? Customers who have purchased a
particular product rated the usability of the product on a 10-point scale, ranging from
1(poor) to 10(excellent) as follows:
3 7 2 7 8
3 1 4 10 3
2 5 5 8
2 7 3 6 7
8 9 7 3 6
Construct a frequency distribution for the above data.
Solution:
All values:
3, 7, 2, 7, 8, 3, 1, 4, 10, 3, 2, 5, 5, 8, 2, 7, 3, 6, 7, 8, 9, 7, 3, 6
Frequency distribution (rating : frequency):
10 : 1
9 : 1
8 : 3
7 : 5
6 : 2
5 : 2
4 : 1
3 : 5
2 : 3
1 : 1
Total : 24
A relative frequency distribution shows the proportion of total observations that fall
within each class interval. It is calculated by:
1. Identify the total frequency, which is the sum of all class frequencies (already given
as 200).
2. For each class, divide its frequency by the total frequency.
3. Round off the relative frequency to two decimal places.
(Or)
Answer:
A Z-score is a statistical measure that tells you how many standard deviations a data
point is from the mean.
Formula: Z = (X − μ)/σ
Where:
X = individual score
μ = mean
σ = standard deviation
13 (a).Calculate the correlation coefficient for the heights (in inches) of fathers (x) and their
sons (y) with the data presented below:
Answer:
x y x² y² xy
66 68 4356 4624 4488
68 70 4624 4900 4760
68 69 4624 4761 4692
70 72 4900 5184 5040
71 72 5041 5184 5112
72 72 5184 5184 5184
72 74 5184 5476 5328
Totals: Σx = 487, Σy = 497, Σx² = 33,913, Σy² = 35,313, Σxy = 34,604 (n = 7)
r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}
= [7(34,604) − (487)(497)] / √{[7(33,913) − 487²][7(35,313) − 497²]}
= (242,228 − 242,039) / √(222 × 182)
= 189 / √40,404 ≈ 189 / 201.0 ≈ 0.94
The correlation coefficient between the heights of fathers and their sons is approximately 0.94, indicating a very strong positive linear relationship.
(Or)
Answer:
Definition:
Aggregate functions perform a calculation on a set of values and return a single result. Common NumPy aggregate functions include np.sum() (total), np.mean() (average), np.min() and np.max() (extremes), np.std() (standard deviation) and np.var() (variance).
Example:
import numpy as np
arr = np.array([1, 2, 3, 4])
print(np.sum(arr))   # Output: 10
print(np.mean(arr))  # Output: 2.5
print(np.var(arr))   # Output: 1.25
(Or)
14 (b). (i) What is broadcasting? Explain the rules of broadcasting with an example. (7
marks)
Answer:
Broadcasting is the set of rules NumPy uses to perform element-wise arithmetic on arrays of different shapes by virtually stretching the smaller array (without copying its data) so that the shapes become compatible.
Rules of broadcasting:
1. If the arrays do not have the same number of dimensions, the shape of the smaller
array is padded with ones on its left side.
2. If the sizes of the dimensions do not match, the array with size 1 in that dimension
is stretched to match the other array.
3. If in any dimension the sizes are unequal and neither is 1, an error is raised.
Example:
Python
import numpy as np
a = np.array([1, 2, 3])   # Shape (3,)
b = 5                     # Scalar
print(a + b)              # [6 7 8]
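To see rules 1 and 2 together, a second sketch with a two-dimensional array (illustrative values): the one-dimensional array is treated as shape (1, 3) and then stretched to (3, 3) before the element-wise addition.

Python
import numpy as np
a = np.array([1, 2, 3])  # shape (3,)
M = np.ones((3, 3))      # shape (3, 3)
print(M + a)             # a is broadcast across every row of M
# [[2. 3. 4.]
#  [2. 3. 4.]
#  [2. 3. 4.]]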
14.(b).(ii) Elaborate about the mapping between Python operators and Pandas methods.
(6 marks)
Answer:
Operator Equivalent Method
+ add()
- sub()
* mul()
/ div()
** pow()
Example:
Python
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})
print(df1.add(df2))   # same as df1 + df2
15 (a). Explain various visualization charts like line plots, scatter plots, and histograms
using Matplotlib with examples. (13 marks)
Answer:
1. Line Plot: A line plot connects successive data points with straight lines and is typically used to show a trend over a continuous variable.
Python
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [2, 4, 1])
plt.show()
2. Scatter Plot: A scatter plot marks individual (x, y) points and is used to examine the relationship between two variables.
Python
plt.scatter([1, 2, 3], [4, 5, 6])
plt.show()
3. Histogram: A histogram groups values into bins and shows how many values fall into each bin.
Python
plt.hist([1, 1, 2, 3, 3, 4, 4, 4])
plt.show()
(Or)
15 (b). Outline any two three-dimensional plotting in Matplotlib with an example. (13
marks)
Answer:
Use mpl_toolkits.mplot3d:
3D Scatter Plot:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
x = np.random.rand(100)
y = np.random.rand(100)
z = np.random.rand(100)
ax.scatter(x, y, z)
plt.show()
3D Surface Plot:
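A minimal surface-plot sketch (the surface z = sin(√(x² + y²)) is chosen purely for illustration):

Python
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
x = np.linspace(-5, 5, 50)
y = np.linspace(-5, 5, 50)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))
ax.plot_surface(X, Y, Z, cmap='viridis')
plt.show()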
PART – C (1 × 15 = 15)
16. (a).(i) What is mode? Can there be distributions with no mode or more than
one mode? The owner of a new car conducts six gas mileage tests and obtains the
following results, expressed in miles per gallon: 26.3, 28.7, 27.4, 26.9, 27.4, 26.9.
Find the mode for these data.
Answer:
Concepts:
Mode, Statistics, Data analysis
Explanation:
The mode of a set of data is the value that appears most frequently. A distribution can
have no mode if all values occur with the same frequency, or it can have more than one mode
if multiple values occur with the highest frequency. In this case, we will find the mode of the
given gas mileage test results.
Step 1
List the gas mileage results: 26.3, 28.7, 27.4, 26.9, 27.4, 26.9.
Step 2
Count the frequency of each result: 26.3 appears 1 time, 28.7 appears 1 time, 27.4 appears 2 times, and 26.9 appears 2 times.
Step 3
Identify the highest frequency: The highest frequency is 2, which corresponds to the
values 27.4 and 26.9.
Step 4
Since both 27.4 and 26.9 appear most frequently, the data set is bimodal with modes
27.4 and 26.9.
Final Answer:
The modes are 27.4 and 26.9.
16.(a).(ii) What is mode? Can there be a distribution with no mode or more than one mode? The owner of a new car conducts six gas mileage tests and obtains the following results, expressed in miles per gallon: 26.3, 28.7, 27.4, 26.6, 27.4, 26.9. Find the mode for these data.
Answer:
Concepts:
Median, Mode
Explanation:
The median is the middle value in a list of numbers. To find the median, you need to arrange
the numbers in ascending order and then find the middle value. If there is an even number of
observations, the median is the average of the two middle numbers. The mode is the value
that appears most frequently in a data set. A distribution can have no mode, one mode, or
more than one mode.
Step 2
Since there are 5 scores (an odd number), the median is the middle score: 6.
Step 3
Arrange the second set of six scores 3,8,9,3,1,8 in ascending order: 1,3,3,8,8,9.
Step 4
Since there are 6 scores (an even number), the median is the average of the two
middle scores: (3+8)/2=5.5.
Step 5
The median for the first set of scores is 6.
Step 6
The median for the second set of scores is 5.5.
Step 7
To find the mode of the gas mileage data 26.3,28.7,27.4,26.6,27.4,26.9, identify the
value(s) that appear most frequently.
Step 8
The value 27.4 appears twice, while all other values appear only once.
Step 9
Therefore, the mode for the gas mileage data is 27.4.
Final Answer:
The median for the first set of scores is 6. The median for the second set of scores is
5.5. The mode for the gas mileage data is 27.4.
1. How missing values present in a dataset are treated during data analysis phase?
Missing values in a dataset during the data analysis phase are typically handled
through various techniques such as imputation (replacing missing values with estimated
values like mean, median, mode, or using more advanced methods like K-Nearest Neighbors
imputation or regression imputation), deletion (removing rows or columns with missing
values), or using models that can inherently handle missing data.
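A minimal pandas sketch of imputation and deletion (the DataFrame and column names below are illustrative assumptions):

Python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 40, 35], 'city': ['A', 'B', None, 'A']})
df['age'] = df['age'].fillna(df['age'].median())       # impute a numeric column with its median
df['city'] = df['city'].fillna(df['city'].mode()[0])   # impute a categorical column with its mode
df_dropped = df.dropna()                               # alternatively, drop rows with missing values
print(df)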
2. Identify and write down various data analytic challenges faced in the conventional
system.
Conventional data analytic systems often face challenges such as data heterogeneity
(dealing with diverse data types and sources), data volume and velocity (handling large and
rapidly changing datasets), data quality issues (inconsistencies, errors, missing data), limited
scalability, and lack of real-time processing capabilities.
3. Can treating categorical variables as continuous variables result in a better predictive model?
No, treating categorical variables as continuous variables will not result in a better predictive model. This approach can lead to several problems: it imposes an artificial numeric ordering and spacing on the categories, so the model treats arbitrary codes as meaningful magnitudes, which distorts the relationships in the data and usually degrades predictions. Categorical variables should instead be encoded appropriately (for example, using one-hot encoding).
4. Issue: Feeding data which has variables correlated to one another is not a good
statistical practice, since we are providing multiple weightage to the same type of data.
Correlation analysis helps identify and quantify the relationships between variables.
By identifying highly correlated variables, one can choose to remove redundant variables
(e.g., keeping only one from a highly correlated pair) or use dimensionality reduction
techniques like Principal Component Analysis (PCA) to create new, uncorrelated features,
preventing multiple weightage to the same type of data. For instance, in a dataset with
"Height in cm" and "Height in inches," removing one prevents redundancy.
8. Using appropriate data visualization modules develop a python code snippet that
generates a simple sinusoidal wave in an empty gridded axes?
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("Simple Sinusoidal Wave")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.grid(True)
plt.show()
9. Write a python code snippet that generates a time-series graph representing COVID-
19 incidence cases for a particular week.
Day 1 Day 2 Day 3 Day 4 Day 5 Day 6 Day 7
7 18 9 44 2 5 89
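A minimal sketch that plots the seven daily counts above as a time-series line graph:

Python
import matplotlib.pyplot as plt

days = ['Day 1', 'Day 2', 'Day 3', 'Day 4', 'Day 5', 'Day 6', 'Day 7']
cases = [7, 18, 9, 44, 2, 5, 89]

plt.plot(days, cases, marker='o')
plt.title('COVID-19 Incidence Cases for a Week')
plt.xlabel('Day')
plt.ylabel('Number of Cases')
plt.grid(True)
plt.show()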
10. Write a python code snippet that draws a histogram for the following list of positive
numbers. 7 18 9 44 25 89 91 11 6 77 85 91 6 55
Python:
import matplotlib.pyplot as plt
# List of positive numbers
data = [7, 18, 9, 44, 25, 89, 91, 11, 6, 77, 85, 91, 6, 55]
# Create a histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=10, edgecolor='black', alpha=0.7)
# Set labels, title, and grid
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of Positive Numbers')
plt.grid(axis='y', alpha=0.75)
# Display the plot
plt.show()
PART – B (5 × 13 = 65)
11. (a) (i) Suppose there is a dataset having variables with missing values of more than
30%, how will you deal with such dataset? (6)
When dealing with a dataset where variables have more than 30% missing values, it's
crucial to address this issue effectively to avoid biased or inaccurate results. Here's how you
can approach it:
Imputation Techniques:
Mean/Median/Mode Imputation:
Replace missing values with the mean, median, or mode of the respective variable.
This is a simple method but can reduce variance and distort relationships.
Regression Imputation:
Predict missing values using a regression model based on other variables in the
dataset. This can preserve relationships better than simple imputation.
K-Nearest Neighbors (KNN) Imputation:
Impute missing values by finding the K-nearest neighbors to the observation with
missing data and using their values to estimate the missing ones.
Multiple Imputation:
Generate multiple plausible imputed datasets, analyze each, and combine the results.
This accounts for the uncertainty introduced by imputation and provides more robust
estimates.
Deletion Methods:
Row-wise Deletion (Listwise Deletion):
Remove entire rows containing any missing values. This is simple but can lead to
significant data loss if many rows have missing data, especially with over 30% missing
values.
Column-wise Deletion:
Remove entire columns (variables) that have a high percentage of missing values.
This is often necessary when a variable is largely incomplete and provides little information.
Advanced Techniques:
Machine Learning-based Imputation:
Use more sophisticated machine learning algorithms like Random Forest or deep learning
models to predict and impute missing values.
Domain Knowledge:
Consult domain experts to understand the reasons for missing data and potentially fill
in gaps based on their expertise.
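As an illustration of a machine-learning-based approach, a minimal scikit-learn sketch using KNNImputer (the array values are illustrative):

Python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [8.0, np.nan]])
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))   # missing entries are filled from the 2 nearest rows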
11.(a).(ii) List down the various feature selection methods for selecting the right
variables for building efficient predictive models. Explain about any two selection
methods. (7)
(Or)
11. (b) (i) Explain Data Analytic life cycle. Brief about Time-Series Analysis.
Data Analytic Life Cycle: The data analytic life cycle is a systematic process for
conducting data analysis projects. While different models exist, a common cycle includes the
following stages: discovery (framing the business problem), data preparation, model planning, model building, communicating the results, and operationalizing the model.
Time-Series Analysis: This is a specific type of data analysis that involves analyzing data points collected over a period of time. The main goal is to understand the underlying structure of the data and to forecast future values. Key components of a time series include:
Trend Component: The long-term upward or downward movement in the data.
Seasonal Component: Regular fluctuations that repeat over a fixed period (e.g., monthly or quarterly patterns).
Cyclical Component: Fluctuations that are not of a fixed period (e.g., business cycles).
Irregular (Random) Component: The unpredictable, residual variation that remains after the other components are accounted for.
11.(b).(ii) Outline the purpose of data cleansing. How missing and nullified data
attributes are handled and modified during preprocessing stage? (7)
Purpose of Data Cleansing: Data cleansing, also known as data preprocessing or data
scrubbing, is the process of detecting and correcting (or removing) corrupt or inaccurate
records from a dataset. Its primary purpose is to improve data quality so that analysis and
modeling can be performed effectively and reliably. Without proper data cleansing, a
"garbage in, garbage out" scenario occurs, where flawed data leads to flawed insights and
poor model performance.
Handling Missing and Nullified Data: During the preprocessing stage, missing and
nullified data attributes are handled using several methods:
Imputation: As mentioned earlier, this involves filling in the missing values. The
choice of imputation method (mean, median, mode, regression, etc.) depends on the
nature of the data and the extent of the missing values.
Deletion: Removing rows (listwise deletion) or entire columns that contain too many missing or nullified values, when the resulting loss of information is acceptable.
Flagging: Creating a new binary variable (e.g., is_age_missing) to indicate that the
original age value was missing. This can sometimes be useful if the fact that a value is
missing is itself a piece of information.
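A small pandas sketch of the flagging approach (the column name is assumed for illustration):

Python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 40]})
df['is_age_missing'] = df['age'].isna().astype(int)   # 1 where the original age value was missing
print(df)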
12. (a) (i) Indicate whether each of the following distributions is positively or negatively
skewed. The distribution of
(1) Incomes of tax payers have a mean of $48,000 and a median of $43,000. (3)
(2) GPAs for all students at some college have a mean of 3.01 and a median of 3.20. (3)
Answer:
A distribution is positively skewed when the mean is greater than the median. Since $48,000 > $43,000, this distribution is positively skewed.
GPAs for all students at some college have a mean of 3.01 and a median of 3.20.
A distribution is negatively skewed when the mean is less than the median. Since 3.01 < 3.20, this distribution is negatively skewed.
12.(a).(ii) During their first swim through a water maze, 15 laboratory rats made the
following number of errors (blind alleyway entrances): 2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12,
10, 4, 3.
(1) Find the mode, median, and mean for these data. (3)
(2) Without constructing a frequency distribution or graph, would it be possible to characterize the shape of this distribution as balanced, positively skewed, or negatively skewed? (4)
Answer:
Mode: The mode is the most frequently occurring value. In the sorted data (2, 2, 3, 3, 4, 5, 5, 5, 6, 7, 8, 10, 12, 17, 28), the value 5 occurs three times, so the mode is 5.
Median: The median is the middle value of the sorted data. With 15 data points, the 8th value is the median, which is 5.
Mean: The mean is the sum of all values divided by the number of values. The sum is
117, and there are 15 values, so the mean is 117/15 = 7.8.
Yes, it is possible. The shape can be characterized by comparing the mean and the
median.
The mean (7.8) is greater than the median (5), which indicates a
positively skewed distribution. The presence of larger values like 17 and 28 pull the
mean to the right.
(Or)
12.(b) (i) Assume that SAT math scores approximate a normal curve with a mean of 500
and a standard deviation of 100. Sketch a normal curve and shade in the target area(s)
described by each of the following statements:
Answer:
Sketch a normal curve and shade in the target area(s) described by each of the following
statements:
More than 570: The shaded area should be to the right of the value 570 on the curve.
Less than 515: The shaded area should be to the left of the value 515 on the curve.
Between 520 and 540: The shaded area should be between the values 520 and 540 on
the curve.
Convert to z scores and find the target areas for the values above.
More than 570:
o Z = (570 − 500)/100 = 0.70
o From the standard normal table, the area to the left of Z = 0.70 is approximately 0.7580, so the area to the right (more than 570) is 1 − 0.7580 = 0.2420.
Less than 515:
o Z = (515 − 500)/100 = 0.15
o The area to the left of Z = 0.15 is approximately 0.5596, so the proportion scoring less than 515 is 0.5596.
Between 520 and 540:
o Z₁ = (520 − 500)/100 = 0.20 and Z₂ = (540 − 500)/100 = 0.40
o The areas to the left are approximately 0.5793 and 0.6554, so the area between 520 and 540 is 0.6554 − 0.5793 = 0.0761.
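These table look-ups can be verified quickly with SciPy (a small sketch):

Python
from scipy.stats import norm

print(1 - norm.cdf(570, loc=500, scale=100))                # P(X > 570) ≈ 0.2420
print(norm.cdf(515, loc=500, scale=100))                    # P(X < 515) ≈ 0.5596
print(norm.cdf(540, 500, 100) - norm.cdf(520, 500, 100))    # P(520 < X < 540) ≈ 0.0761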
12.(b).(ii) Assume that the burning times of electric light bulbs approximate a normal
curve with a mean of 1200 hours and a standard deviation of 120 hours. If a large
number of new lights are installed at the same time (possibly along a newly opened
freeway), at what time will
1 percent fails?
50 percent fail?
95 percent fail?
Answer:
1 percent fails?
o We need to find the z-score corresponding to the bottom 1% (0.01 area to
the left).
o X=µ+Zσ=1200+(−2.33)(120)=1200−279.6=920.4 hours.
50 percent fail?
o The z-score is 0.
o X=µ=1200 hours.
95 percent fail?
o We need to find the z-score corresponding to the bottom 95% (0.95 area to
the left).
o X=µ+Zσ=1200+(1.645)(120)=1200+197.4=1397.4 hours.
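A quick SciPy check of these failure times (a sketch; the exact z for 1% is −2.326, so the first value differs slightly from the rounded z = −2.33 used above):

Python
from scipy.stats import norm

print(norm.ppf(0.01, loc=1200, scale=120))   # ≈ 920.8 hours (1% have failed)
print(norm.ppf(0.50, loc=1200, scale=120))   # 1200.0 hours (50% have failed)
print(norm.ppf(0.95, loc=1200, scale=120))   # ≈ 1397.4 hours (95% have failed)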
13. (a) (i) In statistics, highlight the impact when the goodness-of-fit test score is low.
A low goodness-of-fit score in a statistical test indicates that the observed data
significantly deviates from the expected data based on the model. This suggests that the
model is a poor fit for the data, and the results of any analysis based on that model may be
unreliable.
Goodness-of-fit tests, like the chi-square test, assess how well observed data matches
an expected distribution.
A high goodness-of-fit measure (for example, a high p-value in a chi-square test) indicates a close match, suggesting the model accurately represents the data.
A low measure (a low p-value) indicates a poor fit, implying the model is not accurately capturing the observed data.
Impact of a Low Goodness-of-Fit Score:
Inaccurate Predictions:
If the model doesn't fit the data well, predictions made using it are likely to be
inaccurate and unreliable.
Misleading Conclusions:
Any conclusions drawn from the model's results may be flawed due to the poor fit,
potentially leading to incorrect interpretations of the data.
Need for Model Refinement:
A low goodness-of-fit score signals the need to revise or refine the model,
potentially by including additional variables, changing the model's structure, or
choosing a different model altogether.
Invalid Hypothesis Tests:
If the model is used in hypothesis testing, a low goodness-of-fit score may
invalidate the test results, as the assumptions underlying the test are not met.
Potential for Bias:
If the model is used to make predictions or classifications, a poor fit can introduce
bias into the results.
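As a brief illustration, a chi-square goodness-of-fit test can be run with scipy; the observed and expected counts below are assumed for illustration only:
from scipy.stats import chisquare
observed = [18, 22, 20, 40]   # observed category counts
expected = [25, 25, 25, 25]   # counts expected under the model
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
# A small p-value signals a poor fit between the observed data and the model
print(f"chi-square = {stat:.2f}, p-value = {p_value:.4f}")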
13.(a).(ii) Given the following dataset of employee, Using regression analysis, find the
expected salary of an employee if the age is 45.
Age Salary
54 67000
42 43000
49 55000
57 71000
35 25000
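A short sketch of one way to answer this with numpy's least-squares fit (polyfit), using the five data points given above:
import numpy as np
age = np.array([54, 42, 49, 57, 35])
salary = np.array([67000, 43000, 55000, 71000, 25000])
# Fit salary = a + b * age by least squares; polyfit returns [slope, intercept]
b, a = np.polyfit(age, salary, 1)
print(f"Regression line: salary = {a:.1f} + {b:.1f} * age")
print(f"Expected salary at age 45: {a + b * 45:.0f}")
With these five points the fitted line works out to roughly salary ≈ −46614 + 2084.7 × age, giving an expected salary of about 47,197 for age 45.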
(Or)
13. (b) (i) Define autocorrelation and how is it calculated? What does the negative
correlation convey?
13. (b) (ii) What is the philosophy of Logistic regression? What kind of model it is?
What does logistic Regression predict? Tabulate the cardinal differences of Linear and
Logistic Regression.
(i) Initialize two dictionaries (D₁ and D₂) with key and value pairs.
(ii) Compare those two dictionaries with master key list 'M' and print the missing keys.
Answer (a minimal sketch; the dictionary contents and master key list below are assumed for illustration):
Python
D1 = {'A': 10, 'B': 20}    # (i) first dictionary
D2 = {'C': 30, 'D': 40}    # (i) second dictionary
D3 = {**D1, **D2}          # merge both dictionaries for a single lookup
# (ii) Compare against the master key list 'M' and print the missing keys
M = ['A', 'B', 'C', 'D', 'E', 'F']
missing_keys = [key for key in M if key not in D3]
print("Missing keys:", missing_keys)
(Or)
14.(b) (i) How to create hierarchical data from the existing data frame?
(ii) How to use group by with 2 columns in data set? Give a python code snippet.
Answer:
Steps:
1. Import pandas.
2. Create a sample DataFrame with the data you want to make hierarchical.
3. Set two (or more) columns as the index with set_index() to obtain a hierarchical (MultiIndex) DataFrame.
import pandas as pd
# Sample DataFrame (example values assumed for illustration)
data = {'City': ['New York', 'New York', 'London'], 'Year': [2021, 2022, 2021], 'Sales': [100, 120, 90]}
df = pd.DataFrame(data)
# Build the hierarchical index from the 'City' and 'Year' columns
hierarchical_df = df.set_index(['City', 'Year'])
print(hierarchical_df)
The groupby() method in pandas is used to group data based on one or more columns.
To group by two columns, you pass a list of the column names to the groupby() method.
For example, to find the average temperature for each city and month:
import pandas as pd
# Sample DataFrame (Month and Temperature values assumed for illustration)
data = {'City': ['New York', 'New York', 'London', 'London', 'New York'],
        'Month': ['Jan', 'Feb', 'Jan', 'Feb', 'Jan'], 'Temperature': [30, 32, 40, 42, 31]}
df = pd.DataFrame(data)
# Group by both 'City' and 'Month' and compute the average temperature per group
grouped_data = df.groupby(['City', 'Month'])['Temperature'].mean()
print(grouped_data)
15. (a) Write a code snippet that projects our globe as a 2-D flat surface (using
cylindrical project) and convey information about the location of any three major
Indian cities in the map (using scatter plot).
A Python code snippet to project a 2D cylindrical map of India and plot the locations
of three major cities using a scatter plot is shown below. The code utilizes
the matplotlib and numpy libraries. It defines the coordinates of Mumbai, Delhi, and
Chennai, converts them to cylindrical projection, and then plots them on a 2D plane.
import numpy as np
import matplotlib.pyplot as plt

# Approximate (latitude, longitude) in degrees for three major Indian cities
cities = {
    'Mumbai': (19.0760, 72.8777),
    'Delhi': (28.7041, 77.1025),
    'Chennai': (13.0827, 80.2707)
}

def to_radians(degrees):
    return np.radians(degrees)

def cylindrical_projection(latitude, longitude):
    # Simple cylindrical (equirectangular) projection: x = longitude, y = latitude
    x = longitude
    y = latitude
    return x, y

projected_cities = {}
for city, (lat, lon) in cities.items():
    lat_rad = to_radians(lat)
    lon_rad = to_radians(lon)
    x, y = cylindrical_projection(lat_rad, lon_rad)
    projected_cities[city] = (x, y)

# Plotting
plt.figure(figsize=(8, 6))
for city, (x, y) in projected_cities.items():
    plt.scatter(x, y, label=city)
plt.xlabel("Longitude")
plt.ylabel("Latitude (projected)")
plt.legend()
plt.grid(True)
plt.show()
(Or)
15.(b) (i) Write a working code that performs a simple Gaussian process regression
(GPR), using the Scikit-Learn API.
(ii) Briefly explain about visualization with Seaborn. Give an example working code
segment that represents a 2D kernel density plot for any data.
Answer:
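A minimal working sketch using the Scikit-Learn API; the training data below (noisy samples of a sine curve) is assumed for illustration:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
# Training data
rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 20).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + 0.1 * rng.standard_normal(20)
# Fit a Gaussian process regressor with an RBF kernel
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.1)
gpr.fit(X_train, y_train)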
# Make predictions
X_test = np.linspace(0, 10, 200).reshape(-1, 1)
y_pred, sigma = gpr.predict(X_test, return_std=True)
# y_pred contains the mean predictions, sigma contains the standard deviations
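(ii) Seaborn is a high-level visualization library built on top of Matplotlib that offers attractive default styles and convenient statistical plotting functions (distribution plots, categorical plots, heatmaps, pair plots and more). A minimal 2D kernel density plot, with randomly generated sample data assumed for illustration:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Two correlated random variables as sample data
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.5, size=500)
# 2D kernel density estimate plot
sns.kdeplot(x=x, y=y, fill=True, cmap="Blues")
plt.xlabel("x")
plt.ylabel("y")
plt.show()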
16. (a) Given an unsorted multi-index that represents the distance between two cities,
write a Python code snippet using appropriate libraries to find the shortest distance
between any two given cities. The following matrix representation can be used to create
the data frame that serves as an input for the prescribed program.
To find the distance between any two given cities from an unsorted multi-indexed distance
matrix in Python, the pandas and networkx libraries are suitable: pandas can be used to
represent the distance matrix as a DataFrame, and networkx can be used to construct a graph
from this DataFrame and apply graph algorithms such as shortest path.
A sketch of the solution (the distance-matrix entries other than the given last row are assumed for illustration):
import pandas as pd
import networkx as nx
cities = ['A', 'B', 'C', 'D', 'E']
# Symmetric distance matrix between the five cities
matrix = [[0, 12, 10, 15, 21], [12, 0, 14, 7, 8], [10, 14, 0, 6, 9],
          [15, 7, 6, 0, 11], [21, 8, 9, 11, 0]]
df = pd.DataFrame(matrix, index=cities, columns=cities)
# Build a weighted graph directly from the adjacency (distance) DataFrame
G = nx.from_pandas_adjacency(df)
city1 = 'A'
city2 = 'E'
# Find the shortest path length (distance) between the two cities
distance = nx.shortest_path_length(G, source=city1, target=city2, weight='weight')
print(f"Shortest distance between {city1} and {city2}: {distance}")
16.(b) A URL Server wants to consolidate a history of websites visited by a user 'U'.
Every visited website's information is stored in a 2-tuple format, viz. (website_id,
Duration_of_visit), in the URL cache. Using split, apply and combine operations, devise
a code snippet that consolidates the website history and finds out the website whose
duration of visit is maximum.
Example:
Answer:
1. Split: The input data, a list of tuples, is split based on the website_id.
2. Apply: A function (in this case, summation) is applied to each group to calculate the
total Duration_of_visit for each website_id.
3. Combine: The results from the "apply" step are combined to create a new list or
dictionary representing the consolidated history.
# Step 1 & 2: Split and Apply (using defaultdict for grouping and summation)
consolidated_history = defaultdict(int)
consolidated_history[website_id] += duration
# Step 3: Combine (converting the dictionary to a list of tuples for the specified output
format)
output_list = list(consolidated_history.items())
print("Output:", output_list)
max_duration = 0
max_website_id = None
max_duration = total_duration
max_website_id = website_id
• Graph theory has proved to be very effective on large-scale datasets such as social network
data. This is because it can bypass the building of an actual visual representation of the data
and run directly on data matrices.
Audio, Image and Video
• Audio, image and video are data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be
challenging for computers.
• The terms audio and video commonly refer to time-based media storage formats for
sound/music and moving-picture information. Audio and video digital recordings, also
referred to as audio and video codecs, can be uncompressed, lossless compressed or lossy
compressed depending on the desired quality and use cases.
• It is important to remark that multimedia data is one of the most important sources of
information and knowledge; the integration, transformation and indexing of multimedia data
bring significant challenges in data management and analysis. Many challenges have to be
addressed including big data, multidisciplinary nature of Data Science and heterogeneity.
Streaming Data
Streaming data is data that is generated continuously by thousands of data sources,
which typically send in the data records simultaneously and in small sizes (order of
Kilobytes).
Streaming data includes a wide variety of data such as log files generated by
customers using your mobile or web applications, ecommerce purchases, in-game player
activity, information from social networks, financial trading floors or geospatial services and
telemetry from connected devices or instrumentation in data centers.
(Or)
11.(b) Explain in detail about the cleansing, integration, transforming data and build
amodel.
Data Cleaning
• Data is cleansed through processes such as filling in missing values, smoothing the noisy
data or resolving the inconsistencies in the data.
• Data cleaning tasks are as follows:
1. Data acquisition and metadata
2. Fill in missing values
3. Unified date format
4. Converting nominal to numeric
5. Identify outliers and smooth out noisy data
6. Correct inconsistent data
• Data cleaning is the first step in data pre-processing; it is used to find missing values,
smooth noisy data, recognize outliers and correct inconsistencies.
• Missing value: Dirty data of this kind affects the mining procedure and leads to unreliable
and poor output, so data cleaning routines are important. For example, suppose that the
average salary of staff is Rs. 65,000/-; this value can be used to replace a missing value for
salary.
• Data entry errors: Data collection and data entry are error-prone processes. They often
require human intervention and because humans are only human, they make typos or lose
their concentration for a second and introduce an error into the chain. But data collected by
machines or computers isn't free from errors either. Some errors arise from human sloppiness,
whereas others are due to machine or hardware failure. Examples of errors originating from
machines are transmission errors or bugs in the extract, transform and load (ETL) phase.
• Whitespace error: Whitespaces tend to be hard to detect but cause errors like other
redundant characters would. To remove the spaces present at start and end of the string, we
can use strip() function on the string in Python.
• Fixing capital letter mismatches: Capital letter mismatches are common problem. Most
programming languages make a distinction between "Chennai" and "chennai".
• Python provides string conversion like to convert a string to lowercase, uppercase using
lower(), upper().
• The lower() Function in python converts the input string to lowercase. The upper() Function
in python converts the input string to uppercase.
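A small illustration (the sample string is assumed):
s = "  Chennai  "
print(s.strip())           # 'Chennai' – leading and trailing whitespace removed
print(s.strip().lower())   # 'chennai'
print(s.strip().upper())   # 'CHENNAI'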
Outlier
• Outlier detection is the process of detecting and subsequently excluding outliers from a
given set of data. The easiest way to find outliers is to use a plot or a table with the minimum
and maximum values.
• An outlier may be defined as a piece of data or observation that deviates drastically from the
given norm or average of the data set. An outlier may be caused simply by chance, but it may
also indicate measurement error or that the given data set has a heavy-tailed distribution.
Combining Data from Different Data Sources
1. Joining table
• Joining tables allows user to combine the information of one observation found in one table
with the information that we find in another table. The focus is on enriching a single
observation.
• A primary key is a value that cannot be duplicated within a table. This means that one value
can only be seen once within the primary key column. That same key can exist as a foreign
key in another table which creates the relationship. A foreign key can have duplicate
instances within a table.
2. Appending tables
• Appending tables is also called stacking tables: it effectively adds observations from one table to
another table. The result of appending these tables is a larger one with the observations from
Table 1 as well as Table 2. The equivalent operation in set theory is the union, and this
is also the command in SQL, the common language of relational databases. Other set
operators, such as set difference and intersection, are also used in data science.
3. Using views to simulate data joins and appends
• Duplication of data can be avoided by using a view instead of appending: an appended table
requires more storage space, and if the table size is in terabytes of data, it becomes problematic
to duplicate it. For this reason, the concept of a view was invented.
Transforming Data
• In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Relationships between an input variable and an output variable aren't always linear.
• Reducing the number of variables: Having too many variables in the model makes the
model difficult to handle and certain techniques don't perform well when user overload them
with too many input variables.
• All the techniques based on a Euclidean distance perform well only up to 10 variables. Data
scientists use special methods to reduce the number of variables but retain the maximum
amount of data.
Euclidean distance :
• Euclidean distance is used to measure the similarity between observations. It is calculated as
the square root of the sum of differences between each point.
Euclidean distance = √((X1 − X2)² + (Y1 − Y2)²)
Turning variables into dummies:
• Variables can be turned into dummy variables. Dummy variables can only take two values:
true (1) or false (0). They're used to indicate the absence or presence of a categorical effect that
may explain the observation.
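A small pandas sketch (the example column is assumed):
import pandas as pd
df = pd.DataFrame({'city': ['Chennai', 'Mumbai', 'Chennai']})
dummies = pd.get_dummies(df['city'])   # one indicator column per category
print(dummies)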
12.(a).(ii) Using standard normal curve table, find the proportion of the total area
identified with the following statements.
1. Above a Z score of 1.80
2. Between the mean and a z score of 1.65
3. Between z scores of 0 and -1.96
Answer:
1. Above a z score of 1.80: the area to the left of z = 1.80 is 0.9641, so the area above is 1 − 0.9641 = 0.0359.
2. Between the mean and a z score of 1.65: 0.9505 − 0.5000 = 0.4505.
3. Between z scores of 0 and −1.96: by symmetry this equals the area between 0 and 1.96, i.e. 0.9750 − 0.5000 = 0.4750.
(Or)
12.(b) (i) Describe the types of variables.
(ii). Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with
the following results:
Age 23 27 39 49 50 52 54 56 57 58 60
% fat 9.5 17.8 31.4 27.2 31.2 34.6 42.5 33.4 30.2 34.1 41
Draw the boxplot for age.
Formula: Range = Maximum value − Minimum value
Example:
For the ages: 23, 27, 39, 49, 50, 52, 54, 56, 57, 58, 60
Range=60−23=37
Interpretation:
A larger range indicates greater variability in the data.
Variance
Variance measures the average squared deviation of each data point from the mean.
It gives a more accurate picture of data spread than range.
Formula (sample variance): s² = Σ(xᵢ − x̄)² / (n − 1), where x̄ is the sample mean and n is the number of values. (For a population, σ² = Σ(xᵢ − µ)² / N.)
Example (for the values 10, 12, 14):
Mean x̄ = (10 + 12 + 14)/3 = 12
Variance s² = [(10 − 12)² + (12 − 12)² + (14 − 12)²] / (3 − 1) = (4 + 0 + 4)/2 = 4
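For the boxplot of the ages listed above, a minimal matplotlib sketch (using the age values shown in the table):
import matplotlib.pyplot as plt
ages = [23, 27, 39, 49, 50, 52, 54, 56, 57, 58, 60]
plt.boxplot(ages)
plt.ylabel("Age")
plt.title("Boxplot of age")
plt.show()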
(Or)
Properties of r
When r is close to 0, there is little relationship between the variables; the farther r is from 0,
in either the positive or negative direction, the greater the relationship between the two
variables.
The sign of r indicates the type of linear relationship, whether positive or negative.
The numerical value of r, without regard to sign, indicates the strength of the linear
relationship.
A number with a plus sign (or no sign) indicates a positive relationship, and a number with
a minus sign indicates a negative relationship
The least squares equation, also known as the least squares regression, is a method for finding
the best-fitting line (or curve) to a set of data points by minimizing the sum of the squares of
the differences between the observed values and the values predicted by the line. In simpler
terms, it finds the line that's closest to all the data points, minimizing the overall "error".
Explanation:
1. The Goal:
The primary goal is to find a line (or curve) that best represents the relationship between two
or more variables in a dataset. This line is often represented by an equation like y = a + bx,
where 'a' is the y-intercept and 'b' is the slope.
3. Why Squares?
Squaring the errors ensures that all differences are positive, preventing negative errors from
cancelling out positive errors and leading to a more accurate representation of the overall
error magnitude.
The method calculates the values for 'a' and 'b' that result in the smallest possible sum of
squared errors, thus determining the line of best fit.
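A small numeric sketch of the closed-form least-squares estimates (the x and y values are assumed for illustration):
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 4.1, 5.9, 8.2, 9.9])
# Slope b and intercept a that minimize the sum of squared errors
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(f"Best-fitting line: y = {a:.2f} + {b:.2f}x")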
1. Split: The data is divided into groups based on a specific key or set of keys (e.g., a
column or multiple columns in a DataFrame). Each unique value in the key column
becomes a separate group.
2. Apply: A function is applied to each individual group. This function can be an
aggregation (like sum(), mean(), or count()), a transformation (which returns a
result with the same size as the group), or a filtration (which discards certain groups).
3. Combine: The results of the applied function from each group are combined into a
new data structure (usually a DataFrame or Series).
Let's illustrate grouping with a practical example using a pandas DataFrame. First, you need
to have the pandas library installed (pip install pandas).
python
import pandas as pd
# Sample sales data (values assumed for illustration, consistent with the grouped output shown below)
data = {'Product': ['A', 'B', 'A', 'C', 'B'], 'Sales': [100, 150, 210, 80, 200]}
df = pd.DataFrame(data)
# Group by 'Product' and calculate the sum of 'Sales' for each product
grouped_sales = df.groupby('Product')['Sales'].sum()
print(grouped_sales)
Explanation:
df.groupby('Product'): This splits the DataFrame df into groups based on the unique
values in the 'Product' column ('A', 'B', 'C').
['Sales']: This selects the 'Sales' column within each of these groups for the
subsequent operation.
.sum(): This applies the sum aggregation function to the 'Sales' column of each
product group.
The output grouped_sales will be a Series showing the total sales for each product:
Product
A 310
B 350
C 80
Function Description
.sum() Sum of values
.mean() Average
.count() Count of entries
.max() Maximum value
.min() Minimum value
(Or)
14. (b) Explain the following in Python
(i) Data Indexing
(ii) Operation on missing data
Answer:
(i) Data Indexing in Python
Data indexing is a fundamental concept in Python, especially when working with data
structures like lists, tuples, dictionaries, and more complex structures from libraries like
pandas. It refers to the process of selecting and accessing specific data elements or subsets of
data.
Lists and Tuples: These are sequence types, and you access elements by their integer
position (index), starting from 0.
Python
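# A minimal sketch; the list values below are assumed for illustration
my_list = [10, 20, 30]
print(my_list[0])    # 10 -> first element (indexing starts at 0)
print(my_list[-1])   # 30 -> negative indices count from the end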
Dictionaries: Dictionaries are key-value pairs, and you access values by their
associated key.
Python
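# A minimal sketch; the dictionary contents below are assumed for illustration
my_dict = {'name': 'Kalai', 'dept': 'CSE'}
print(my_dict['dept'])         # 'CSE' -> access a value by its key
print(my_dict.get('age', 0))   # 0 -> .get() returns a default when the key is missing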
In the context of data analysis, indexing becomes more sophisticated with the pandas
library. pandas DataFrames and Series have powerful indexing capabilities, allowing for
selection based on position, labels, or boolean conditions.
Label-based indexing (.loc): Used to select data by row and column labels.
Python
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])
print(df.loc['y', 'B']) # Output: 5
print(df.loc[['x', 'z'], 'A']) # Output: Series with values for 'x' and 'z' in column 'A'
Position-based indexing (.iloc): Used to select data by integer position, similar to lists.
Python
df = pd.DataFrame({'A': [10, 20, 30], 'B': [1, 2, 3]})
print(df.iloc[0, 1])       # Output: 1 (first row, second column by position)
Boolean indexing: Used to select rows that satisfy a condition on the data.
Python
print(df[df['A'] > 15])    # Output: Rows where column 'A' is greater than 15
(ii) Operations on Missing Data
Missing data, often represented as NaN (Not a Number) in pandas, is a common issue in real-
world datasets. Handling it properly is crucial for accurate analysis. pandas provides a
comprehensive set of functions to detect, remove, and fill missing values.
1. Detecting Missing Data
isnull(): Returns a boolean DataFrame or Series with True where the value is
missing.
isna(): An alias for isnull().
notnull(): Returns a boolean DataFrame or Series with True where the value is not
missing.
Python
import numpy as np
data = {'A': [1, 2, np.nan], 'B': [4, np.nan, 6]}
df_missing = pd.DataFrame(data)
print(df_missing.isnull())
Output:
A B
0 False False
1 False True
2 True False
2. Removing Missing Data
Python
# dropna() drops any row that contains a NaN (here rows 1 and 2)
df_dropped = df_missing.dropna()
print(df_dropped)
Output:
A B
0 1.0 4.0
3. Filling Missing Data
Often, you don't want to discard data. Instead, you can fill the missing values with a
substitute.
Python
# Fills NaN values with the mean of each column
df_filled = df_missing.fillna(df_missing.mean())
print(df_filled)
Output:
A B
0 1.0 4.0
1 2.0 5.0
2 1.5 6.0
By using these operations, you can effectively manage and prepare your data for
analysis, ensuring that missing values do not lead to errors or biased results.
pandas also provides merge() for combining DataFrames with different join types (inner, left, right and cross), as illustrated below.
Python
import pandas as pd
# Example DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value2': [5, 6, 7, 8]})
# Inner Join
inner_join_df = pd.merge(df1, df2, on='key', how='inner')
# Left Join
left_join_df = pd.merge(df1, df2, on='key', how='left')
# Right Join
right_join_df = pd.merge(df1, df2, on='key', how='right')
# Cross Join
cross_join_df = pd.merge(df1, df2, how='cross')
(Or)
15.(b).Explain the various features of Matplotlib platform used for data visualization
and illustrate its challenges.
Matplotlib is a foundational and widely used library in Python for creating static,
animated, and interactive visualizations. Its power lies in its extensive features and its role as
the building block for many other high-level plotting libraries.
Wide Variety of Plot Types: Matplotlib supports a broad range of chart types, for example:
o 2D Plots: Line plots, scatter plots, bar charts, histograms, pie charts, box plots.
o 3D Plots: Surface plots, wireframe plots, and 3D scatter plots using the
mpl_toolkits.mplot3d toolkit.
Integration with the Python Ecosystem: Matplotlib seamlessly integrates with other
key libraries in the scientific Python stack, such as NumPy and pandas. This allows
for efficient data manipulation and direct visualization within the same environment.
Many pandas plotting functions, for instance, use Matplotlib under the hood.
Subplots and Layout Management: You can create complex layouts with multiple
plots in a single figure using plt.subplots(). This is essential for comparing
different datasets or different aspects of the same data side-by-side. You have fine-
grained control over the placement, size, and spacing of these subplots.
Export to Various Formats: Matplotlib can save figures in a wide range of formats,
including raster formats like PNG and JPG, and vector formats like PDF, SVG, and
EPS. This is crucial for reports, presentations, and publications, as vector formats are
scalable without loss of quality.
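As a brief illustration of the subplot and layout features described above, a minimal sketch with assumed example data:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 2 * np.pi, 100)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, np.sin(x))
ax1.set_title("sin(x)")
ax2.plot(x, np.cos(x))
ax2.set_title("cos(x)")
fig.tight_layout()
plt.show()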
Challenges of Matplotlib
Despite its powerful features, Matplotlib has some challenges that can be a hurdle for new
users and experienced developers alike.
Steep Learning Curve: Matplotlib's vast functionality and two different APIs can be
confusing for beginners. The distinction between Figure and Axes objects and when
to use the Pyplot vs. the object-oriented approach can be a source of frustration.
Simple tasks are easy, but complex customizations require a deep understanding of
the API.
Verbosity: To achieve a polished, publication-quality plot, a significant amount of
code is often required. Customizing every element—such as labels, titles, tick marks,
and colors—can make the code long and sometimes difficult to read, especially when
compared to higher-level libraries like Seaborn or Plotly.
Not Ideal for Interactive Web Visualizations: While Matplotlib has some
interactive features, it is primarily a library for static plots. For highly interactive,
web-based dashboards and visualizations, libraries like Plotly or Bokeh are often a
better choice, as they are designed from the ground up for this purpose.
A pivot table is a data summarization tool used in data analysis to quickly and
efficiently reorganize, aggregate, and present data. Its primary purpose is to help users gain
insights from large datasets by summarizing them in a new, more understandable format.
In Python, the most common and powerful way to create a pivot table is by using the
pivot_table() function from the pandas library. This function is a highly versatile tool that
goes beyond the basic functionality of a spreadsheet pivot table, offering extensive control
over aggregation and data structure.
A pivot table typically involves four key components that you use to structure your
summary:
Index (Rows): The columns that you want to use as the rows of your new summary
table. These columns define the "groups" you'll be analyzing. The unique values from
these columns will become the row labels of the pivot table.
Columns: The columns that you want to use as the columns of your new summary
table. These are also used for grouping, but the groups are arranged horizontally.
Values: The numerical data (or other data types that can be aggregated) that you want
to summarize. These are the values that will be aggregated in the cells of the pivot
table.
Aggregation Function (aggfunc): The function that is applied to the values for each
group. Common aggregation functions include sum, mean, count, min, max, median,
etc. You can also pass a list of functions to apply multiple aggregations at once.
The creation of a pivot table follows a three-step process, similar to a groupby operation
but with a more structured output:
Splitting: The original data is split into groups based on the unique values in the
index and columns you specify.
Aggregating: A function (aggfunc) is applied to the values of each group. For
example, if you are analyzing sales data by region and product, the aggregation
function would calculate the total sales for each region-product combination.
Reshaping/Combining: The aggregated results are then combined into a new, two-
dimensional table where the unique values of the index columns form the rows and
the unique values of the columns argument form the columns.
Sample Data: Imagine you have sales data for a company with multiple salespersons,
regions, and product categories.
Python
import pandas as pd
import numpy as np
# Sample sales data (all values except Alice's two North/Electronics sales of 1000 and 1100 are assumed for illustration)
data = {'Salesperson': ['Alice', 'Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Region': ['North', 'North', 'South', 'East', 'South', 'North'],
        'Category': ['Electronics', 'Electronics', 'Furniture', 'Electronics', 'Electronics', 'Furniture'],
        'Sales': [1000, 1100, 500, 800, 700, 300]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print("-" * 50)
Now, let's create a pivot table to answer a question like: "What are the total sales for
each product category, broken down by region?"
Python
# Create a pivot table to get total sales by region and category
pivot_sales = df.pivot_table(
index='Region', # Use 'Region' as the rows
columns='Category', # Use 'Category' as the columns
values='Sales', # Aggregate the 'Sales' data
aggfunc='sum' # The aggregation function is 'sum'
)
print("Pivot Table: Total Sales by Region and Category")
print(pivot_sales)
Rows (index): The rows are the unique regions (East, North, South).
Columns (columns): The columns are the unique categories (Electronics,
Furniture).
Values (values): The cells contain the sum of sales for each combination. For
example, the value at Region='North' and Category='Electronics' is 2100,
which is the sum of Alice's two sales in that category and region (1000 + 1100).
NaN Values: If a combination of index and columns has no corresponding data in the
original DataFrame (e.g., no sales of furniture in the 'East' region), a NaN value would
appear. This can be handled with the fill_value parameter.
Multiple Aggregation Functions: You can pass a list of functions to aggfunc to apply several aggregations at once.
Python
pivot_multiple_agg = df.pivot_table(
index='Region',
columns='Category',
values='Sales',
aggfunc=['sum', 'mean']
)
print("\nPivot Table with Multiple Aggregations:")
print(pivot_multiple_agg)
Multiple Indices/Columns: You can use multiple columns for rows and/or columns to create
a hierarchical index.
Python
pivot_multi_index = df.pivot_table(
index=['Region', 'Salesperson'],
values='Sales',
aggfunc='sum'
)
print("\nPivot Table with Multiple Indices:")
print(pivot_multi_index)
fill_value: You can replace NaN values with a specified value (e.g., 0) to make the table
cleaner.
Python
pivot_filled = df.pivot_table(
index='Region',
columns='Category',
values='Sales',
aggfunc='sum',
fill_value=0
)
print("\nPivot Table with fill_value=0:")
print(pivot_filled)
A pivot table is an indispensable tool for summarizing and analyzing data, and pandas'
pivot_table() function provides a flexible and powerful way to create these summaries in
Python. It is a go-to method for transforming long-format data into wide-format, a process
that is essential for many data analysis tasks.
(Or)
16. (b) Find the following for the given data set:
Mean, Median, Mode, Variance, Standard deviation and Skewness
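A general sketch of computing these measures with pandas (the sample values below are assumed for illustration):
import pandas as pd
s = pd.Series([2, 4, 4, 5, 7, 9, 12])
print("Mean:", s.mean())
print("Median:", s.median())
print("Mode:", s.mode().tolist())
print("Variance:", s.var())            # sample variance (divides by n - 1)
print("Standard deviation:", s.std())
print("Skewness:", s.skew())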