2nd Semester - Data Science
Course Objectives:
1. To describe the data for the data science process, become familiar with the data science process and its steps, study the usage of various data sources, and develop ETL pipelines for data preparation using Spark on Databricks.
2. To apply statistical concepts to summarize and analyze data, and to understand hypothesis testing and perform statistical inference.
3. To analyze the applicability of data science in real-time applications and work with various data analytics charts.
4. To grasp the fundamental principles of supervised and unsupervised machine learning algorithms.
5. To utilize Python libraries for data wrangling, understand the various calculations and best practices, and present and interpret data using the Python visualization libraries used for data science.
Teaching-Learning Process
Pedagogical Initiatives:
Some sample strategies to accelerate the attainment of the various course outcomes are listed below:
• Adopt different teaching methods to attain the course outcomes. The lecture method (L) need not be limited to traditional lectures; alternative effective teaching methods may be adopted to attain the outcomes.
• Use videos and animations to demonstrate and explain various concepts in Data Science.
• Encourage collaborative (group) learning in the class to promote team building.
• Ask at least three HOTS (Higher-Order Thinking Skills) questions per module to promote critical thinking.
• Adopt Problem-Based Learning (PBL), which fosters students' analytical skills and develops thinking skills such as evaluating, generalizing, and analyzing information rather than simply recalling it.
• Introduce topics in manifold representations.
• Show different ways to solve the same problem and encourage the students to come up with their own creative and optimal solutions.
• Discuss case studies that map to real-world scenarios, and show how each concept can be applied to the real world to improve the students' understanding.
• Devise innovative pedagogy to improve the Teaching-Learning Process (TLP).
What is Data Science? Big Data and Data Science hype – and getting past the hype; Why now? – Datafication; the current landscape of perspectives; skill sets needed. Statistical Inference: populations and samples, statistical modelling, probability distributions, fitting a model.
Data Science: benefits and uses – facets of data. Data Science Process: overview – defining research goals – retrieving data – data preparation – exploratory data analysis – building the model – presenting findings and building applications. Data Mining – Data Warehousing – basic statistical descriptions of data.
PREPARING AND GATHERING DATA AND KNOWLEDGE:
Philosophies of data science - Data science in a big data world - Benefits and uses of data science and big data - Facets of data: structured data, unstructured data, natural language, machine-generated data, audio, image and video streaming data - The big data ecosystem: distributed file systems, distributed programming frameworks, data integration frameworks, machine learning frameworks, NoSQL databases, scheduling tools, benchmarking tools, system deployment, service programming and security. Big Data Fundamentals: definition and characteristics of big data (volume, variety, velocity), impact of big data on different industries, challenges of processing big data with traditional methods. Apache Spark, a distributed processing engine: introduction to Spark and its distributed nature, components of the Spark ecosystem (Spark Core, Spark SQL, Spark Streaming), benefits of using Spark for data processing. Building ETL Pipelines with Spark on Databricks: introduction to the ETL process (Extract, Transform, Load), setting up a Databricks workspace (free tier available), connecting to data sources and data ingestion techniques in Spark, data transformation and manipulation using Spark DataFrames/Datasets (a minimal PySpark sketch follows at the end of this unit).
Pedagogy - Group activity: summarize testable predictions for real-time data; data can be taken from Kaggle datasets and other open-source GitHub repositories and data repositories.
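The ETL flow described in this unit can be compressed into a few lines of PySpark. The following is a minimal sketch, not a full pipeline: the file name and the column names (amount, date) are placeholders, and the write target is a local Parquet path rather than DBFS.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: ingest a CSV file into a Spark DataFrame (placeholder file name).
raw = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# Transform: drop incomplete rows, type the amount column, keep valid rows.
clean = (raw.dropna(subset=["amount"])
            .withColumn("amount", F.col("amount").cast("double"))
            .filter(F.col("amount") > 0))

# Load: aggregate and persist the result (on Databricks this could be a DBFS path).
daily = clean.groupBy("date").agg(F.sum("amount").alias("total_amount"))
daily.write.mode("overwrite").parquet("transactions_daily.parquet")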
DESCRIBING DATA (8 Hours)
Types of Data – Types of Variables – Describing Data with Tables and Graphs – Describing Data with Averages – Describing Variability – Normal Distributions and Standard (z) Scores.
Module 2: THE DATA SCIENCE PROCESS
Overview of the data science process – defining research goals and creating a project charter, retrieving data, cleansing, integrating and transforming data, exploratory data analysis, building the models, presenting findings and building applications on top of them.
Deep Dive into ETL with Spark – Data Ingestion and Cleaning: techniques for handling various data formats (text files, CSV, JSON), addressing common data quality issues (missing values, inconsistencies). Data Transformation with Spark Functions: working with Spark DataFrames/Datasets and applying transformation functions – filtering, aggregating, and manipulating data for analysis; joining datasets for comprehensive analysis. Data Quality Checks and Missing Value Handling (2 Hours): implementing data quality checks to identify errors and inconsistencies, techniques for handling missing values (imputation, deletion), ensuring data integrity for reliable analysis. Introduction to Apache Spark SQL: declarative data querying with Spark SQL using SQL-like syntax, integrating Spark SQL with Spark DataFrames/Datasets, performing complex queries on large datasets efficiently.
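To make the missing-value handling and Spark SQL topics concrete, here is a brief illustrative sketch under assumed inputs (a hypothetical events.json file with user_id and amount fields): it imputes a numeric column, drops rows missing a key field, and then queries the DataFrame declaratively.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quality-sketch").getOrCreate()
df = spark.read.json("events.json")        # placeholder input file

# Impute a numeric column, then drop rows still missing a key field.
df = df.fillna({"amount": 0.0}).dropna(subset=["user_id"])

# Register a temporary view and query it declaratively with Spark SQL.
df.createOrReplaceTempView("events")
summary = spark.sql("""
    SELECT user_id, COUNT(*) AS n_events, AVG(amount) AS avg_amount
    FROM events
    GROUP BY user_id
    ORDER BY n_events DESC
""")
summary.show(10)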
Decision Trees
What Is a Decision Tree?, Entropy, The Entropy of a Partition, Creating a Decision Tree, Putting
It All Together, Random Forests, Neural Networks, Perceptrons, Feed-Forward Neural
Networks, Backpropagation, Example: Fizz Buzz, Deep Learning, The Tensor, The Layer
Abstraction, The Linear Layer, Neural Networks as a Sequence of Layers, Loss and
Optimization, Example: XOR Revisited, Other Activation Functions, Example: Fizz Buzz
Revisited, Softmaxes and Cross-Entropy, Dropout, Example: MNIST, Saving and Loading
Models, Clustering, The Idea, The Model, Example: Meetups, Choosing k, Example: Clustering
Colors, Bottom-Up Hierarchical Clustering.
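The entropy quantities listed above are small enough to state directly in code. This is a from-scratch sketch in the spirit of the unit: the entropy of a set of class labels, and the weighted entropy of a partition, which an ID3-style decision tree minimizes when choosing a split.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def partition_entropy(subsets):
    # Weighted average entropy over the subsets of a partition.
    total = sum(len(s) for s in subsets)
    return sum(entropy(s) * len(s) / total for s in subsets)

print(entropy(["Yes", "Yes", "No"]))                # ~0.918 bits
print(partition_entropy([["Yes", "Yes"], ["No"]]))  # 0.0: a pure split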
Pedagogy - Blended learning: data collection from Kaggle and other repositories and performing case studies.
Module 3: Feature Generation and Feature Selection (8 Hours)
Extracting Meaning from Data – motivating application: user (customer) retention. Feature Generation (brainstorming, role of domain expertise, and place for imagination), Feature Selection algorithms: filters, wrappers, decision trees, random forests. Recommendation Systems: building a user-facing data product, algorithmic ingredients of a recommendation engine, dimensionality reduction, Singular Value Decomposition, Principal Component Analysis. Exercise: build your own recommendation system.
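As a starting point for the recommendation-system exercise, the sketch below applies truncated SVD to a small, made-up user-item ratings matrix; the rank-k reconstruction smooths the unobserved zeros into score estimates. The matrix values are illustrative only.

import numpy as np

# Hypothetical 4-user x 4-item ratings matrix (0 = unrated).
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [0, 1, 5, 4],
                    [1, 0, 4, 5]], dtype=float)

# Factorize and keep only the top-k singular values/vectors.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The rank-k reconstruction provides estimated scores for unrated items.
print(np.round(approx, 2))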
Model Development: Simple and Multiple Regression – Model Evaluation using Visualization – Residual Plot – Distribution Plot – Polynomial Regression and Pipelines – Measures for In-sample Evaluation – Prediction and Decision Making.
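A possible shape for the polynomial-regression-with-pipelines topic, on synthetic data: fit a degree-2 pipeline and compute in-sample measures (MSE and R²). Residual and distribution plots would then be built from y - pred with Matplotlib or Seaborn.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic quadratic data with noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X.ravel() ** 2 - X.ravel() + rng.normal(0, 0.3, 100)

# Pipeline: polynomial feature expansion followed by linear regression.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
pred = model.predict(X)
print("MSE:", mean_squared_error(y, pred), "R^2:", r2_score(y, pred))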
CASE STUDIES: Distributing data storage and processing with frameworks – case study (e.g., assessing risk when lending money).
The Data Ecosystem
This unit covers the different types of data structures, file formats, sources of data, and the languages data professionals use in their day-to-day tasks; the various types of data repositories, such as databases, data warehouses, data marts, data lakes, and data pipelines; the Extract, Transform, and Load (ETL) process, which is used to extract, transform, and load data into data repositories; and a basic understanding of Big Data and Big Data processing tools such as Hadoop, the Hadoop Distributed File System (HDFS), Hive, and Spark.
List of Programs:
1. Data Ingestion: Load a large dataset (e.g., a CSV file containing transaction data) into Databricks.
• Basic Data Exploration: Use Spark DataFrames to explore the dataset, and perform basic operations like filtering, grouping, and aggregating data.
• ETL Pipeline: Build an ETL pipeline to clean and transform the data, and save the transformed data back to a storage system (e.g., DBFS).
2. Statistical Analysis. Tasks:
1. Descriptive Statistics:
o Calculate measures of central tendency (mean, median, mode)
and dispersion (variance, standard deviation) for a dataset.
2. Probability Distributions:
o Analyze a dataset to identify its underlying probability
distribution (e.g., normal, binomial).
o Visualize the distribution using histograms and probability
plots.
3. Hypothesis Testing:
o Formulate null and alternative hypotheses for a given problem.
o Perform hypothesis testing (e.g., t-test, chi-square test) and interpret the results (a minimal sketch follows this list).
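A minimal sketch of the hypothesis-testing task with SciPy, using synthetic data: a two-sample t-test of equal means at the 5% significance level. The group sizes, means and seed are assumptions for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=50.0, scale=5.0, size=40)   # synthetic sample A
group_b = rng.normal(loc=52.0, scale=5.0, size=40)   # synthetic sample B

# H0: the two groups have equal means; H1: the means differ (two-sided).
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")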
3 a. Train a regularized logistic regression classifier on the iris dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ or the built-in iris dataset) using sklearn. Train the model with the hyperparameter C = 1e4 and report the best classification accuracy.
b. Train an SVM classifier on the iris dataset using sklearn. Try different kernels and the associated hyperparameters. Train the model with the following set of hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest classifier, no feature normalization. Also try C = 0.01, 1, 10. For the above set of hyperparameters, find the best classification accuracy along with the total number of support vectors on the test data.
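One possible solution outline for 3a and 3b with the built-in iris dataset; the 70/30 split and the random seed are assumptions, not part of the exercise statement.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# (a) Regularized logistic regression with C = 1e4.
logreg = LogisticRegression(C=1e4, max_iter=1000).fit(X_tr, y_tr)
print("logistic regression accuracy:", logreg.score(X_te, y_te))

# (b) RBF-kernel SVM, gamma = 0.5, one-vs-rest, no feature normalization.
for C in (0.01, 1, 10):
    svm = SVC(kernel="rbf", gamma=0.5, C=C,
              decision_function_shape="ovr").fit(X_tr, y_tr)
    print(f"C={C}: accuracy={svm.score(X_te, y_te):.3f}, "
          f"support vectors={svm.n_support_.sum()}")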
4 A. Consider the following dataset. Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Price  Maintenance  Capacity  Airbag  Profitable
Low    Low          2         No      Yes
Low    Med          4         Yes     Yes
Low    Low          4         No      Yes
Low    Med          4         No      No
Low    High         4         No      No
Med    Med          4         No      No
Med    Med          4         Yes     Yes
Med    High         2         Yes     No
Med    High         5         No      Yes
High   Med          4         Yes     Yes
High   Med          2         Yes     Yes
High   High         2         Yes     No
High   High         5         Yes     Yes
B. Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the dataset correspond to the coordinates of each data point; the third column corresponds to the actual cluster label. Compute the Rand index for the following methods (a sketch follows below):
o K-means clustering
o Single-link hierarchical clustering
o Complete-link hierarchical clustering
Also visualize the dataset and determine which algorithm will be able to recover the true clusters.
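A sketch of 4B with scikit-learn, assuming spiral.txt holds whitespace-separated columns x, y, label. rand_score implements the Rand index (adjusted_rand_score is the chance-corrected variant).

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import rand_score

data = np.loadtxt("spiral.txt")        # assumed columns: x, y, true label
X, labels = data[:, :2], data[:, 2]
k = len(np.unique(labels))             # number of true clusters

methods = {
    "k-means": KMeans(n_clusters=k, n_init=10),
    "single-link": AgglomerativeClustering(n_clusters=k, linkage="single"),
    "complete-link": AgglomerativeClustering(n_clusters=k, linkage="complete"),
}
for name, model in methods.items():
    print(name, rand_score(labels, model.fit_predict(X)))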
5 A. Import any CSV file to a Pandas DataFrame and perform the following:
(a) Visualize the first and last 10 records.
(b) Get the shape, index and column details.
(c) Select/delete records (rows)/columns based on conditions.
(d) Perform ranking and sorting operations.
(e) Perform the required statistical operations on the given columns.
(f) Find the count and uniqueness of the given categorical values.
(g) Rename single/multiple columns.
B. Import any CSV file to a Pandas DataFrame and perform the following (a sketch of these steps follows below):
(a) Handle missing data by detecting and dropping/filling missing values.
(b) Transform data using the apply() and map() methods.
(c) Detect and filter outliers.
(d) Perform vectorized string operations on Pandas Series.
(e) Visualize data using line plots, bar plots, histograms, density plots and scatter plots.
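An abbreviated pandas sketch of steps (a)-(d) of part B; the file name and the price/city columns are placeholders, not a prescribed dataset.

import pandas as pd

df = pd.read_csv("data.csv")                     # placeholder file name

# (a) Detect missing values, fill a numeric column, drop what remains.
print(df.isna().sum())
df["price"] = df["price"].fillna(df["price"].median())
df = df.dropna()

# (b) Transform data with apply() and map().
df["price_band"] = df["price"].apply(lambda v: "high" if v > 100 else "low")
df["price_flag"] = df["price_band"].map({"high": 1, "low": 0})

# (c) Detect and filter outliers beyond 3 standard deviations.
z = (df["price"] - df["price"].mean()) / df["price"].std()
df = df[z.abs() <= 3]

# (d) Vectorized string operations on a Series.
df["city"] = df["city"].str.strip().str.title()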
Tasks:
a. Univariate Analysis: Analyze the distribution of individual variables using histograms and boxplots.
b. Multivariate Analysis: Explore relationships between pairs of variables using scatter plots and correlation matrices.
c. Data Visualization: Create various visualizations (bar charts, line charts, heatmaps) using Matplotlib and Seaborn, and customize the visualizations to effectively communicate insights.
d. Feature Engineering: Perform feature scaling and encoding of categorical variables, and create new features from existing data to enhance model performance (a sketch follows this list).
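For the feature-engineering task, a compact scikit-learn sketch on an illustrative three-row frame: scaling numeric columns, one-hot encoding a categorical column, and deriving a new feature. All column names and values are made up.

import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({"age": [22, 35, 58],
                   "income": [28_000, 54_000, 91_000],
                   "segment": ["a", "b", "a"]})

# Scale numeric features to zero mean and unit variance.
scaled = StandardScaler().fit_transform(df[["age", "income"]])

# One-hot encode the categorical column (dense array for inspection).
encoded = OneHotEncoder().fit_transform(df[["segment"]]).toarray()

# Derive a new feature from existing columns.
df["income_per_year_of_age"] = df["income"] / df["age"]
print(scaled, encoded, df, sep="\n")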
6 A. Reading data from text files, Excel and the web and exploring various
commands for doing descriptive analytics on the Iris data set.
B. Use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the following:
1. Univariate analysis: Frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis.
2. Bivariate analysis: Linear and logistic regression modeling
3. Multiple Regression analysis
4. Also compare the results of the above analysis for the two data sets.
C. Apply and explore various plotting functions on UCI data sets.
1. Normal curves
2. Density and contour plots
3. Correlation and scatter plots
4. Histograms
5. Three-dimensional plotting
D. Visualizing Geographic Data with Basemap
7 A. Supervised Learning with Scikit-learn. Objective: implement and evaluate supervised learning algorithms.
Tasks:
1. Data Preparation:
o Split a dataset into training and testing sets.
2. Linear Regression:
o Implement a linear regression model to predict a continuous
target variable.
o Evaluate the model's performance using metrics like mean
squared error (MSE).
3. Decision Trees:
o Build a decision tree classifier to predict a categorical target
variable.
o Assess the model's accuracy, precision, and recall.
4. K-Nearest Neighbors (KNN):
o Implement a KNN model for classification.
o Tune the hyperparameters (e.g., the number of neighbors) to optimize performance (a tuning sketch follows this list).
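A short sketch of tasks 1 and 4 (the split plus KNN hyperparameter tuning) on the built-in iris data; the split ratio, seed and search grid are assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=42)

# Tune the number of neighbors with 5-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": list(range(1, 16))}, cv=5)
search.fit(X_tr, y_tr)
print("best k:", search.best_params_["n_neighbors"])
print("test accuracy:", search.best_estimator_.score(X_te, y_te))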
Tasks:
i. K-Means Clustering:
a. Apply K-means clustering to a dataset to group similar data
points.
b. Visualize the clusters and interpret the results.
ii. Principal Component Analysis (PCA):
a. Perform PCA on a high-dimensional dataset to reduce its
dimensionality.
b. Analyze the principal components and their contribution to
variance.
iii. Data Exploration:
a. Use unsupervised learning techniques to uncover hidden
patterns and insights within the data.
Iris is a particularly famous toy dataset (i.e. a dataset with a small number of rows and columns, mostly used for initial small-scale tests and proofs of concept). This specific dataset contains information about the Iris, a genus that includes 260-300 species of plants. The Iris dataset contains measurements for 150 Iris flowers, each belonging to one of three species: Virginica, Versicolor and Setosa (50 flowers for each of the three species). Each of the 150 flowers contained in the Iris dataset is represented by 5 values:
• Sepal length, in cm
• Sepal width, in cm
• Petal length, in cm
• Petal width, in cm
• Iris species, one of: Iris-setosa, Iris-versicolor, Iris-virginica.
Each row of the dataset represents a distinct flower (as such, the dataset will have 150 rows). Each row then contains 5 values (4 measurements and a species label). The dataset is described in more detail on the UCI Machine Learning Repository website. The dataset can either be downloaded directly from there (iris.data file), or from a terminal, using the wget tool. The following command downloads the dataset from the original URL and stores it in a file named iris.csv.
$ wget "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" -O iris.csv
MNIST Dataset
The MNIST dataset is another particularly famous dataset, available as a CSV file. It contains several thousand handwritten digits (0 to 9). Each handwritten digit is contained in a 28 × 28 8-bit grayscale image. This means that each digit has 784 (28²) pixels, and each pixel has a value that ranges from 0 (black) to 255 (white). The dataset can be downloaded from the following URL: https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv.
Each row of the MNIST dataset represents a digit. For the sake of simplicity, this dataset contains only a small fraction (10,000 digits out of 70,000) of the real MNIST dataset, which is known as the MNIST test set. For each digit, 785 values are available.
1. Load the Iris dataset as a list of lists (each of the 150 lists should have 5 elements). Compute and print the mean and the standard deviation for each of the 4 measurement columns (i.e. sepal length and width, petal length and width). Then compute and print the mean and the standard deviation for each of the 4 measurement columns, separately for each of the three Iris species (Versicolor, Virginica and Setosa). Which measurement would you consider "best" if you were to guess the Iris species based only on those four values? (CO3, CO4, CO5)
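One way to approach exercise 1 using only the standard library, as the list-of-lists phrasing suggests; iris.csv is the file produced by the wget command above.

import csv
import statistics

with open("iris.csv") as f:
    rows = [row for row in csv.reader(f) if row]   # skip trailing blank lines

names = ["sepal length", "sepal width", "petal length", "petal width"]
species = sorted({row[4] for row in rows})

for sp in [None] + species:          # None means "all 150 flowers"
    subset = rows if sp is None else [r for r in rows if r[4] == sp]
    print(sp or "all species")
    for i, name in enumerate(names):
        values = [float(r[i]) for r in subset]
        print(f"  {name}: mean={statistics.mean(values):.2f}, "
              f"std={statistics.stdev(values):.2f}")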
2. Load the MNIST dataset. Create a function that, given a position 1 ≤ k ≤ 10,000, prints the k-th digit of the dataset (i.e. the k-th row of the CSV file) as a grid of 28 × 28 characters. More specifically, you should map each range of pixel values to the following characters:
[0, 64) → " "
[64, 128) → "."
[128, 192) → "*"
[192, 256) → "#"
Then compute the Euclidean distance between each pair of the 784-dimensional vectors of the digits at the following positions: 26th, 30th, 32nd, 35th. Based on the distances computed in the previous step, and knowing that the digits listed are 7, 0, 1, 1, can you assign the correct label to each of the digits? (CO3, CO4, CO5)
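A sketch for exercise 2. It assumes the label is the first value of each row (so the last 784 values are the pixels); integer division by 64 maps each pixel directly onto the four characters above.

import csv
import math

with open("mnist_test.csv") as f:
    rows = [row for row in csv.reader(f) if row]

def pixels(k):
    # k is 1-indexed; keep the last 784 values (assumes the label comes first).
    return [int(v) for v in rows[k - 1]][-784:]

def print_digit(k):
    chars = " .*#"   # [0,64)->" ", [64,128)->".", [128,192)->"*", [192,256)->"#"
    px = pixels(k)
    for r in range(28):
        print("".join(chars[px[r * 28 + c] // 64] for c in range(28)))

def dist(a, b):
    # Euclidean distance between two 784-dimensional digit vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(pixels(a), pixels(b))))

print_digit(26)
for a, b in [(26, 30), (26, 32), (26, 35), (30, 32), (30, 35), (32, 35)]:
    print(a, b, round(dist(a, b), 1))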
3. Split the Iris dataset into two datasets – IrisTest_TrainData.csv and IrisTest_TestData.csv. Read them as two separate data frames named Train_Data and Test_Data respectively. Answer the following questions: (CO3, CO4, CO5)
How many missing values are there in Train_Data?
What is the proportion of Setosa types in the Test_Data?
What is the accuracy score of the K-Nearest Neighbor model (model_1) with 2 or 3 neighbors using Train_Data and Test_Data?
Identify the list of indices of misclassified samples from model_1.
Build a logistic regression model (model_2), keeping the modelling steps constant. Find the accuracy of model_2.
4. Demonstrate any clustering model and evaluate its performance on the Iris dataset. (CO3, CO4, CO5)
Text Books
Sl. No. / Title of the Book / Name of the Author / Name of the Publisher / Edition and Year
1. Davy Cielen, Arno D. B. Meysman and Mohamed Ali, "Introducing Data Science", Manning Publications, 2016.
2. Robert S. Witte and John S. Witte, "Statistics", Eleventh Edition, Wiley Publications, 2017. (Units II and III)
Reference Books
1. Joel Grus, "Data Science from Scratch", 2nd Edition, O'Reilly / Shroff Publishers and Distributors Pvt. Ltd., 2019. ISBN-13: 978-9352138326.
2. Tim Grobmann and Mario Dobler, "Data Visualization Workshop", Packt Publishing. ISBN 9781800568112.
3. Foster Provost and Tom Fawcett, "Data Science for Business" (https://www.amazon.com/Data-Science-Business-Data-Analytic-Thinking-ebook/dp/B00E6EQ3X4).
4. Wes McKinney, "Python for Data Analysis" (https://www.oreilly.com/library/view/python-for-data/9781491957653/).
5. Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow".
6. Jake VanderPlas, "Python Data Science Handbook", O'Reilly, 2016. (Units IV and V)
7. Cathy O'Neil and Rachel Schutt, "Doing Data Science: Straight Talk from the Frontline", O'Reilly, 2014.
8. Jiawei Han, Micheline Kamber and Jian Pei, "Data Mining: Concepts and Techniques", Third Edition, 2011. ISBN 0123814790.
9. Mohammed J. Zaki and Wagner Meira Jr., "Data Mining and Analysis: Fundamental Concepts and Algorithms", Cambridge University Press, 2014.
10. Jojo Moolayil, "Smarter Decisions: The Intersection of IoT and Data Science", Packt, 2016.
11. Jure Leskovec, Anand Rajaraman and Jeffrey David Ullman, "Mining of Massive Datasets", 2nd Edition, Cambridge University Press, 2014.
12. Brian Godsey, "Think Like a Data Scientist", Manning Publications, 2017.
Course Outcomes: At the end of the course, the student will be able to:
CO1: Describe the data science terminologies and understand the basics of data science. (RBT indicator: R, U; Level 1)
CO2: Apply the data science process to real-time scenarios and explain how data is collected, managed and stored for data science. (RBT indicator: R, U; Level 2)
CO3: Analyze data visualization tools, and build and prepare data for use with a variety of statistical methods and models. (RBT indicator: Ap; Level 3)
CO4: Apply data storage and processing with frameworks, and analyze data using various visualization techniques. (RBT indicator: Ap, An; Level 4)
CO5: Apply visualization libraries in Python to interpret and explore data, use the Python libraries for data wrangling, and choose contemporary models, such as machine learning and AI techniques, to solve practical problems. (RBT indicator: Ap, An; Level 4)
Program Outcomes:
PO1 - Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and computer science and business systems to the solution of complex engineering and societal problems.
PO2 - Problem analysis: Identify, formulate, review research literature, and analyze complex engineering and business problems, reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
PO3 - Design/development of solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for public health and safety, and cultural, societal, and environmental considerations.
PO4 - Conduct investigations of complex problems: Use research-based knowledge and research methods, including design of experiments, analysis and interpretation of data, and synthesis of information, to provide valid conclusions.
PO5 - Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools, including prediction and modeling, to complex engineering activities with an understanding of the limitations.
PO6 - The engineer and society: Apply reasoning informed by contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to professional engineering and business practices.
PO7 - Environment and sustainability: Understand the impact of professional engineering solutions in business, societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.
PO8 - Ethics: Apply ethical principles and commit to professional ethics and responsibilities and the norms of engineering and business practices.
PO9 - Individual and team work: Function effectively as an individual, and as a member or leader in diverse teams and in multidisciplinary settings.
PO10 - Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
PO11 - Project management and finance: Demonstrate knowledge and understanding of engineering, business and management principles and apply these to one's own work, as a member and leader in a team, to manage projects in multidisciplinary environments.
PO12 - Life-long learning: Recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.
CO/PO Mapping (columns: PO1-PO12, PSO1-PSO3):
CO1: x x x x
CO2: x
CO3: x x x
CO4: x
CO5: -
RBT Levels: Understand, Apply, Analyse, Evaluate, Create