FOUNDATIONS OF DATA SCIENCE L T P C
3 0 2 4
COURSE OBJECTIVES:
To understand the basic concepts of data analysis
To acquire skills in data preparation and preprocessing steps
To build mathematical skills in statistics
To learn the tools and packages in Python for data science
To acquire knowledge of data interpretation and visualization techniques
To gain knowledge of recent trends in data science
UNIT- I INTRODUCTION 8
Applications: Search engines, Image recognition
Need for data science – benefits and uses – facets of data – data science process – setting the research
goal – retrieving data – cleansing, integrating, and transforming data – exploratory data analysis – building
the models – presenting findings and building applications
UNIT- II DESCRIBING DATA 9
Applications: Speech recognition, Recommendation systems
Frequency distributions – outliers – relative frequency distributions – cumulative frequency distributions –
frequency distributions for nominal data – interpreting distributions – graphs – averages – normal
distributions – z scores – normal curve problems – finding proportions – finding scores – more about z scores –
interpretation of r² – multiple regression equations – regression toward the mean – statistical metrics with
Python.
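Illustrative sketch for the z score and normal curve topics of this unit, using only the Python standard library. The exam scores are hypothetical values invented for the example, and the helper name `z_score` is an assumption rather than part of any prescribed package.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical exam scores, invented for illustration only.
scores = [52, 60, 64, 68, 70, 72, 76, 80, 88]

mu = mean(scores)        # arithmetic mean of the scores
sigma = pstdev(scores)   # population standard deviation


def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from the mean."""
    return (x - mu) / sigma


# Convert every raw score to a z score.
z_scores = [z_score(x, mu, sigma) for x in scores]

# Finding proportions: share of a standard normal curve below z = 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)
proportion_below = std_normal.cdf(1.0)

# Finding scores: the z value that cuts off the lowest 95 percent.
z_at_95 = std_normal.inv_cdf(0.95)

print(round(proportion_below, 4))
print(round(z_at_95, 3))
```

The z scores of a complete data set always sum to zero, which is a quick check that the conversion was done with the same mean and standard deviation.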
UNIT- III INTRODUCTION TO NUMPY 8
Applications: Machine Learning, Scientific Computing
Data types in Python – basics of NumPy arrays – computations on NumPy arrays – universal functions –
aggregations: min, max, and everything in between – computation on arrays: broadcasting – comparisons,
masks, and Boolean logic – fancy indexing – sorting values in NumPy arrays – fast sorting – sorting along
rows or columns – partial sorts – k-nearest neighbors – NumPy’s structured arrays.
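Illustrative sketch touching several topics of this unit (aggregations, broadcasting, Boolean masks, fancy indexing, sorting, and partial sorts). The array contents are hypothetical and were chosen only to keep the results easy to check by hand.

```python
import numpy as np

# Hypothetical 3x4 array, invented for illustration.
data = np.array([[7, 2, 9, 4],
                 [1, 8, 3, 6],
                 [5, 0, 2, 9]])

# Aggregations along an axis.
col_min = data.min(axis=0)   # minimum of each column
row_max = data.max(axis=1)   # maximum of each row

# Broadcasting: the per-column means (shape (4,)) are subtracted
# from every row of the (3, 4) array.
centered = data - data.mean(axis=0)

# Comparison producing a Boolean mask, then masking.
mask = data > 5
large_values = data[mask]

# Fancy indexing: select rows 2 and 0, in that order.
picked_rows = data[[2, 0]]

# Sorting along rows, and a partial sort that moves the two
# smallest values of each row to the front (in no fixed order).
sorted_rows = np.sort(data, axis=1)
two_smallest = np.partition(data, 2, axis=1)[:, :2]

print(col_min, row_max)
print(large_values)
```

`np.partition` does less work than a full sort, which is why it is the usual choice when only the k smallest values are needed, as in a k-nearest neighbors search.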
UNIT- IV DATA MANIPULATION WITH PANDAS 8
Applications: Financial Analysis, Data Visualization
Pandas objects – data indexing and selection – operating on data in pandas – handling missing data –
hierarchical indexing – combining datasets: concat and append – combining datasets: merge and join –
aggregation and grouping – pivot tables – vectorized string operations – working with time series –
high-performance pandas: eval() and query().
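Illustrative sketch covering a few topics of this unit (missing data, grouping, merge, pivot tables, vectorized string operations, and query()). Both data frames, including the region and manager names, are hypothetical and exist only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records, invented for illustration.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "East"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1"],
    "revenue": [120.0, np.nan, 90.0, 110.0, 75.0],
})
managers = pd.DataFrame({
    "region": ["North", "South", "East"],
    "manager": ["Asha", "Ravi", "Meena"],
})

# Handling missing data: replace the missing revenue with 0.
filled = sales.fillna({"revenue": 0.0})

# Aggregation and grouping: total revenue per region.
totals = filled.groupby("region")["revenue"].sum()

# Combining datasets: attach each region's manager with a left merge.
merged = filled.merge(managers, on="region", how="left")

# Pivot table: regions as rows, quarters as columns.
pivot = filled.pivot_table(values="revenue", index="region",
                           columns="quarter", aggfunc="sum")

# Vectorized string operation on a whole column.
upper_regions = filled["region"].str.upper()

# query(): keep rows whose revenue exceeds 100.
high = filled.query("revenue > 100")

print(totals)
print(pivot)
```

Filling the missing value with 0 is only one possible choice; dropping the row or imputing a mean would give different totals.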
UNIT- V PYTHON FOR DATA VISUALIZATION 7
Applications: Climate Change Analysis, Sports Data Analysis
Visualization with Matplotlib – line plots – scatter plots – visualizing errors – density and contour plots –
histograms, binnings, and density – three-dimensional plotting – geographic data – data analysis using
statsmodels and seaborn – graph plotting using Plotly – interactive data visualization using Bokeh.
UNIT- VI RECENT TRENDS IN DATA SCIENCE 5
Healthcare: drug development, virtual healthcare assistance – Finance: fraud detection – Marketing:
targeted advertising, customer interactions – Transportation: driverless cars, airline routing.
TOTAL: 45 PERIODS
PRACTICAL EXERCISES:
1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas
packages.
2. Working with NumPy arrays
3. Working with Pandas data frames
4. Reading data from text files, Excel and the web and exploring various commands for doing
descriptive analytics on the Iris data set.
5. Use the diabetes data set from UCI and the Pima Indians Diabetes data set for performing the
following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
6. Apply and explore various plotting functions on UCI data sets.
a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three dimensional plotting
7. Visualizing Geographic Data with Basemap
8. Importing Data from External Source Using Python
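Illustrative sketch of the univariate measures named in exercise 5a, written with the standard library so that each formula is visible. The values are hypothetical and are not taken from the UCI or Pima Indians Diabetes data sets; the helpers `central_moment`, `skewness`, and `excess_kurtosis` are assumed names that use population (not sample-corrected) moments.

```python
from collections import Counter
from statistics import mean, median, mode, pstdev, pvariance

# Hypothetical measurements, invented for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]


def central_moment(data, k):
    """Return the k-th population central moment of data."""
    m = mean(data)
    return sum((x - m) ** k for x in data) / len(data)


def skewness(data):
    """Third standardized moment (population form)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Fourth standardized moment minus 3 (population form)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


summary = {
    "frequency": dict(Counter(values)),  # how often each value occurs
    "mean": mean(values),
    "median": median(values),
    "mode": mode(values),
    "variance": pvariance(values),
    "std_dev": pstdev(values),
    "skewness": skewness(values),
    "kurtosis": excess_kurtosis(values),
}

for name, value in summary.items():
    print(name, value)
```

Library routines for skewness and kurtosis may apply bias corrections by default or on request, so their results can differ slightly from these population formulas on small samples.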
SOFTWARE REQUIREMENTS
Python, NumPy, SciPy, Matplotlib, pandas, statsmodels, seaborn, Plotly, Bokeh
TOTAL: 30 PERIODS
TOTAL: 75 PERIODS
COURSE OUTCOMES:
At the end of this course, the students will be able to:
CO1: Apply the skills of data inspecting and cleansing.
CO2: Determine relationships and dependencies in data using statistics.
CO3: Represent useful information using mathematical skills.
CO4: Handle data using the primary Python tools for data science.
CO5: Apply knowledge of data description and visualization using tools.
CO6: Be aware of the current scope and limitations of data science and its societal implications.
TEXT BOOKS
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning
Publications, 2016. (First two chapters for Unit I)
2. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.
(Chapters 1–7 for Unit II)
3. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016. (Parts of chapters 2–4 for
Units III, IV, and V)
REFERENCES
1. Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press, 2014.
2. Sanjeev J. Wagh, Manisha S. Bhende, Anuradha D. Thakare, “Fundamentals of Data
Science”, CRC Press, 2022.
3. Chirag Shah, “A Hands-On Introduction to Data Science”, Cambridge University Press.
COs & POs MAPPING
CO   PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11
CO1  2    2    1    2    2    -    -    -    1    1     1
CO2  2    1    -    1    1    -    -    -    2    1     1
CO3  2    2    1    2    2    1    1    -    1    2     1
CO4  3    2    2    1    2    -    -    -    1    1     2
CO5  2    2    1    2    2    -    -    -    1    1     1
CO6  2    2    1    2    2    -    -    -    1    1     1