Unit-4
DATA SCIENCE
Data science is a field that combines statistics, computer science, and domain expertise to
extract insights and knowledge from data. It involves using various techniques such as
machine learning, data visualization, and statistical modeling to analyze and interpret complex
data sets.
4. Healthcare: Analyzing medical data to improve patient outcomes and reduce costs.
8. Climate change analysis: Analyzing data to understand and mitigate climate change.
Some more applications of Data Science are Fraud and Risk Detection, Genetics and Genomics,
Search Engines, Targeted Advertising, Website Recommendations, Airline Route Planning,
Advanced Image Recognition, Speech Recognition, Gaming Platforms and so on.
Advantages:
- Better decision-making: Data science helps businesses and organizations make better-
informed decisions.
- Improved efficiency: Data science can help companies and organizations streamline their
operations by identifying inefficiencies and areas for improvement.
- Enhanced customer experience: Data science can help businesses and organizations tailor
their products and services to better meet the needs of their target audience.
- Predictive analytics: Data science can be used for predictive analytics, which involves using
data to forecast future trends and outcomes.
Disadvantages:
- Data privacy concerns: Collecting and analyzing personal data creates a risk that private
information is exposed or misused.
- Bias in data: Data can be biased due to many factors, such as the selection of the data or the
way it is collected.
- Misinterpretation of data: Data science involves complex statistical analysis, which can
sometimes lead to misinterpretation of the data.
- Data quality issues: Data science depends on the quality of the data used. If the data is not
accurate, complete or consistent, it can lead to incorrect results.
Types of Data
In data science, tabular datasets are commonly used and can be stored in various formats
depending on the specific requirements of the project.
Comma Separated Values (CSV): A plain text format for tabular data where each line
represents a row and values are separated by commas.
Spreadsheet: A digital sheet consisting of cells arranged in rows and columns, used for
organizing, analyzing, and storing data.
Structured Query Language (SQL): A programming language used to manage and
manipulate relational databases.
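To make the formats above concrete, here is a minimal sketch using only the Python standard library. The student names, column names and the table name `students` are made-up sample values, not part of any real dataset. It parses a small piece of CSV text and then loads the same rows into an in-memory SQL table.

```python
import csv
import io
import sqlite3

# Hypothetical tabular data in CSV form: a header row, then one row per line.
csv_text = "name,age\nAsha,21\nRavi,23\nMeena,22\n"

# Parse the CSV text; each row becomes a dict keyed by the header names.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Load the same rows into an in-memory relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO students (name, age) VALUES (?, ?)",
    [(r["name"], int(r["age"])) for r in rows],
)

# Query the table with SQL: names of students older than 21, sorted by name.
cursor = conn.execute("SELECT name FROM students WHERE age > 21 ORDER BY name")
older = [name for (name,) in cursor]
print(older)  # ['Meena', 'Ravi']
conn.close()
```

Note that CSV stores every value as text, so the age has to be converted with `int()` before it is inserted into the INTEGER column.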
Data Access
NumPy
NumPy stands for Numerical Python. It is the fundamental package for mathematical and
logical operations on arrays in Python.
Ex- 1D, 2D and n-D arrays
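A short sketch of the array types mentioned above, assuming NumPy is installed. The values are arbitrary sample numbers.

```python
import numpy as np

a1 = np.array([1, 2, 3])                # 1D array, shape (3,)
a2 = np.array([[1, 2, 3], [4, 5, 6]])   # 2D array, shape (2, 3)
a3 = np.zeros((2, 3, 4))                # n-D array (here 3D), filled with zeros

print(a1.ndim, a2.ndim, a3.ndim)        # 1 2 3

# Mathematical operation: applied to every element at once.
doubled = a1 * 2                        # array([2, 4, 6])

# Logical operation: produces a boolean array of the same shape.
mask = a2 > 3

# Aggregation over the whole array.
total = a2.sum()                        # 21
```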
Pandas
Pandas is a Python library mainly used for data manipulation and analysis. It provides data
structures and operations for handling numerical tables and time series.
Ex- Series (1D) and DataFrame (2D)
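The two structures can be sketched as follows, assuming Pandas is installed. The names, marks and cities are made-up sample values.

```python
import pandas as pd

# Series: a one-dimensional labelled array.
marks = pd.Series([78, 85, 92], index=["Asha", "Ravi", "Meena"], name="marks")

# DataFrame: a two-dimensional table with named columns.
df = pd.DataFrame(
    {
        "name": ["Asha", "Ravi", "Meena"],
        "marks": [78, 85, 92],
        "city": ["Pune", "Delhi", "Chennai"],
    }
)

# Typical manipulation: filter rows by a condition, then summarise a column.
top = df[df["marks"] > 80]
mean_marks = df["marks"].mean()

print(marks["Ravi"])          # 85
print(top["name"].tolist())   # ['Ravi', 'Meena']
print(mean_marks)             # 85.0
```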
Matplotlib
Matplotlib is a free and open-source library. It is a useful tool in Python for creating two-
dimensional plots of arrays and is widely used for data visualization.
Ex- Bar graph, histogram, pie chart, etc.
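A minimal sketch of the three chart types named above, assuming Matplotlib is installed. The data values and the output file name `charts.png` are made up for illustration. The `Agg` backend is chosen here only so that the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend: draw to a file, not a window
import matplotlib.pyplot as plt

# Hypothetical sample data.
categories = ["A", "B", "C"]
counts = [5, 3, 8]
values = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# One figure with three side-by-side plots.
fig, (ax_bar, ax_hist, ax_pie) = plt.subplots(1, 3, figsize=(12, 4))

ax_bar.bar(categories, counts)        # bar graph: one bar per category
ax_bar.set_title("Bar graph")

ax_hist.hist(values, bins=5)          # histogram: how often each range occurs
ax_hist.set_title("Histogram")

ax_pie.pie(counts, labels=categories, autopct="%1.0f%%")  # share of the total
ax_pie.set_title("Pie chart")

fig.tight_layout()
fig.savefig("charts.png")             # write the figure to an image file
```

In an interactive session, `plt.show()` can be used instead of `fig.savefig(...)` to open the figure in a window.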
(TWO MARKS QUESTIONS)
1. What is the goal of data visualization?
Answer: To communicate insights and patterns in data through graphical representations.
2. Which algorithm is used for finding the most important features in a dataset?
Answer: Principal Component Analysis (PCA), which finds the directions (principal components) that capture the most variance in the data.
3. What is the name of the technique used to handle missing values in a dataset?
Answer: Imputation.
5. What is the name of the popular data science tool used for data manipulation and
analysis?
Answer: Pandas.
9. What is the name of the technique used to reduce the dimensionality of a dataset?
Answer: Feature selection (feature extraction methods such as PCA also reduce dimensionality).
10. Which machine learning model is used for predicting continuous outcomes?
Answer: Regression.
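Three of the answers above (imputation, PCA and regression) can be illustrated with a small NumPy sketch. The hours/marks data is made up. Mean imputation is only one simple choice among several, and PCA is computed here directly from the singular value decomposition rather than with a dedicated library.

```python
import numpy as np

# Hypothetical dataset: hours studied and marks, with one missing mark (NaN).
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
marks = np.array([52.0, 54.0, np.nan, 58.0, 60.0])

# Imputation: replace the missing value with the mean of the observed values.
filled = np.where(np.isnan(marks), np.nanmean(marks), marks)

# Regression: fit a straight line marks = slope * hours + intercept,
# then predict a continuous outcome for a new input.
slope, intercept = np.polyfit(hours, filled, 1)
predicted = slope * 6.0 + intercept

# PCA: centre the data, then take the SVD. The rows of `components` are the
# principal directions; squared singular values give the variance captured.
X = np.column_stack([hours, filled])
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
explained = singular_values ** 2 / np.sum(singular_values ** 2)

# Dimensionality reduction: keep only the first principal component (2D -> 1D).
reduced = X_centered @ components[0]

print(filled)
print(round(float(predicted), 2))
print(explained.round(3))
```

Because the sample marks lie exactly on a line, the first principal component captures essentially all of the variance, so the 2D data can be reduced to 1D with almost no loss.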
2. Explain the concept of overfitting in machine learning and how it can be prevented.
Answer: Overfitting occurs when a model is too complex and performs well on training
data but poorly on new data. Techniques to prevent overfitting include:
- Regularization (L1, L2)
- Early stopping
- Data augmentation
- Cross-validation
- Ensemble methods
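Overfitting can be observed with a small experiment, sketched below under made-up assumptions: the data is a straight line plus random noise, and every third point is held out for validation. A simple model (degree 1) and a very flexible model (degree 9) are fitted to the same training points and their errors compared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a straight line plus noise.
x = np.linspace(-1.0, 1.0, 30)
y = 2.0 * x + rng.normal(scale=0.3, size=x.size)

# Hold out every third point for validation; train on the rest.
val_mask = np.arange(x.size) % 3 == 0
x_train, y_train = x[~val_mask], y[~val_mask]
x_val, y_val = x[val_mask], y[val_mask]


def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the points (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


results = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(f"degree {degree}: train MSE = {results[degree][0]:.4f}, "
          f"validation MSE = {results[degree][1]:.4f}")
```

The degree 9 model always has a training error at least as low as the degree 1 model, because it can bend to follow the noise. A validation error that is clearly higher than the training error is the typical sign of overfitting, which is why held-out data is used to compare models.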
5. Explain the concept of bias and variance in machine learning and how they affect model
performance.
Answer: Bias is systematic error caused by overly simple assumptions, while variance is the
model's sensitivity to changes in the training data. High bias leads to underfitting, while high
variance leads to overfitting. The goal is to
balance bias and variance to achieve optimal model performance.
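The trade-off can be estimated by simulation. The sketch below is an illustration under made-up assumptions: the true function is x cubed, the noise level is 0.3, and predictions are compared at the single point x0 = 1.0. Each model is refitted on many freshly generated noisy datasets, then the squared bias and the variance of its predictions are measured.

```python
import numpy as np

rng = np.random.default_rng(42)


def true_f(x):
    """Hypothetical true relationship between input and output."""
    return x ** 3


x = np.linspace(-1.0, 1.0, 20)   # fixed training inputs
x0 = 1.0                         # point at which predictions are compared
noise_sd = 0.3
n_trials = 500


def bias_variance(degree):
    """Refit a polynomial on many noisy datasets; return (bias^2, variance) at x0."""
    preds = np.empty(n_trials)
    for t in range(n_trials):
        y = true_f(x) + rng.normal(scale=noise_sd, size=x.size)
        coeffs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coeffs, x0)
    bias_sq = (preds.mean() - true_f(x0)) ** 2   # systematic error
    variance = preds.var()                       # sensitivity to the data
    return float(bias_sq), float(variance)


simple = bias_variance(1)     # straight line: too rigid for a cubic
flexible = bias_variance(7)   # degree 7: can follow the noise

print("degree 1: bias^2 = %.4f, variance = %.4f" % simple)
print("degree 7: bias^2 = %.4f, variance = %.4f" % flexible)
```

The straight line shows the larger squared bias (underfitting), while the degree 7 polynomial shows the larger variance (a tendency to overfit).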