
Unit-4

Data science combines statistics, computer science, and domain expertise to extract insights from data using techniques like machine learning and data visualization. Its applications range from predictive maintenance and fraud detection to healthcare and climate change analysis. It offers advantages such as better decision-making and improved efficiency, but also poses challenges like data privacy concerns and potential biases. The document outlines common data formats, tools for data access, and key concepts in data science, including the data science workflow and machine learning principles.

UNIT- 4 DATA SCIENCE

DATA SCIENCE

Data science is a field that combines statistics, computer science, and domain expertise to
extract insights and knowledge from data. It involves using various techniques such as
machine learning, data visualization, and statistical modeling to analyze and interpret
complex data sets.

• Data Science: works on numeric and alpha-numeric data
• Computer Vision: works on images and visual data
• Natural Language Processing: works on textual and speech-based data

The uses of data science are numerous and diverse, including:

1. Predictive maintenance: Predicting equipment failures and scheduling maintenance.

2. Fraud detection: Identifying unusual patterns and anomalies to detect fraud.

3. Recommendation systems: Personalized product recommendations based on user behavior.

4. Healthcare: Analyzing medical data to improve patient outcomes and reduce costs.

5. Customer segmentation: Grouping customers based on behavior and preferences.

6. Supply chain optimization: Optimizing logistics and inventory management.

7. Image and speech recognition: Developing AI-powered recognition systems.

8. Climate change analysis: Analyzing data to understand and mitigate climate change.

9. Business intelligence: Informing business decisions with data-driven insights.

AI Notes by Sushant Thakre


10. Social media analysis: Analyzing social media data to understand public opinions and
trends.

Other applications of data science include fraud and risk detection, genetics and genomics,
search engines, targeted advertising, website recommendations, airline route planning,
advanced image recognition, speech recognition, and gaming platforms.

Advantages:

- Better decision-making: Data science helps businesses and organizations make better-
informed decisions.

- Improved efficiency: Data science can help companies and organizations streamline their
operations by identifying inefficiencies and areas for improvement.

- Enhanced customer experience: Data science can help businesses and organizations tailor
their products and services to better meet the needs of their target audience.

- Predictive analytics: Data science can be used for predictive analytics, which involves using
data to forecast future trends and outcomes.

- Innovation and new discoveries: Data science can lead to new discoveries and innovations by
revealing previously unknown relationships and insights in data.

Disadvantages:

- Data privacy concerns: Collecting and analyzing data can expose personal or sensitive
information.

- Bias in data: Data can be biased due to many factors, such as the selection of the data or the
way it is collected.

- Misinterpretation of data: Data science involves complex statistical analysis, which can
sometimes lead to misinterpretation of the data.

- Data quality issues: Data science depends on the quality of the data used. If the data is not
accurate, complete or consistent, it can lead to incorrect results.

- Cost and time: Data science can be time-consuming and expensive.

Types of Data

In data science, tabular datasets are commonly used and can be stored in various formats
depending on the specific requirements of the project.

Some commonly used formats for storing tabular data are:

• Comma Separated Values (CSV): A plain text format for tabular data where each line
represents a row and values are separated by commas.
• Spreadsheet: A digital sheet consisting of cells arranged in rows and columns, used for
organizing, analyzing, and storing data.
• Structured Query Language (SQL): A language used to manage and manipulate relational
databases, in which tabular data is stored as tables.
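As a minimal sketch of how two of these formats look from code, the example below reads a small CSV text with Pandas and then stores and queries the same rows through SQL using Python's built-in sqlite3 module. The sample rows, the column names, and the table name `students` are made up for illustration.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample data: each line is a row, values separated by commas.
csv_text = "name,marks\nAsha,82\nRavi,74\nMeera,91\n"

# CSV: parse the text into a table (io.StringIO stands in for a file on disk).
df = pd.read_csv(io.StringIO(csv_text))

# SQL: copy the same rows into an in-memory SQLite table and query them.
conn = sqlite3.connect(":memory:")
df.to_sql("students", conn, index=False)
top = pd.read_sql_query(
    "SELECT name, marks FROM students WHERE marks > 80 ORDER BY marks DESC",
    conn,
)
conn.close()

print(df)
print(top)
```

A spreadsheet file could be loaded in a similar way, but that usually needs an extra reader package, so it is left out of this sketch.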

Data Access

To use the collected data in a program, we can use the Python programming language, which
provides packages such as NumPy, Pandas, and Matplotlib that help us access and work with
structured (tabular) data within the code.

NumPy
NumPy stands for Numerical Python. It is the fundamental package for mathematical and
logical operations on arrays in Python.
Ex: 1D, 2D, and n-D arrays
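A short sketch of 1D, 2D, and n-D arrays with one mathematical and one logical operation. The values are arbitrary and chosen only for illustration.

```python
import numpy as np

a1 = np.array([1, 2, 3, 4])             # 1D array
a2 = np.array([[1, 2, 3], [4, 5, 6]])   # 2D array (2 rows, 3 columns)
a3 = np.zeros((2, 3, 4))                # 3D array filled with zeros

total = a1.sum()      # mathematical operation over all elements
doubled = a2 * 2      # element-wise arithmetic, no explicit loop needed
mask = a1 > 2         # logical operation, gives an array of True/False

print(a1.ndim, a2.shape, a3.ndim)
print(total, doubled, mask)
```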

Pandas
Pandas is a Python library mainly used for data manipulation and analysis. It provides data
structures and operations for handling numerical tables and time series.
Ex: Series (1D) and DataFrame (2D)
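A minimal sketch of the two structures, using made-up student data (the names, marks, and sections are hypothetical):

```python
import pandas as pd

# Series: a 1D labelled column of values.
marks = pd.Series([82, 74, 91], index=["Asha", "Ravi", "Meera"], name="marks")

# DataFrame: a 2D table made of named columns.
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera"],
    "marks": [82, 74, 91],
    "section": ["A", "B", "A"],
})

passed = df[df["marks"] >= 80]                         # filter rows by a condition
mean_by_section = df.groupby("section")["marks"].mean()  # summarize per group

print(marks)
print(passed)
print(mean_by_section)
```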

Matplotlib
Matplotlib is a free and open-source library. It is a useful tool in Python for creating
two-dimensional plots of arrays and is one of the most important tools for data visualization.
Ex: Bar graph, histogram, pie chart
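A small sketch that draws a bar graph and a histogram side by side. The data and the output file name `marks.png` are made up for illustration, and the non-interactive "Agg" backend is an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so render to a file
import matplotlib.pyplot as plt

# Hypothetical data for the two charts.
subjects = ["Maths", "Science", "English"]
marks = [82, 74, 91]
scores = [55, 62, 62, 70, 71, 74, 78, 82, 85, 91]

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

ax_bar.bar(subjects, marks)        # one bar per category
ax_bar.set_title("Bar graph")
ax_bar.set_ylabel("Marks")

ax_hist.hist(scores, bins=4)       # counts of values falling in 4 ranges
ax_hist.set_title("Histogram")
ax_hist.set_xlabel("Score")

fig.tight_layout()
fig.savefig("marks.png")           # hypothetical output file name
```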

(TWO MARKS QUESTIONS)
1. What is the goal of data visualization?
Answer: To communicate insights and patterns in data through graphical representations.

2. Which algorithm is used for finding the most important features in a dataset?
Answer: Principal Component Analysis (PCA), which finds the combinations of features that
capture the most variance.

3. What is the name of the technique used to handle missing values in a dataset?
Answer: Imputation.

4. Which type of machine learning model is used for recommendation systems?
Answer: Collaborative Filtering.

5. What is the name of the popular data science tool used for data manipulation and
analysis?
Answer: Pandas.

6. Which technique is used to evaluate the performance of a machine learning model?
Answer: Cross-validation.

7. What is the name of the algorithm used for clustering data?
Answer: K-Means.

8. Which type of data is used to train a machine learning model?
Answer: Training data.

9. What is the name of the technique used to reduce the dimensionality of a dataset?
Answer: Feature selection (feature extraction methods such as PCA are also used).

10. Which machine learning model is used for predicting continuous outcomes?
Answer: Regression.
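Three of the techniques named in the answers above (imputation, regression, and PCA) can be sketched in a few lines. This is only an illustration on tiny made-up numbers: mean imputation is one of several imputation strategies, the regression is a simple straight-line fit, and the PCA is computed by hand with a singular value decomposition rather than with a dedicated library.

```python
import numpy as np
import pandas as pd

# Imputation: replace a missing value with the mean of the known values.
hours = pd.Series([1.0, 2.0, np.nan, 3.0])
hours_filled = hours.fillna(hours.mean())

# Regression: fit a straight line to predict a continuous outcome.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([52.0, 54.0, 56.0, 58.0])
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * 5.0 + intercept      # prediction for an unseen x = 5

# PCA: center the data, then use SVD to find the directions of most variance.
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]])
Xc = X - X.mean(axis=0)
_, s, vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)          # share of variance per component
projected = Xc @ vt[0]                   # data reduced from 2 columns to 1

print(hours_filled.tolist())
print(slope, intercept, predicted)
print(explained)
```

Note that PCA builds new combined features (components) instead of ranking the original columns, which is why it reduces dimensionality.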

(FIVE MARKS QUESTIONS)

1. Describe the steps involved in the data science workflow.
Answer: The data science workflow typically involves:
- Problem definition and hypothesis formation
- Data collection and cleaning
- Data exploration and visualization
- Modeling and evaluation
- Deployment and maintenance

2. Explain the concept of overfitting in machine learning and how it can be prevented.
Answer: Overfitting occurs when a model is too complex and fits the training data so
closely, noise included, that it performs well on training data but poorly on new data.
Techniques to prevent overfitting include:
- Regularization (L1, L2)
- Early stopping
- Data augmentation
- Cross-validation
- Ensemble methods
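The effect can be sketched with polynomial curve fitting. In the example below, a degree-3 and a degree-9 polynomial are fitted to ten noisy samples of a sine curve, and L2 regularization (ridge) is then applied to the degree-9 model. The target function, the fixed alternating noise pattern (used instead of random noise so the run is reproducible, and which exaggerates the effect), and the penalty value `lam` are all assumptions made for illustration.

```python
import numpy as np

def true_fn(x):
    # Assumed "real" relationship that the models try to learn.
    return np.sin(2 * np.pi * x)

# Ten training points with a fixed +0.3 / -0.3 alternating noise pattern.
x_train = np.linspace(0.0, 1.0, 10)
noise = 0.3 * np.where(np.arange(10) % 2 == 0, 1.0, -1.0)
y_train = true_fn(x_train) + noise

# Noise-free test points stand in for "new data".
x_test = np.linspace(0.0, 1.0, 200)
y_test = true_fn(x_test)

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

# Fit a simple and a very flexible polynomial; store (train error, test error).
errors = {}
for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (
        mse(np.polyval(coeffs, x_train), y_train),
        mse(np.polyval(coeffs, x_test), y_test),
    )

# L2 regularization: the same degree-9 features, with a penalty on large weights.
lam = 1e-3
phi_train = np.vander(x_train, 10)
w_ridge = np.linalg.solve(
    phi_train.T @ phi_train + lam * np.eye(10), phi_train.T @ y_train
)
ridge_test_mse = mse(np.vander(x_test, 10) @ w_ridge, y_test)

print("degree 3 (train, test):", errors[3])
print("degree 9 (train, test):", errors[9])
print("degree 9 + L2 (test):  ", ridge_test_mse)
```

The degree-9 model passes through every training point, so its training error is almost zero, yet its test error is the worst of the three. The penalty keeps the same model from chasing the noise.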

3. What is feature engineering, and why is it important in data science?
Answer: Feature engineering is the process of selecting and transforming variables to
create new features that improve model performance. It is important because it:
- Helps reduce dimensionality
- Improves model interpretability
- Enhances model performance
- Reduces noise and redundant correlations between features
- Facilitates feature selection
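A minimal sketch of feature engineering with Pandas, assuming a made-up table of orders. The column names and the chosen derived features (order total, month, weekend flag, one column per city) are hypothetical examples, not a fixed recipe.

```python
import pandas as pd

# Hypothetical raw data as it might be collected.
raw = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-02-10"]),
    "price": [200.0, 150.0, 400.0],
    "quantity": [2, 1, 4],
    "city": ["Pune", "Nagpur", "Pune"],
})

# Transform existing variables into new, more informative features.
features = pd.DataFrame({
    "total": raw["price"] * raw["quantity"],            # combine two columns
    "month": raw["order_date"].dt.month,                # extract part of a date
    "is_weekend": raw["order_date"].dt.dayofweek >= 5,  # Saturday=5, Sunday=6
})

# Turn the text column into one indicator column per city (one-hot encoding).
features = features.join(pd.get_dummies(raw["city"], prefix="city"))

print(features)
```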

4. Describe the difference between supervised, unsupervised, and reinforcement learning.
Answer: Supervised learning trains on labeled data to predict outcomes. Unsupervised
learning works on unlabeled data to discover patterns. Reinforcement learning involves
an agent learning from interactions with an environment to maximize rewards.
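The first two kinds of learning can be contrasted on the same small dataset. In this sketch, the supervised part is a nearest-centroid classifier that uses the labels, and the unsupervised part is a basic K-Means loop that never sees them. The points, the labels, and the choice of starting centers are assumptions for illustration, and reinforcement learning is not shown.

```python
import numpy as np

# Hypothetical 2D points forming two well-separated groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])   # labels, used only by the supervised part

# Supervised: learn one centroid per labeled class, then predict new points.
class_centroids = np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(points):
    # Distance from every point to every class centroid; pick the nearest.
    dists = np.linalg.norm(points[:, None, :] - class_centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

new_points = np.array([[0.9, 1.0], [5.1, 5.0]])
predicted = predict(new_points)

# Unsupervised: K-Means with k=2 groups the same points without any labels.
centers = X[[0, 3]].copy()          # assumed starting centers
for _ in range(10):
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = dists.argmin(axis=1)   # assign each point to its nearest center
    centers = np.array([X[assign == k].mean(axis=0) for k in range(2)])

print("supervised predictions:", predicted)
print("unsupervised clusters: ", assign)
```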

5. Explain the concept of bias and variance in machine learning and how they affect model
performance.
Answer: Bias is the systematic error that comes from overly simple assumptions, while
variance is the model's sensitivity to changes in the training data. High bias leads to
underfitting, while high variance leads to overfitting. The goal is to balance bias and
variance to achieve optimal model performance.
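One way to see both quantities is a small simulation: fit the same model to many noisy versions of a dataset, then measure how far the average prediction is from the truth (bias) and how much the predictions change between datasets (variance). The sine target, the noise level, the number of trials, and the two polynomial degrees below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed underlying relationship.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 15)
x_test = np.linspace(0.0, 1.0, 50)
n_trials = 200
noise_std = 0.3

def bias_variance(degree):
    # Fit a polynomial of the given degree to many noisy training sets.
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        y_train = true_fn(x_train) + rng.normal(0.0, noise_std, x_train.size)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[t] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_fn(x_test)) ** 2))  # squared bias
    variance = float(np.mean(preds.var(axis=0)))                  # spread across fits
    return bias_sq, variance

simple_bias, simple_var = bias_variance(1)      # straight line: too rigid
flexible_bias, flexible_var = bias_variance(5)  # degree 5: more flexible

print("degree 1: bias^2 =", simple_bias, "variance =", simple_var)
print("degree 5: bias^2 =", flexible_bias, "variance =", flexible_var)
```

The straight line shows the higher bias (it underfits the curve), while the more flexible model shows the higher variance.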
