
DATA SCIENCE UNIT-I

Mr. Pramod Jadhao

Unit I: Introduction to Data Science


What Is Data Science?
Data science is the domain of study that deals with vast volumes of data using modern tools and techniques
to find unseen patterns, derive meaningful information, and make business decisions. Data science uses
complex machine learning algorithms to build predictive models.
The data used for analysis can come from many different sources and be presented in various formats.
The Data Science Lifecycle / Data Science Process
Now that you know what data science is, let us focus on the data science lifecycle. The data science lifecycle consists of five distinct stages, each with its own tasks (a short code sketch follows the list):
1. Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction. This stage involves
gathering raw structured and unstructured data.
2. Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data Architecture.
This stage covers taking the raw data and putting it in a form that can be used.
3. Process: Data Mining, Clustering/Classification, Data Modeling, Data Summarization. Data
scientists take the prepared data and examine its patterns, ranges, and biases to determine how useful
it will be in predictive analysis.
4. Analyze: Exploratory/Confirmatory, Predictive Analysis, Regression, Text Mining, Qualitative
Analysis. Here is the real meat of the lifecycle. This stage involves performing the various analyses
on the data.
5. Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision Making. In this
final step, analysts prepare the analyses in easily readable forms such as charts, graphs, and reports.
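To make the stages concrete, here is a minimal sketch of the lifecycle in Python with pandas. The file name "sales_raw.csv" and the "region" and "revenue" columns are hypothetical stand-ins, not a prescribed workflow.

```python
# A minimal sketch of the five lifecycle stages on a toy dataset.
# Assumptions: "sales_raw.csv" exists with "region" and "revenue"
# columns; pandas and matplotlib are installed.
import pandas as pd
import matplotlib.pyplot as plt

# 1. Capture: acquire raw data (a CSV file stands in for any source).
raw = pd.read_csv("sales_raw.csv")

# 2. Maintain: clean and stage the raw data into a usable form.
clean = raw.dropna().drop_duplicates()

# 3. Process: examine ranges and patterns to judge usefulness.
print(clean.describe())

# 4. Analyze: a simple exploratory aggregation.
summary = clean.groupby("region")["revenue"].mean()

# 5. Communicate: present the result in an easily readable form.
summary.plot(kind="bar", title="Mean revenue by region")
plt.show()
```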
Prerequisites for Data Science
Here are some of the technical concepts you should know about before starting to learn data science.
1. Machine Learning: Machine learning is the backbone of data science. Data Scientists need to have a
solid grasp of ML in addition to basic knowledge of statistics.
2. Modeling: Mathematical models enable you to make quick calculations and predictions based on
what you already know about the data. Modeling is also a part of Machine Learning and involves
identifying which algorithm is the most suitable to solve a given problem and how to train these
models.
3. Statistics: Statistics are at the core of data science. A sturdy handle on statistics can help you extract
more intelligence and obtain more meaningful results.
4. Programming: Some level of programming is required to execute a successful data science project.
The most common programming languages are Python and R. Python is especially popular because it is easy to learn and supports multiple libraries for data science and ML (a short example follows this list).
5. Databases: A capable data scientist needs to understand how databases work, how to manage them,
and how to extract data from them.
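As promised above, here is a small sketch tying the Modeling, Statistics, and Programming prerequisites together: fitting and evaluating a simple predictive model with scikit-learn. The synthetic data keeps it self-contained; it is an illustration, not a recommended model.

```python
# Fitting and evaluating a simple predictive model on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))           # one synthetic feature
y = 3.0 * X.ravel() + rng.normal(0, 1.0, 200)   # linear signal + noise

# Hold out part of the data so the evaluation is honest.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```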
Need for Data Science
The principal purpose of Data Science is to find patterns within data. It uses various statistical techniques to analyze and draw insights from the data. From data extraction through wrangling and pre-processing, a Data Scientist must examine the data systematically.
Then comes the responsibility of making predictions from the data. The goal of a Data Scientist is to derive conclusions from the data, and through these conclusions the Data Scientist is able to assist companies in making smarter business decisions.


Big Data
What is Big Data?
Big Data literally means large amounts of data. Big data is the pillar behind the idea that one can make useful inferences from a large body of data that were not possible before with smaller datasets. Extremely large data sets can be analyzed computationally to reveal patterns, trends, and associations that are otherwise not transparent or easy to identify.
Why is everyone interested in Big Data?
Definition: Big Data refers to large volumes of structured, semi-structured, and unstructured data that require advanced tools and techniques for processing and analysis.
Big data is everywhere!
Every time you do something on the web, data is collected; every time you buy something from an e-commerce site, your data is collected. Whenever you go to a store, data is collected at the point of sale; when you do bank transactions, that data is there; when you use social networks like Facebook or Twitter, that data is collected. These are mostly social data, but the same thing is starting to happen with real engineering plants: real-time data is collected from plants all over the world. Beyond these, much more sophisticated simulations, such as molecular simulations, generate tons of data that is also collected and stored.

Characteristics:

o Volume: Massive amounts of data are generated and stored.
o Velocity: Data is generated and processed rapidly.
o Variety: Diverse types of data, including text, images, videos, sensor data, etc.
o Veracity: Concerns the accuracy and reliability of data.

How much data is Big Data?

 Google processes 20 petabytes (PB) per day (2008)
 Facebook has 2.5 PB of user data + 15 TB per day (2009)
 eBay has 6.5 PB of user data + 50 TB per day (2009)
 CERN's Large Hadron Collider (LHC) generates 15 PB a year
So one of the reasons for the acceleration of data science in recent years is the enormous volume of data (i.e., Big Data) currently available and being generated. Not only are huge amounts of data being collected about many aspects of the world and our lives, but we concurrently have the rise of inexpensive computing. This has formed the perfect storm: we have rich data and the tools to analyze it. Advancing computer memory capacities, more enhanced software, more competent processors, and, now, more numerous data scientists with the skills to put all of this to use and answer questions with the data have come together. And that is the big reason why we need data science in the future.

Difference between Big Data and Little Data

Feature | Little Data | Big Data
Technology | Traditional | Modern
Collection | Generally obtained in an organized manner and inserted into a database | Collected by pipelines with queues such as AWS Kinesis or Google Pub/Sub to balance high-speed data
Volume | Data in the range of tens or hundreds of gigabytes | Size of data is more than terabytes
Analysis Areas | Data marts (analysts) | Clusters (data scientists), data marts (analysts)
Quality | Contains less noise, as data is collected in a controlled manner | Usually, the quality of data is not guaranteed
Processing | Requires batch-oriented processing pipelines | Has both batch and stream processing pipelines
Database | SQL | NoSQL
Velocity | A regulated and constant flow of data; data aggregation is slow | Data arrives at extremely high speeds; large volumes are aggregated in a short time
Structure | Structured data in tabular format with a fixed schema (relational) | A wide variety of data sets, including tabular data, text, audio, images, video, logs, JSON, etc. (non-relational)
Scalability | Usually vertically scaled | Mostly based on horizontally scaling architectures, which give more versatility at a lower cost
Query Language | SQL only | Python, R, Java, SQL
Hardware | A single server is sufficient | Requires more than one server
Value | Business intelligence, analysis, and reporting | Complex data mining techniques for pattern finding, recommendation, prediction, etc.
Optimization | Data can be optimized manually (human-powered) | Requires machine learning techniques for data optimization
Storage | Storage within enterprises, local servers, etc. | Usually requires distributed storage systems on the cloud or in external file systems
People | Data Analysts, Database Administrators, and Data Engineers | Data Scientists, Data Analysts, Database Administrators, and Data Engineers
Security | User privileges, data encryption, hashing, etc. | Much more complicated; best practices include data encryption, cluster network isolation, strong access control protocols, etc.
Nomenclature | Database, Data Warehouse, Data Mart | Data Lake
Infrastructure | Predictable resource allocation; mostly vertically scalable hardware | More agile infrastructure with horizontally scalable hardware


The Current Scenario of Data Science


Expansion of Applications: Data science is increasingly being applied across diverse domains such as
healthcare, finance, retail, manufacturing, and transportation. It plays a crucial role in optimizing processes,
improving decision-making, and driving innovation.

Integration with AI and Machine Learning: Data science heavily intersects with artificial intelligence
(AI) and machine learning (ML). AI and ML algorithms are utilized for predictive analytics, pattern
recognition, natural language processing (NLP), and computer vision tasks, among others.

Big Data Infrastructure: With the proliferation of big data, data science projects often involve managing and analyzing large datasets using distributed computing frameworks like Hadoop and Spark, as well as cloud-based solutions provided by Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure.
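For a flavor of what analyzing a large dataset with Spark looks like, here is a minimal PySpark sketch. It assumes the pyspark package is installed; the input file "events.json" and the "event_type" column are hypothetical.

```python
# A minimal PySpark sketch of distributed analysis.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-demo").getOrCreate()

# Spark reads the file in parallel across cluster nodes (or local cores).
events = spark.read.json("events.json")   # semi-structured input

# Aggregations are distributed automatically over the partitions.
counts = events.groupBy("event_type").agg(F.count("*").alias("n"))
counts.show()

spark.stop()
```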

Focus on Ethical and Responsible AI: There is a growing emphasis on ethical considerations in data
science and AI applications. Issues such as bias in algorithms, data privacy, transparency, and
accountability are gaining attention, leading to frameworks and guidelines being developed to address these
concerns.

Emerging Technologies: Data science is embracing emerging technologies such as edge computing, the Internet of Things (IoT), and blockchain, which generate new types of data and require innovative approaches for analysis and integration.

Interdisciplinary Collaboration: Data science teams often consist of professionals with diverse
backgrounds in statistics, mathematics, computer science, domain expertise (e.g., healthcare, finance), and
business acumen. Collaborative efforts are essential for successful implementation and deployment of data-
driven solutions.

Demand for Data Professionals: There is a high demand for skilled data scientists, data engineers, and
analysts across industries. Organizations are investing in building data science capabilities to gain
competitive advantage and drive growth.

Education and Training: Educational institutions and online platforms offer a wide range of courses and
programs in data science, catering to individuals seeking to enter or advance their careers in this field.
Continuous learning and upskilling are essential due to the rapid pace of technological change.

Visualization and Communication: Effective data visualization and communication skills are crucial for
data scientists to convey insights and recommendations to stakeholders, aiding in decision-making
processes.

Regulatory Landscape: Data science practices are influenced by regulatory frameworks such as GDPR
(General Data Protection Regulation) in Europe and similar data protection laws globally. Compliance with
these regulations is essential for ethical data handling and user privacy.

What is structured data?
Structured data — typically categorized as quantitative data — is highly organized and easily decipherable
by machine learning algorithms. Developed by IBM in 1974, structured query language (SQL) is the
programming language used to manage structured data. By using a relational (SQL) database, business users
can quickly input, search and manipulate structured data.
Pros and cons of structured data
Examples of structured data include dates, names, addresses, credit card numbers, etc. Their benefits are tied to ease of use and access, while liabilities revolve around data inflexibility:
Pros
 Easily used by machine learning (ML) algorithms: The specific and organized architecture of
structured data eases manipulation and querying of ML data.
 Easily used by business users: Structured data does not require an in-depth understanding of different
types of data and how they function. With a basic understanding of the topic relative to the data, users
can easily access and interpret the data.
 Accessible by more tools: Since structured data predates unstructured data, there are more tools
available for using and analyzing structured data.
Cons
 Limited usage: Data with a predefined structure can only be used for its intended purpose, which
limits its flexibility and usability.
 Limited storage options: Structured data is generally stored in data storage systems with rigid
schemas (e.g., “data warehouses”). Therefore, changes in data requirements necessitate an update of
all structured data, which leads to a massive expenditure of time and resources.

Example of structured data in a tabular format:

Consider a simple table representing sales data for a fictional company:

Order ID | Customer Name | Product Name | Quantity | Unit Price | Total Amount | Order Date
1001 | John Doe | Laptop | 2 | $1200 | $2400 | 2024-07-10
1002 | Jane Smith | Smartphone | 1 | $800 | $800 | 2024-07-11
1003 | David Brown | Tablet | 3 | $500 | $1500 | 2024-07-12
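The same table can be created and queried with SQLite (one of the tools listed below), showing how a fixed schema makes structured data easy to input, search, and manipulate. This is a minimal sketch using Python's built-in sqlite3 module.

```python
# Recreating the sales table above in SQLite and querying it with SQL.
import sqlite3

con = sqlite3.connect(":memory:")   # throwaway in-memory database
con.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer TEXT, product TEXT,
    quantity INTEGER, unit_price REAL, order_date TEXT)""")

rows = [
    (1001, "John Doe",    "Laptop",     2, 1200.0, "2024-07-10"),
    (1002, "Jane Smith",  "Smartphone", 1,  800.0, "2024-07-11"),
    (1003, "David Brown", "Tablet",     3,  500.0, "2024-07-12"),
]
con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?)", rows)

# The fixed schema makes aggregate queries straightforward.
for order_id, total in con.execute(
        "SELECT order_id, quantity * unit_price FROM orders"):
    print(order_id, total)
```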

Structured data tools


 OLAP: Performs high-speed, multidimensional data analysis from unified, centralized data stores.
 SQLite: Implements a self-contained, server-less, zero-configuration, transactional relational
database engine.
 MySQL: Embeds data into mass-deployed software, particularly mission-critical, heavy-load production systems.
 PostgreSQL: Supports SQL and JSON querying as well as high-tier programming languages (C/C++, Java, Python, etc.).
Use cases for structured data
 Customer relationship management (CRM): CRM software runs structured data through analytical
tools to create datasets that reveal customer behavior patterns and trends.
 Online booking: Hotel and ticket reservation data (e.g., dates, prices, destinations, etc.) fits the
“rows and columns” format indicative of the pre-defined data model.
 Accounting: Accounting firms or departments use structured data to process and record financial transactions.


What is unstructured data?
Unstructured data, typically categorized as qualitative data, cannot be processed and analyzed via
conventional data tools and methods. Since unstructured data does not have a predefined data model, it is
best managed in non-relational (NoSQL) databases. Another way to manage unstructured data is to use data
lakes to preserve it in raw form.
The importance of unstructured data is rapidly increasing. Recent projections indicate that unstructured data
is over 80% of all enterprise data, while 95% of businesses prioritize unstructured data management.
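Here is a minimal sketch of what "no predefined data model" means in practice: heterogeneous records kept in their native JSON form, the way a document store or data lake preserves raw data. The record fields and file name below are made up for illustration.

```python
# "Schemaless" storage: each record carries its own fields, with no
# table structure defined up front.
import json

documents = [
    {"type": "tweet",  "text": "Great product!", "likes": 12},
    {"type": "sensor", "device": "thermo-7", "reading": 21.4},
    {"type": "image",  "path": "img/001.jpg", "tags": ["cat", "outdoor"]},
]

# Write the raw documents as JSON Lines, data-lake style.
with open("lake.jsonl", "w") as f:
    for doc in documents:
        f.write(json.dumps(doc) + "\n")

# Data stays raw until a consumer decides how to interpret it.
with open("lake.jsonl") as f:
    records = [json.loads(line) for line in f]
tweets = [r for r in records if r["type"] == "tweet"]
print(tweets)
```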
Pros and cons of unstructured data
Examples of unstructured data include text, mobile activity, social media posts, Internet of Things (IoT)
sensor data, etc. Their benefits involve advantages in format, speed and storage, while liabilities revolve
around expertise and available resources:
Pros
 Native format: Unstructured data, stored in its native format, remains undefined until needed. Its adaptability allows many file formats in the data store, which widens the data pool and enables data scientists to prepare and analyze only the data they need.
 Fast accumulation rates: Since there is no need to predefine the data, it can be collected quickly and
easily.
 Data lake storage: Allows for massive storage and pay-as-you-use pricing, which cuts costs and eases
scalability.
Cons
 Requires expertise: Due to its undefined/non-formatted nature, data science expertise is required to
prepare and analyze unstructured data. This is beneficial to data analysts but alienates unspecialized
business users who may not fully understand specialized data topics or how to utilize their data.
 Specialized tools: Specialized tools are required to manipulate unstructured data, which limits
product choices for data managers.
Unstructured data tools
 MongoDB: Uses flexible documents to process data for cross-platform applications and services.
 DynamoDB: Delivers single-digit millisecond performance at any scale via built-in security, in-memory caching, and backup and restore.
 Hadoop: Provides distributed processing of large data sets using simple programming models and
no formatting requirements.
 Azure: Enables agile cloud computing for creating and managing apps through Microsoft’s data
centers.
Use cases for unstructured data
 Data mining: Enables businesses to use unstructured data to identify consumer behavior, product
sentiment, and purchasing patterns to better accommodate their customer base.
 Predictive data analytics: Alerts businesses to important activity ahead of time so they can properly plan for and adjust to significant market shifts.
 Chatbots: Perform text analysis to route customer questions to the appropriate answer sources.
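To illustrate the chatbot use case above, here is a toy keyword-matching router in Python. The routes and keywords are hypothetical; real chatbots use far more sophisticated text analysis.

```python
# A toy router: send a customer question to an answer source by
# counting keyword overlaps. Routes and keywords are made up.
ROUTES = {
    "billing":  {"invoice", "refund", "charge", "payment"},
    "shipping": {"delivery", "track", "shipping", "package"},
    "support":  {"error", "broken", "crash", "help"},
}

def route(question: str) -> str:
    words = set(question.lower().split())
    # Pick the route whose keyword set overlaps the question the most.
    best = max(ROUTES, key=lambda r: len(ROUTES[r] & words))
    return best if ROUTES[best] & words else "general"

print(route("Where can I track my order?"))          # -> shipping
print(route("I was charged twice for one invoice"))  # -> billing
```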


What is categorical data?
Qualitative variables measure attributes that can be given only as a property of the variable. The political affiliation of a person, the nationality of a person, a person's favorite color, and the blood group of a patient can only be measured using qualitative attributes of each variable. Often these variables have a limited number of possibilities and assume only one of the possible outcomes; i.e., the value is one of the given categories. Therefore, these are commonly known as categorical variables. The possible values can be numbers, letters, names, or any symbol.

What is quantitative data?
A quantitative variable records attributes that can be measured by magnitude or size; i.e., they are quantifiable. Variables measuring temperature, weight, mass, the height of a person, or the annual income of a household are quantitative variables. Not only are all the values of these variables numbers, but each number also carries a sense of magnitude.
Quantitative data belong to one of the three following types: ordinal, interval, or ratio. Categorical data always belong to the nominal type. The types mentioned above are formally known as levels of measurement, and they are closely related to the way the measurements are made and the scale of each measurement.

Since the form of the data in the two categories is different, different techniques and methods are employed when gathering, analyzing, and describing them.

What is the Difference Between Categorical and Quantitative data?


Definitions of Categorical and Quantitative data:
 Quantitative data are information that has a sensible meaning when referring to its magnitude.
 Categorical data are often information that takes values from a given set of categories or groups.
Characteristics of Categorical and Quantitative data:
Class of measurement:
 Quantitative data belong to ordinal, interval, or ratio classes of measurements.
 Categorical data belong to the nominal class of measurements.
Methods:
 Methods used to analyze quantitative data are different from the methods used for categorical data; even if the principles are the same, the application has significant differences.
Analysis:
 Quantitative data are analyzed using statistical methods in descriptive statistics, regression,
time series, and many more.
 For categorical data, usually descriptive methods and graphical methods are employed. Some non-parametric tests are also used (a short example follows).
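A small pandas sketch of this split: magnitude-based summaries for a quantitative column versus frequency counts for a categorical one. The data below is made up.

```python
# Quantitative columns get numeric summaries; categorical columns
# only support counts and proportions.
import pandas as pd

df = pd.DataFrame({
    "blood_group": ["A", "O", "B", "O", "AB", "A"],  # categorical
    "height_cm":   [172, 165, 180, 158, 175, 169],   # quantitative
})
df["blood_group"] = df["blood_group"].astype("category")

# Quantitative: magnitude-based statistics are meaningful.
print(df["height_cm"].describe())       # mean, std, min, max, quartiles

# Categorical: only frequencies make sense.
print(df["blood_group"].value_counts())
```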


Roles & Responsibilities of a Data Scientist


 Management: The Data Scientist plays a modest managerial role, supporting the construction of the base of futuristic and technical abilities within the Data and Analytics field in order to assist various planned and ongoing data analytics projects.
 Analytics: The Data Scientist represents a scientific role where he plans, implements, and assesses
high-level statistical models and strategies for application in the business’s most complex issues.
The Data Scientist develops econometric and statistical models for various problems including
projections, classification, clustering, pattern analysis, sampling, simulations, and so forth.
 Strategy/Design: The Data Scientist performs a vital role in the advancement of innovative
strategies to understand the business’s consumer trends and management as well as ways to solve
difficult business problems, for instance, the optimization of product fulfillment and entire profit.
 Collaboration: The role of the Data Scientist is not a solitary one; in this position, the Data Scientist collaborates with senior data scientists to communicate obstacles and findings to relevant stakeholders in an effort to drive business performance and decision-making.
 Knowledge: The Data Scientist also takes leadership to explore different technologies and tools
with the vision of creating innovative data-driven insights for the business at the most agile pace
feasible. In this situation, the Data Scientist also uses initiative in assessing and utilizing new and
enhanced data science methods for the business, which he delivers to senior management for approval.
 Other Duties: A Data Scientist also performs related tasks and duties as assigned by the Senior Data Scientist, Head of Data Science, Chief Data Officer, or the Employer.

Difference Between Data Scientist, Data Analyst, and Data Engineer


Data Scientist, Data Engineer, and Data Analyst are the three most common careers in data science. So let us understand what a data scientist does by comparing the role with these similar jobs.

Aspect | Data Scientist | Data Analyst | Data Engineer
Focus | The futuristic display of data | The optimization of scenarios, for example how an employee can enhance the company's product growth | Optimization techniques and the construction of data in a conventional manner; the purpose is continuously advancing data consumption
Work | Applies both supervised and unsupervised learning to data, say regression and classification of data, neural networks, etc. | Formation and cleaning of raw data, interpretation and visualization of data to perform the analysis and the technical summary of the data | Frequently operates at the back end, using optimized machine learning algorithms for storing data and preparing it most accurately
Skills | Python, R, SQL, Pig, SAS, Apache Hadoop, Java, Perl, Spark | Python, R, SQL, SAS | MapReduce, Hive, Pig, Hadoop techniques


Some Inspiring Data Scientists


The variety of areas in which data science is used can be seen by looking at examples of data scientists.
 Hilary Mason: She is the co-founder of Fast Forward Labs, a machine learning company acquired by Cloudera, a data science company. She is a Data Scientist at Accel. Broadly, she works with data to answer questions about mining the web and to learn how people interact with each other through social media.
 Nate Silver: He is one of the most prominent data scientists or statisticians in the world today. He
is the founder of FiveThirtyEight. FiveThirtyEight is a website that applies statistical analysis to
tell compelling stories about elections, politics, sports, science, and lifestyle. He utilizes huge
amounts of public data to predict a diversity of topics; most prominently, he predicts who will win elections in the U.S., and he has an extraordinary track record for accuracy in doing so.
 Daryl Morey: He is the general manager of a US basketball team, the Houston Rockets. He was
awarded the job as GM based on his bachelor’s degree in computer science and his M.B.A. from
M.I.T.

