Data Science Unit I
S.Y.
MR.PRAMOD JADHAO
DATA SCIENCE (UNIT – I)
Big Data
What is Big Data?
Big Data literally means large amounts of data. The idea behind big data is that a large body of data allows
useful inferences that were not possible with smaller datasets. Extremely large data sets can be analyzed
computationally to reveal patterns, trends, and associations that are otherwise not obvious or easy to
identify.
Why is everyone interested in Big Data?
Definition: Big data refers to large volumes of structured, semi-structured, and unstructured data that require
advanced tools and techniques for processing and analysis.
Big data is everywhere!
Every time you do something on the web, data is collected. Every time you buy something from an
e-commerce site, your data is collected. Whenever you go to a store, data is collected at the point of sale.
When you make bank transactions, that data is recorded, and when you use social networks like Facebook
or Twitter, that data is collected too. These are mostly social data, but the same thing is starting to happen
with real engineering plants: real-time data is collected from plants all over the world. In addition, more
sophisticated work such as molecular simulations generates tons of data that is also collected and
stored.
Characteristics:
Difference between big data and little data
Integration with AI and Machine Learning: Data science heavily intersects with artificial intelligence
(AI) and machine learning (ML). AI and ML algorithms are utilized for predictive analytics, pattern
recognition, natural language processing (NLP), and computer vision tasks, among others.
Big Data Infrastructure: With the proliferation of big data, data science projects often involve managing
and analysing large datasets using distributed computing frameworks like Hadoop and Spark, as well as
cloud-based solutions provided by Amazon Web Services (AWS), Google Cloud Platform (GCP), and
Microsoft Azure.
Focus on Ethical and Responsible AI: There is a growing emphasis on ethical considerations in data
science and AI applications. Issues such as bias in algorithms, data privacy, transparency, and
accountability are gaining attention, leading to frameworks and guidelines being developed to address these
concerns.
Emerging Technologies: Data science is embracing emerging technologies such as edge computing,
Internet of Things (IoT), and blockchain, which generate new types of data and require innovative
approaches for analysis and integration.
Interdisciplinary Collaboration: Data science teams often consist of professionals with diverse
backgrounds in statistics, mathematics, computer science, domain expertise (e.g., healthcare, finance), and
business acumen. Collaborative efforts are essential for successful implementation and deployment of data-
driven solutions.
Demand for Data Professionals: There is a high demand for skilled data scientists, data engineers, and
analysts across industries. Organizations are investing in building data science capabilities to gain
competitive advantage and drive growth.
Education and Training: Educational institutions and online platforms offer a wide range of courses and
programs in data science, catering to individuals seeking to enter or advance their careers in this field.
Continuous learning and upskilling are essential due to the rapid pace of technological change.
Visualization and Communication: Effective data visualization and communication skills are crucial for
data scientists to convey insights and recommendations to stakeholders, aiding in decision-making
processes.
Regulatory Landscape: Data science practices are influenced by regulatory frameworks such as GDPR
(General Data Protection Regulation) in Europe and similar data protection laws globally. Compliance with
these regulations is essential for ethical data handling and user privacy.
Structured data?
Structured data, typically categorized as quantitative data, is highly organized and easily decipherable
by machine learning algorithms. Developed at IBM in 1974, Structured Query Language (SQL) is the
language most commonly used to manage structured data. By using a relational (SQL) database, business
users can quickly input, search and manipulate structured data.
Pros and cons of structured data
Examples of structured data include dates, names, addresses, credit card numbers, etc. Their benefits are
tied to ease of use and access, while liabilities revolve around data inflexibility:
Pros
Easily used by machine learning (ML) algorithms: The specific and organized architecture of
structured data eases manipulation and querying of ML data.
Easily used by business users: Structured data does not require an in-depth understanding of different
types of data and how they function. With a basic understanding of the topic relative to the data, users
can easily access and interpret the data.
Accessible by more tools: Since structured data predates unstructured data, there are more tools
available for using and analyzing structured data.
Cons
Limited usage: Data with a predefined structure can only be used for its intended purpose, which
limits its flexibility and usability.
Limited storage options: Structured data is generally stored in data storage systems with rigid
schemas (e.g., “data warehouses”). Therefore, changes in data requirements necessitate an update of
all structured data, which leads to a massive expenditure of time and resources.
Order ID  Customer Name  Product Name  Quantity  Unit Price  Total Amount  Order Date
1001      John Doe       Laptop        2         $1200       $2400         2024-07-10
1002      Jane Smith     Smartphone    1         $800        $800          2024-07-11
1003      David Brown    Tablet        3         $500        $1500         2024-07-12
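As a minimal sketch of how such a table might be stored and queried with SQL, the example below loads the three orders into an in-memory SQLite database using Python's standard sqlite3 module. The table name `orders` and the column names are assumptions made for illustration; they do not come from any particular system.

```python
import sqlite3

# In-memory database; the table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        order_id      INTEGER PRIMARY KEY,
        customer_name TEXT    NOT NULL,
        product_name  TEXT    NOT NULL,
        quantity      INTEGER NOT NULL,
        unit_price    REAL    NOT NULL,
        order_date    TEXT    NOT NULL
    )
    """
)

# The three rows from the example table above.
rows = [
    (1001, "John Doe", "Laptop", 2, 1200.0, "2024-07-10"),
    (1002, "Jane Smith", "Smartphone", 1, 800.0, "2024-07-11"),
    (1003, "David Brown", "Tablet", 3, 500.0, "2024-07-12"),
]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?)", rows)

# The total amount is computed from quantity and unit price instead of stored.
query = """
    SELECT order_id, customer_name, quantity * unit_price AS total_amount
    FROM orders
    WHERE quantity * unit_price > 1000
    ORDER BY order_id
"""
large_orders = conn.execute(query).fetchall()
for order in large_orders:
    print(order)

conn.close()
```

Because every row follows the same fixed schema, a single declarative query can filter and compute over all orders. This is the ease of querying that the pros above refer to.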
Unstructured data?
Unstructured data, typically categorized as qualitative data, cannot be processed and analyzed via
conventional data tools and methods. Since unstructured data does not have a predefined data model, it is
best managed in non-relational (NoSQL) databases. Another way to manage unstructured data is to use data
lakes to preserve it in raw form.
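The lack of a predefined data model can be illustrated with a small sketch. The records below are hypothetical. Each one carries its own fields, is stored in raw form as one JSON document per line, and is given structure only when it is read. This mimics the idea behind document stores and data lakes and is not the API of any specific product.

```python
import json

# Hypothetical records: each document has its own fields, with no shared schema.
documents = [
    {"type": "tweet", "user": "alice", "text": "Loving the new phone!"},
    {"type": "sensor", "device_id": "t-17", "readings": [21.5, 21.7, 22.0]},
    {"type": "email", "subject": "Invoice", "body": "Please find the invoice attached.", "attachments": 1},
]

# Keep the data in raw form: one JSON document per line.
raw_lines = [json.dumps(doc) for doc in documents]

# Structure is applied only at read time: collect the field names seen per type.
fields_by_type = {}
for line in raw_lines:
    doc = json.loads(line)
    fields_by_type[doc["type"]] = sorted(key for key in doc if key != "type")

print(fields_by_type)
```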
The importance of unstructured data is rapidly increasing. Recent projections indicate that unstructured data
is over 80% of all enterprise data, while 95% of businesses prioritize unstructured data management.
Pros and cons of unstructured data
Examples of unstructured data include text, mobile activity, social media posts, Internet of Things (IoT)
sensor data, etc. Their benefits involve advantages in format, speed and storage, while liabilities revolve
around expertise and available resources:
Pros
Native format: Unstructured data, stored in its native format, remains undefined until needed. Its
adaptability increases file formats in the database, which widens the data pool and enables data
scientists to prepare and analyze only the data they need.
Fast accumulation rates: Since there is no need to predefine the data, it can be collected quickly and
easily.
Data lake storage: Allows for massive storage and pay-as-you-use pricing, which cuts costs and eases
scalability.
Cons
Requires expertise: Due to its undefined/non-formatted nature, data science expertise is required to
prepare and analyze unstructured data. This is beneficial to data analysts but alienates unspecialized
business users who may not fully understand specialized data topics or how to utilize their data.
Specialized tools: Specialized tools are required to manipulate unstructured data, which limits
product choices for data managers.
Unstructured data tools
MongoDB: Uses flexible documents to process data for cross-platform applications and services.
DynamoDB: Delivers single-digit millisecond performance at any scale via built-in security, in-
memory caching and backup and restore.
Hadoop: Provides distributed processing of large data sets using simple programming models and
no formatting requirements.
Azure: Enables agile cloud computing for creating and managing apps through Microsoft’s data
centers.
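The "simple programming model" associated with Hadoop is commonly the map-reduce pattern. The sketch below imitates that pattern in plain Python on a single machine. It uses no Hadoop API, and the function names `map_phase`, `shuffle` and `reduce_phase` are chosen here only for illustration. In a real cluster, each chunk would be processed on a different node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a block of a large file stored on a separate node.
chunks = ["big data is everywhere", "data is collected", "Big plants collect data"]

pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```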
Use cases for unstructured data
Data mining: Enables businesses to use unstructured data to identify consumer behavior, product
sentiment, and purchasing patterns to better accommodate their customer base.
Predictive data analytics: Alerts businesses to important activity ahead of time so they can plan
properly and adjust to significant market shifts.
Chatbots: Perform text analysis to route customer questions to the appropriate answer sources.
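As a rough illustration of the chatbot use case, the sketch below routes a free-text question to an answer source by counting keyword matches. The routing table `ROUTES`, its keywords and the function `route_question` are hypothetical. Production chatbots typically rely on trained language models instead of fixed keyword lists.

```python
import re

# Hypothetical routing table: answer sources mapped to trigger keywords.
ROUTES = {
    "billing": {"invoice", "refund", "payment", "charge"},
    "shipping": {"delivery", "tracking", "shipped", "package"},
    "technical": {"error", "crash", "login", "password"},
}


def route_question(question, default="general"):
    """Return the answer source whose keywords best match the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    scores = {source: len(words & keywords) for source, keywords in ROUTES.items()}
    best = max(scores, key=scores.get)
    # Fall back to a default source when no keyword matches at all.
    return best if scores[best] > 0 else default


print(route_question("Can I get a refund for my last payment?"))
print(route_question("Where is my package? The tracking page is empty."))
print(route_question("What are your opening hours?"))
```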
Categorical data?
Qualitative variables measure attributes that can only be described as a property, not as a magnitude. The
political affiliation of a person, the nationality of a person, the favorite color of a person, and the blood
group of a patient can only be recorded as qualitative attributes. Often these variables have a
limited number of possible values and take exactly one of them; i.e., the value is one of the
given categories.
Therefore, these are commonly known as categorical variables. The possible values can be numbers,
letters, names, or any symbol.
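Since a categorical value is always one of a fixed set of categories, it is often converted to indicator (one-hot) form before analysis. The sketch below shows one way this might be done. The `BLOOD_GROUPS` list and the `one_hot` helper are illustrative names, not part of any library.

```python
# The fixed set of categories for this variable (illustrative).
BLOOD_GROUPS = ["A", "B", "AB", "O"]


def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 indicators."""
    if value not in categories:
        raise ValueError(f"{value!r} is not one of {categories}")
    return [1 if value == category else 0 for category in categories]


patients = ["O", "A", "AB", "O"]
encoded = [one_hot(group, BLOOD_GROUPS) for group in patients]

for group, vector in zip(patients, encoded):
    print(group, vector)
```

Each encoded vector contains exactly one 1, which reflects that a categorical variable assumes only one of its possible outcomes.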
Quantitative data?
A quantitative variable records attributes that can be measured by a magnitude or size; i.e., that are
quantifiable. Variables measuring temperature, weight, mass, the height of a person, or the annual income
of a household are quantitative variables. Not only are all the values of these variables numbers, but each
number also conveys a magnitude.
Quantitative data belong to one of the following three types: ordinal, interval, or ratio.
Categorical data always belong to the nominal type. (Many textbooks instead group ordinal data with
categorical data, because ordered categories have a rank but no measurable distance between them.) These
types are formally known as levels of measurement, and are closely related to the way the measurements
are made and the scale of each measurement.
Since the form of the data in the two categories is different, different techniques and methods are employed
when gathering, analyzing, and describing them.
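A brief sketch of this difference, using made-up sample values: categorical data are usually described with counts and the mode, whereas quantitative data support arithmetic summaries such as the mean and median. Only Python's standard library is used.

```python
from collections import Counter
from statistics import mean, median, mode

# Made-up sample data for illustration.
blood_groups = ["O", "A", "O", "B", "AB", "O", "A"]   # categorical (nominal)
heights_cm = [158.0, 172.5, 165.0, 180.0, 169.5]      # quantitative (ratio)

# Categorical data: count how often each category occurs and find the mode.
group_counts = Counter(blood_groups)
most_common_group = mode(blood_groups)

# Quantitative data: arithmetic summaries are meaningful.
mean_height = mean(heights_cm)
median_height = median(heights_cm)

print(group_counts)
print(most_common_group, mean_height, median_height)
```

Averaging blood groups would be meaningless, which is why the choice of summary depends on the type of data.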