
UNIT – 1

INTRODUCTION TO DATA SCIENCE


1. What is Data Science?
Data science is the study of data that helps us derive useful insights for business decision-making. Data Science is all about using tools, techniques, and creativity to uncover insights hidden within data. It combines math, computer science, and domain expertise to tackle real-world challenges in a variety of fields.
Data Science processes raw data to solve business problems and even make predictions about future trends or requirements.
Data Science is a multidisciplinary field that involves the use of statistical and
computational methods to extract insights and knowledge from data. To analyze and comprehend
large data sets, it uses techniques from computer science, mathematics, and statistics.

Data science involves these key steps:


• Data Collection: Gathering raw data from various sources, such as databases, sensors, or
user interactions.
• Data Cleaning: Ensuring the data is accurate, complete, and ready for analysis.
• Data Analysis: Applying statistical and computational methods to identify patterns, trends,
or relationships.
• Data Visualization: Creating charts, graphs, and dashboards to present findings clearly.
• Decision-Making: Using insights to inform strategies, create solutions, or predict outcomes.
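The short sketch below walks through these steps end to end in Python with pandas and matplotlib; the sales.csv file and its column names (region, units, price) are hypothetical and used only for illustration.

import pandas as pd
import matplotlib.pyplot as plt

# Data Collection: gather raw data from a source (here, a CSV file)
df = pd.read_csv("sales.csv")

# Data Cleaning: drop incomplete rows and fix an obvious type issue
df = df.dropna()
df["units"] = df["units"].astype(int)

# Data Analysis: derive revenue and summarize it per region
df["revenue"] = df["units"] * df["price"]
summary = df.groupby("region")["revenue"].sum().sort_values(ascending=False)

# Data Visualization: present the findings as a simple bar chart
summary.plot(kind="bar", title="Revenue by region")
plt.tight_layout()
plt.show()

# Decision-Making: the summary and chart inform, for example,
# which regions to prioritize for marketing spend.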

Key components of data science include:


1. Foundational Concepts: Introduction to basic concepts in data science, including data
types, data manipulation, data cleaning, and exploratory data analysis.
2. Programming Languages: Instruction in programming languages commonly used in data
science, such as Python or R. Students learn how to write code to analyze and manipulate
data, create visualizations, and build machine learning models.
3. Statistical Methods: Coverage of statistical techniques and methods used in data analysis,
hypothesis testing, regression analysis, and probability theory.
4. Machine Learning: Introduction to machine learning algorithms, including supervised
learning, unsupervised learning, and deep learning. Students learn how to apply machine
learning techniques to solve real-world problems and make predictions from data.
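As a minimal illustration of the machine learning component, the sketch below trains a supervised classifier with scikit-learn on its bundled Iris dataset and evaluates it on held-out data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out part of the labeled data to check how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Supervised learning: fit a model on labeled examples
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

# Make predictions on unseen data and measure accuracy
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))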

Data Science Skills


All these data science activities are performed by data scientists. The essential skills required for data scientists include:
• Programming Languages: Python, R, SQL.
• Mathematics: Linear Algebra, Statistics, Probability.
• Machine Learning: Supervised and unsupervised learning, deep learning basics.
• Data Manipulation: Pandas, NumPy, data wrangling techniques.
• Data Visualization: Matplotlib, Seaborn, Tableau, Power BI.
• Big Data Tools: Hadoop, Spark, Hive.
• Databases: SQL, NoSQL, data querying and management.
• Cloud Computing: AWS, Azure, Google Cloud.
• Version Control: Git, GitHub, GitLab.
• Domain Knowledge: Industry-specific expertise for problem-solving.
• Soft Skills: Communication, teamwork, and critical thinking.

Data Science Tools and Libraries


Various tools are required to analyze data, build models, and derive insights. Here are some of the most important tools and libraries in data science:
• Jupyter Notebook: Interactive environment for coding and documentation.
• Google Colab: Cloud-based Jupyter Notebook for collaborative coding.
• TensorFlow: Deep learning framework for building neural networks.
• PyTorch: Popular library for machine learning and deep learning.
• Scikit-learn: Tools for predictive data analysis and machine learning.
• Docker: Containerization for reproducible environments.
• Kubernetes: Managing and scaling containerized applications.
• Apache Kafka: Real-time data streaming and processing.
• Tableau: A powerful tool for creating interactive and shareable data visualizations.
• Power BI: A business intelligence tool for visualizing data and generating insights.
• Keras: A user-friendly library for designing and training deep learning models.

2. Benefits and Uses


Benefits:
1. Better Decision-Making
Data science enables companies to make informed decisions by providing objective evidence
derived from data analysis. By leveraging data and risk analysis practices, businesses can navigate
complex decisions with confidence, ensuring they are based on solid data rather than intuition.
2. Enhanced Performance Measurement
With data science, businesses can measure performance more accurately. By collecting and
analyzing data, organizations can use trends and empirical evidence to make educated decisions,
ensuring continuous improvement and effective problem-solving across the organization.
3. Financial Insights and Optimization
Data science aids in making financial predictions, generating reports, and analyzing economic
trends. This allows companies to make informed decisions on budgeting, finances, and expenses,
leading to optimized revenue generation and a clear understanding of internal financial health.
4. Product Development
Through data-driven analysis, businesses can develop products that resonate with their target
audience. By understanding customer preferences and behaviors, companies can tailor their
offerings to meet market demands, resulting in better product development and increased customer
satisfaction.
5. Increased Efficiency
Data science helps streamline operations by identifying inefficiencies and optimizing processes. By
collecting and analyzing manufacturing data, companies can improve production efficiency and
maximize output, ultimately enhancing overall business performance.
6. Risk Mitigation and Fraud Detection
Data science enhances security by detecting and preventing fraudulent activities. Machine learning
algorithms can identify unusual behavior patterns, allowing businesses to address potential fraud
promptly. Additionally, tracking workplace operations can help ensure compliance with policies and
detect any fraudulent practices.
7. Predictive Insights and Trend Analysis
Using big data and statistical analysis, data scientists can develop projections and predictions that
help executives adjust operations accordingly. This enables companies to stay ahead of market
trends, anticipate customer feedback, and tailor their strategies to meet evolving market conditions.
8. Improved Customer Experience
By analyzing customer data, businesses can offer personalized services and enhance customer
satisfaction. Understanding customer habits, preferences, and behaviors allows companies to build
strong brand loyalty and deliver a superior customer experience.
9. Multiple Job Options
Because data science is in high demand, it has given rise to a large number of career opportunities across its various fields. Some of them are Data Scientist, Data Analyst, Research Analyst, Business Analyst, Analytics Manager, Big Data Engineer, etc.
10. Business benefits
Data Science helps organizations know how and when their products sell best, so products are always delivered to the right place at the right time. Faster and better decisions are taken by the organization to improve efficiency and earn higher profits.
11. Highly Paid Jobs & Career Opportunities
Data Scientist remains one of the most sought-after roles, and the salaries for this position are also very attractive.
12. Hiring benefits
Data science has made it comparatively easier to sort through applicant data and identify the best candidates for an organization. Big Data and data mining have made the processing and screening of CVs, aptitude tests and games easier for recruitment teams.

Uses:
1. In Search Engines
The most visible application of Data Science is in search engines. When we want to search for something on the internet, we mostly use search engines like Google, Yahoo, DuckDuckGo and Bing. Data Science is used to deliver relevant search results faster.
2. In Transport
Data Science has also entered real-time applications in the transport field, such as driverless cars. With the help of driverless cars, it becomes easier to reduce the number of accidents.
3. In Finance
Data Science plays a key role in the financial industry, which constantly faces issues of fraud and risk of losses. Financial firms therefore need to automate risk-of-loss analysis in order to make strategic decisions. They also use Data Science analytics tools to predict the future, allowing companies to estimate customer lifetime value and anticipate stock market moves.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to provide a better user experience through personalized recommendations.
5. In Health Care
In the healthcare industry, data science acts as a boon. Data Science is used for:
• Detecting Tumor.
• Drug discoveries.
• Medical Image Analysis.
• Virtual Medical Bots.
• Genetics and Genomics.
• Predictive Modeling for Diagnosis etc.
6. Image Recognition
Currently, Data Science is also used in Image Recognition.
7. Targeted Recommendation
Targeted recommendation is one of the most important applications of Data Science. Whatever a user searches for on the internet, they will subsequently see related posts and advertisements across the sites and apps they use.
8. Airline Route Planning
With the help of Data Science, the airline sector is also growing: for example, it becomes easier to predict flight delays. It also helps decide whether to fly directly to the destination or take a halt in between; a flight from Delhi to the U.S.A. can take a direct route or stop over along the way before reaching the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, data science concepts are used together with machine learning: with the help of past data, the computer improves its performance. Many games, such as Chess and EA Sports titles, use Data Science concepts.
10. Medicine and Drug Development
The process of creating a medicine is very difficult and time-consuming, and it has to be carried out with full discipline because someone's life is at stake. Without Data Science, developing a new medicine or drug takes a lot of time, resources, and money; with Data Science, it becomes easier because the likely success rate can be determined from biological data and factors. Algorithms based on data science can forecast how a compound will react in the human body without lab experiments.
11. In Delivery Logistics
Various Logistics companies like DHL, FedEx, etc. make use of Data Science. Data Science helps
these companies to find the best route for the Shipment of their Products, the best time suited for
delivery, the best mode of transport to reach the destination, etc.
12. Autocomplete
The autocomplete feature is an important application of Data Science: the user types just a few letters or words and is offered suggestions that complete the rest of the line. In Gmail, when we write a formal mail, the data-science-based autocomplete feature offers an efficient way to complete whole sentences. Autocomplete is also widely used in search engines, social media, and various other apps.
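As a toy illustration of the idea behind autocomplete, the sketch below matches a typed prefix against a small made-up vocabulary and returns the most frequent completions; real systems use far more sophisticated language models.

# Hypothetical vocabulary with usage frequencies (illustrative values only)
vocabulary = {"data": 120, "database": 45, "data science": 90, "dashboard": 30}

def autocomplete(prefix, k=3):
    """Return up to k vocabulary entries starting with prefix, most frequent first."""
    matches = [(word, freq) for word, freq in vocabulary.items() if word.startswith(prefix)]
    matches.sort(key=lambda item: item[1], reverse=True)
    return [word for word, _ in matches[:k]]

print(autocomplete("da"))   # ['data', 'data science', 'database']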

3. Facets of Data
Very large amounts of data are generated in big data and data science. This data comes in various types, and the main categories are as follows:

a) Structured data

b) Unstructured data

c) Natural language

d) Machine-generated data

e) Graph-based data

f) Audio, video and images

g) Streaming data

Structured Data

• Structured data is arranged in a rows-and-columns format. This makes it easy for applications to retrieve and process the data. A database management system is used for storing structured data.

• The term structured data refers to data that is identifiable because it is organized in a structure. The
most common form of structured data or records is a database where specific information is stored
based on a methodology of columns and rows.

• Structured data is also searchable by data type within content. Structured data is understood by
computers and is also efficiently organized for human readers.

• An Excel table is an example of structured data.
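As a small illustration, the sketch below represents structured data as rows and columns in a pandas DataFrame, much like a database table or an Excel sheet; the values are made up.

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Asha", "Ravi", "Meena"],
    "city": ["Hyderabad", "Chennai", "Pune"],
    "total_spend": [2500.0, 1800.5, 3200.0],
})

# Because the structure is known, filtering by column and type is straightforward
print(customers[customers["total_spend"] > 2000])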

Unstructured Data

• Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data, so it is difficult to retrieve the required information. Unstructured data has no identifiable structure.

• Unstructured data can be in the form of text (documents, email messages, customer feedback), audio, video or images. Email is an example of unstructured data.

• Even today, in most organizations more than 80% of data is in unstructured form. It carries a lot of information, but extracting information from these varied sources is a very big challenge.

• Characteristics of unstructured data:

1. There is no structural restriction or binding for the data.

2. Data can be of any type.

3. Unstructured data does not follow any structural rules.

4. There are no predefined formats, restriction or sequence for unstructured data.

5. Since there is no structural binding for unstructured data, it is unpredictable in nature.

Natural Language

• Natural language is a special type of unstructured data.

• Natural language processing enables machines to recognize characters, words and sentences, then
apply meaning and understanding to that information. This helps machines to understand language as
humans do.

• Natural language processing is the driving force behind machine intelligence in many modern real-
world applications. The natural language processing community has had success in entity recognition,
topic recognition, summarization, text completion and sentiment analysis.

• For natural language processing to help machines understand human language, it must go through speech recognition, natural language understanding and machine translation. It is an iterative process comprising several layers of text analysis.
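As a toy illustration of these text-analysis layers, the sketch below tokenizes a sentence and computes a crude sentiment score from small hand-made word lists; this is illustrative only, not a real NLP model.

import re
from collections import Counter

text = "The product is great and the support team was very helpful, but shipping was slow."

# Tokenization: split the raw text into lowercase words
tokens = re.findall(r"[a-z']+", text.lower())
print(Counter(tokens).most_common(3))

# Crude sentiment: count words from tiny hand-made polarity lists
positive = {"great", "helpful", "good", "excellent"}
negative = {"slow", "bad", "poor", "terrible"}
score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
print("Sentiment score:", score)   # a positive value suggests an overall positive tone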

Machine Generated Data

• Machine-generated data is information that is created without human interaction, as a result of a computer process or application activity. This means that data entered manually by an end user is not considered machine-generated.

• Machine data contains a definitive record of all activity and behavior of our customers, users,
transactions, applications, servers, networks, factory machinery and so on.

• It includes configuration data, data from APIs and message queues, change events, the output of diagnostic commands, call detail records, sensor data from remote equipment and more.

• Examples of machine data are web server logs, call detail records, network event logs and telemetry.

• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate machine data. Machine data is generated continuously by every processor-based system, as well as many consumer-oriented systems.

• It can be either structured or unstructured. In recent years, the volume of machine data has surged. The expansion of mobile devices, virtual servers and desktops, as well as cloud-based services and RFID technologies, is making IT infrastructures more complex.
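As a small illustration of working with machine-generated data, the sketch below parses a single, made-up web server access log line into structured fields using a regular expression.

import re

log_line = '192.168.1.10 - - [12/Mar/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5321'

pattern = r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]+)" (?P<status>\d{3}) (?P<size>\d+)'
match = re.match(pattern, log_line)

if match:
    record = match.groupdict()
    # Once parsed, the machine data becomes structured and easy to aggregate
    print(record["ip"], record["status"], record["size"])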

Graph-based or Network Data

• Graphs are data structures to describe relationships and interactions between entities in complex systems. In general, a graph contains a collection of entities called nodes and another collection of interactions between pairs of nodes called edges.

• Nodes represent entities, which can be of any object type that is relevant to our problem domain.
By connecting nodes with edges, we will end up with a graph (network) of nodes.

• A graph database stores nodes and relationships instead of tables or documents. Data is stored just
like we might sketch ideas on a whiteboard. Our data is stored without restricting it to a predefined
model, allowing a very flexible way of thinking about and using it.

• Graph databases are used to store graph-based data and are queried with specialized query languages such as SPARQL or Cypher.

• Graph databases are capable of sophisticated fraud prevention. With graph databases, we can use
relationships to process financial and purchase transactions in near-real time. With fast graph queries,
we are able to detect that, for example, a potential purchaser is using the same email address and
credit card as included in a known fraud case.

• Graph databases can also help users easily detect relationship patterns, such as multiple people associated with a single personal email address, or multiple people sharing the same IP address but residing at different physical addresses.

• Graph databases are a good choice for recommendation applications. With graph databases, we can
store in a graph relationships between information categories such as customer interests, friends and
purchase history. We can use a highly available graph database to make product recommendations to
a user based on which products are purchased by others who follow the same sport and have similar
purchase history.

• Graph theory was probably the main method of social network analysis in the early history of the social network concept. The approach is applied to social network analysis in order to determine important features of the network, such as its nodes and links (for example, influencers and followers).

• Influencers on a social network are identified as users who have an impact on the activities or opinions of other users, by way of followership or influence on decisions made by other users on the network.

• Graph theory has proved to be very effective on large-scale datasets such as social network data, because it can bypass building an actual visual representation of the data and run directly on data matrices.
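To make the idea of nodes, edges and influencers concrete, here is a minimal sketch using the networkx library (assumed to be installed); the user names and "follows" relationships are made up.

import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("alice", "bob"),    # alice follows bob
    ("carol", "bob"),
    ("dave", "bob"),
    ("bob", "alice"),
])

# A node with many incoming edges (followers) is a candidate influencer
followers = dict(g.in_degree())
print(max(followers, key=followers.get))   # 'bob'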

Audio, Image and Video

• Audio, image and video are data types that pose specific challenges to a data scientist. Tasks that
are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.

• The terms audio and video commonly refer to time-based media storage formats for sound/music and moving-picture information. Digital audio and video recordings, encoded with audio and video codecs, can be uncompressed, losslessly compressed or lossy compressed depending on the desired quality and use cases.

• It is important to remark that multimedia data is one of the most important sources of information
and knowledge; the integration, transformation and indexing of multimedia data bring significant
challenges in data management and analysis. Many challenges have to be addressed including big
data, multidisciplinary nature of Data Science and heterogeneity.

• Data Science is playing an important role to address these challenges in multimedia data.
Multimedia data usually contains various forms of media, such as text, image, video, geographic
coordinates and even pulse waveforms, which come from multiple sources. Data Science can be a
key instrument covering big data, machine learning and data mining solutions to store, handle and
analyze such heterogeneous data.

Streaming Data

• Streaming data is data that is generated continuously by thousands of data sources, which typically send the data records simultaneously and in small sizes (on the order of kilobytes).

• Streaming data includes a wide variety of data such as log files generated by customers using your
mobile or web applications, ecommerce purchases, in-game player activity, information from social
networks, financial trading floors or geospatial services and telemetry from connected devices or
instrumentation in data centers.
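A toy sketch of consuming streaming data is shown below: a Python generator simulates small telemetry records arriving one at a time, and a running average is maintained as they arrive; the sensor values are random and purely illustrative.

import random
import time

def sensor_stream(n=5):
    """Simulate a stream of small telemetry records."""
    for _ in range(n):
        yield {"sensor_id": "s-1", "reading": round(random.uniform(20.0, 30.0), 2)}
        time.sleep(0.1)   # records arrive continuously, not as one batch

total, count = 0.0, 0
for record in sensor_stream():
    count += 1
    total += record["reading"]
    print(f"record {count}: reading={record['reading']}, running avg={total / count:.2f}")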

4. Data Science Process in Brief


Some steps are necessary for any task in the field of data science in order to derive fruitful results from the data at hand.
• Data Collection – After formulating a problem statement, the main task is to collect data that can help us in our analysis and manipulation. Sometimes data is collected by performing some kind of survey, and at other times it is done through web scraping.
• Data Cleaning – Most of the real-world data is not structured and requires cleaning and
conversion into structured data before it can be used for any analysis or modeling.
• Exploratory Data Analysis – This is the step in which we try to find the hidden patterns in the data at hand. We also analyze the different factors that affect the target variable and the extent to which they do so. How the independent features are related to each other, and what can be done to achieve the desired results, can also be answered in this step. This gives us a direction to work in when we start the modeling process.
• Model Building – Different types of machine learning algorithms and techniques have been developed that can easily identify complex patterns in the data, a task that would be very tedious for a human.
• Model Deployment – After a model is developed and performs well on the holdout or real-world dataset, we deploy it and monitor its performance. This is the stage where our learning from the data is applied to real-world applications and use cases.
Data Science Process Life Cycle
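To make the deployment step concrete, here is a minimal sketch that persists a trained scikit-learn model with joblib and reloads it inside a simple prediction function; the model, file name and feature values are illustrative assumptions, not part of any particular project.

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# "Deployment": save the fitted model so an application can reuse it later
joblib.dump(model, "iris_model.joblib")

def predict(features):
    """Load the persisted model and return a class prediction."""
    loaded = joblib.load("iris_model.joblib")
    return loaded.predict([features])[0]

print(predict([5.1, 3.5, 1.4, 0.2]))   # prints the predicted class index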

Components of Data Science Process


Data Science is a very vast field, and to get the best out of the data at hand one has to apply multiple methodologies and use different tools to make sure the integrity of the data remains intact throughout the process, keeping data privacy in mind. The main components of Data Science are:
• Data Analysis – There are times when there is no need to apply advanced deep learning or complex methods to the data at hand to derive patterns from it. For this reason, before moving on to the modeling part, we first perform an exploratory data analysis to get a basic idea of the data and the patterns available in it; this gives us a direction to work in if we want to apply more complex analysis methods to our data.
• Statistics – It is a natural phenomenon that many real-life datasets follow a normal distribution, and when we know that a particular dataset follows a known distribution, most of its properties can be analyzed at once. Also, descriptive statistics and the correlation and covariance between features of the dataset help us better understand how one factor relates to another (a minimal sketch follows this list).
• Data Engineering – When we deal with a large amount of data, we have to make sure that the data is kept safe from online threats and that it is easy to retrieve and modify. Data Engineers play a crucial role in ensuring that the data is used efficiently.
• Advanced Computing
• Machine Learning – Machine Learning has opened new horizons, helping us build advanced applications and methodologies so that machines become more efficient, provide a personalized experience to each individual, and perform in an instant tasks that earlier required heavy human labor and time.
• Deep Learning – This is also a part of Artificial Intelligence and Machine Learning
but it is a bit more advanced than machine learning itself. High computing power and
a huge corpus of data have led to the emergence of this field in data science.
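As a concrete illustration of the Statistics component above, the sketch below computes descriptive statistics, correlation and covariance for two made-up features that follow an approximately normal pattern.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
hours_studied = rng.normal(loc=5, scale=1.5, size=100)
exam_score = 50 + 8 * hours_studied + rng.normal(scale=5, size=100)

df = pd.DataFrame({"hours_studied": hours_studied, "exam_score": exam_score})

print(df.describe())   # descriptive statistics per feature
print(df.corr())       # correlation between the two features
print(df.cov())        # covariance between the two features

Because the data here is drawn from a roughly normal distribution, the summary statistics and the strong positive correlation line up with what the known properties of that distribution would predict.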
5. BIG DATA ECOSYSTEM AND DATA SCIENCE
BIG DATA ECOSYSTEM

What is Big Data?

Big Data refers to the extraction, analysis and management of very large volumes of data. It revolves around data collections of colossal size. The 5 Vs that define big data are velocity, volume, value, variety and veracity.

Such amounts of data, which could not be processed earlier due to limitations in computational techniques, can now be handled with highly advanced tools and methodologies.

Some of the tools for Big Data are Apache Hadoop, Spark, Flink, etc. Big Data contains a pool of data that can be both structured and unstructured. By structured data, we mean data such as the records that mobile devices, services, and websites generate.

Unstructured data is less organized data that users generate themselves, for example emails, chats, telephone conversations, reviews, etc.

The contemporary era of Big Data came into existence after Google published its technical paper on MapReduce, which brought about a revolution in the data community. The MapReduce model was later implemented in an open-source framework called Hadoop.

Big Data Ecosystem

The big data ecosystem refers to the interconnected network of organizations, technology
platforms and applications that support big data. The ecosystem includes companies that
develop and deploy big data solutions, as well as those who use big data to make
business decisions.

The big data ecosystem is growing at a rapid pace, and it will require significant investment
in order to keep up. As the industry continues to mature, businesses will need to find ways
to work with larger data sets and create efficiencies through collaboration. To do this, they
will need to understand the basics of the big data ecosystem and its components.
The big data ecosystem has five key components:

1. Data sources: Every business needs access to reliable and large data sets in order to make
informed decisions. In order to find these sources, businesses need to identify where their
data comes from and how it can be accessed. This can be done through a variety of
methods, such as market research or surveys.

2. Platforms: Businesses use a number of different platforms to store, process and analyze their data. These platforms can come from traditional technology companies such as Microsoft or Amazon, or newer entrants such as Google Cloud Platform or Apple's iCloud.

3. Applications: Businesses use a wide range of applications in order to process their data.
These applications can be used for everything from analyzing customer
behavior to manufacturing products.

4. Data management: All businesses require effective ways to manage their data sets so that they are organized, reliable and accessible. This can be done through a number of methods, including manual processes or automatic processes, such as assimilating cubes from various source datasets into a single report or exporting all tables into an Excel file for analysis.

5. Collaboration: All businesses need effective ways to collaborate with other organizations
in order to share information and make better decisions. This can be done through a variety
of methods, including online surveys or collaborations with outside experts (such as
developers who can help improve the efficiency of your existing solutions).

What is Data Science?


Data Science is an interdisciplinary field that utilizes scientific methods, algorithms, and
systems to extract knowledge and insights from structured and unstructured data. It
encompasses a variety of techniques from statistics, machine learning, data mining, and big
data analytics.

Data Scientists use their expertise to:

1. Analyze: They examine complex datasets to identify patterns, trends, and correlations.

2. Model: Using statistical models and machine learning algorithms, they create
predictive models that can forecast future trends or behaviors.

3. Interpret: They translate data findings into actionable business strategies and
decisions.
Differences Between Big Data and Data Science
