Big Data
Introduction
• We produce a massive amount of data each day, whether we know about it or not.
Every click on the internet, every bank transaction, every video we watch on
YouTube, every email we send, every like on our Instagram post makes up data for
tech companies.
• With such a massive amount of data being collected, it only makes sense for
companies to use this data to understand their customers and
their behavior better. This is the reason why the popularity of Data Science has
grown manifold over the last few years.
• Big data is exactly what the name suggests, a “big” amount of data. Big Data
means a data set that is large in terms of volume and is more complex. Because of
the large volume and higher complexity of Big Data, traditional data processing
software cannot handle it. Big Data simply means datasets containing a large
amount of diverse data, both structured as well as unstructured.
• Big Data allows companies to address issues they are facing in their
business, and solve these problems effectively using Big Data
Analytics.
• Companies try to identify patterns and draw insights from this sea
of data so that it can be acted upon to solve the problem(s) at hand.
• Although companies have been collecting a huge amount of data
for decades, the concept of Big Data only gained popularity in the
early-mid 2000s.
• Corporations realized the amount of data that was being collected
on a daily basis, and the importance of using this data effectively.
5Vs of Big Data
1. Volume refers to the amount of data that is being collected. The data could be structured or unstructured.
2. Velocity refers to the rate at which data is coming in.
3. Variety refers to the different kinds of data (data types, formats, etc.) that are coming in for analysis.
4. Value refers to the usefulness of the collected data.
5. Veracity refers to the quality of the data coming in from different sources.
Volume, velocity, and variety are the original three Vs; over the last few years, two additional Vs, value and veracity, have also emerged.
How Does Big Data Work?
Big data involves collecting, processing, and analyzing vast amounts of data
from multiple sources to uncover patterns, relationships, and insights that
can inform decision-making. The process involves several steps:
1.Data Collection
Big data is collected from various sources such as social media, sensors, transactional
systems, customer reviews, and other sources.
2.Data Storage
The collected data then needs to be stored in a way that it can be easily accessed and
analyzed later. This often requires specialized storage technologies capable of
handling large volumes of data.
3. Data Processing
Once the data is stored, it needs to be processed before it can be analyzed. This involves
cleaning and organizing the data to remove any errors or inconsistencies, and transforming it
into a format suitable for analysis.
4. Data Analysis
After the data has been processed, it is time to analyze it using tools like statistical
models and machine learning algorithms to identify patterns, relationships, and trends.
5. Data Visualization
The insights derived from data analysis are then presented in visual formats such as
graphs, charts, and dashboards, making it easier for decision-makers to understand and
act upon them.
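The five steps above can be sketched end to end in miniature. The following Python sketch uses invented in-memory data and field names; real systems would use distributed storage and processing, but the shape of the pipeline is the same:

```python
# 1. Collection: raw events from two hypothetical sources (names are invented).
raw_events = [
    {"user": "a", "amount": "10.5", "source": "web"},
    {"user": "b", "amount": "bad",  "source": "web"},   # a malformed record
    {"user": "a", "amount": "4.0",  "source": "mobile"},
]

# 2. Storage: here just an in-memory list; real systems use HDFS, S3, etc.
stored = list(raw_events)

# 3. Processing: clean out records whose amount cannot be parsed.
def clean(events):
    out = []
    for e in events:
        try:
            out.append({**e, "amount": float(e["amount"])})
        except ValueError:
            continue  # drop inconsistent rows
    return out

cleaned = clean(stored)

# 4. Analysis: a simple aggregate (total spend per user).
def spend_per_user(events):
    totals = {}
    for e in events:
        totals[e["user"]] = totals.get(e["user"], 0.0) + e["amount"]
    return totals

totals = spend_per_user(cleaned)

# 5. "Visualization": a crude text bar chart for decision-makers.
def bars(totals, scale=1.0):
    return {u: "#" * int(t * scale) for u, t in totals.items()}

print(totals)        # the malformed record for user "b" was dropped
print(bars(totals))
```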
Use Cases
Big Data helps corporations make better and faster decisions, because they
have more information available to solve problems, and more data on which to test their
hypotheses.
Customer experience is a major field that has been revolutionized with the advent of Big Data.
Companies are collecting more data about their customers and their preferences than ever.
This data is being leveraged in a positive way, by giving personalized recommendations and
offers to customers, who are more than happy to allow companies to collect this data in return
for the personalized services. The recommendations you get on Netflix, or Amazon/Flipkart are
a gift of Big Data!
Machine Learning is another field that has benefited greatly from the increasing popularity of
Big Data. More data means larger datasets on which to train our ML models, and a model trained
on more data generally performs better.
Demand forecasting has become more accurate with more and more data being collected
about customer purchases. This helps companies build models that forecast future demand,
so they can scale production accordingly.
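As a toy illustration of demand forecasting, a simple moving-average model predicts the next period from recent history. The monthly sales figures below are invented; production systems use far richer models and data:

```python
def moving_average_forecast(history, window=3):
    """Forecast the next period's demand as the mean of the last `window` periods."""
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    return sum(history[-window:]) / window

monthly_units = [100, 120, 110, 130, 125, 135]  # invented purchase history
forecast = moving_average_forecast(monthly_units, window=3)
print(forecast)  # mean of the last three months: (130 + 125 + 135) / 3 = 130.0
```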
Big Data Tools
1. Apache Hadoop is an open-source big data tool designed to store and process large
amounts of data across multiple servers. Hadoop comprises a distributed file system
(HDFS) and a MapReduce processing engine.
2. Apache Spark is a fast and general-purpose cluster computing system that supports in-
memory processing to speed up iterative algorithms. Spark can be used for batch
processing, real-time stream processing, machine learning, graph processing, and SQL
queries.
3. Apache Cassandra is a distributed NoSQL database management system designed to
handle large amounts of data across commodity servers with high availability and fault
tolerance.
4. Talend is an open-source data integration platform that enables organizations to extract,
transform, and load (ETL) data from various sources into target systems. Talend supports
big data technologies such as Hadoop, Spark, Hive, Pig, and HBase.
…among many others.
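Hadoop's MapReduce model is easy to illustrate without a cluster. The sketch below reproduces the map, shuffle, and reduce phases of a word count in plain Python; a real Hadoop job distributes these same phases across many servers:

```python
from collections import defaultdict

def map_phase(doc):
    # Emit a (word, 1) pair for every word, as a Hadoop mapper would.
    return [(word.lower(), 1) for word in doc.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word, as a Hadoop reducer would.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big insights", "data drives decisions"]  # invented documents
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"])  # 2 2
```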
Challenges
1. Data Growth
• Managing datasets having terabytes of information can be a big challenge for companies. As
datasets grow in size, storing them not only becomes a challenge but also becomes an
expensive affair for companies.
• To overcome this, companies are now paying attention to data compression and de-
duplication. Data compression reduces the number of bits needed to represent the data,
reducing the space it consumes. Data de-duplication is the process of ensuring that
duplicate and unwanted data does not reside in the database.
2. Data Security
• Data security is often prioritized quite low in the Big Data workflow, which can backfire at times.
With such a large amount of data being collected, security challenges are bound to come up
sooner or later.
• Mining of sensitive information, fake data generation, and lack of cryptographic protection
(encryption) are some of the challenges businesses face when trying to adopt Big Data
techniques.
3. Data Integration
• Data is coming in from a lot of different sources (social media
applications, emails, customer verification documents, survey
forms, etc.). It often becomes a very big operational challenge for
companies to combine and reconcile all of this data.
• There are several Big Data solution vendors that offer ETL (Extract,
Transform, Load) and data integration solutions to companies that
are trying to overcome data integration problems. There are also
several APIs that have already been built to tackle issues related to
data integration.
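A minimal ETL sketch in Python shows the idea: extract rows from two hypothetical sources with different schemas (all field names here are invented), transform them to a common schema, and reconcile rows that describe the same customer:

```python
# Extract: two hypothetical sources that disagree on field names.
crm_rows = [{"CustomerName": "Ada", "Email": "ada@example.com"}]
survey_rows = [{"name": "ada", "mail": "ada@example.com", "score": 9}]

def transform(row, mapping):
    """Rename source fields to the target schema and normalise the name."""
    out = {target: row[src] for target, src in mapping.items() if src in row}
    out["name"] = out["name"].strip().lower()
    return out

# Transform: map each source's fields onto one common schema.
target = []
target += [transform(r, {"name": "CustomerName", "email": "Email"}) for r in crm_rows]
target += [transform(r, {"name": "name", "email": "mail", "score": "score"})
           for r in survey_rows]

# Load / reconcile: merge rows describing the same customer, matched on email.
merged = {}
for row in target:
    merged.setdefault(row["email"], {}).update(row)

print(list(merged.values()))  # one reconciled record per customer
```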
Advantages of Big Data
• Improved decision-making: Big data can provide insights and patterns that help
organizations make more informed decisions.
• Increased efficiency: Big data analytics can help organizations identify
inefficiencies in their operations and improve processes to reduce costs.
• Better customer targeting: By analyzing customer data, businesses can develop
targeted marketing campaigns that are relevant to individual customers, resulting
in better customer engagement and loyalty.
• New revenue streams: Big data can uncover new business opportunities, enabling
organizations to create new products and services that meet market demand.
• Competitive advantage: Organizations that can effectively leverage big data have
a competitive advantage over those that cannot, as they can make faster, more
informed decisions based on data-driven insights.
Disadvantages of Big Data
• Privacy concerns: Collecting and storing large amounts of data can raise privacy
concerns, particularly if the data includes sensitive personal information.
• Risk of data breaches: Big data increases the risk of data breaches, leading to loss
of confidential data and negative publicity for the organization.
• Technical challenges: Managing and processing large volumes of data requires
specialized technologies and skilled personnel, which can be expensive and time-
consuming.
• Difficulty in integrating data sources: Integrating data from multiple sources can be
challenging, particularly if the data is unstructured or stored in different formats.
• Complexity of analysis: Analyzing large datasets can be complex and time-
consuming, requiring specialized skills and expertise.
Implementation Across Industries
• Healthcare: Analyze patient data to improve healthcare outcomes, identify trends and patterns, and develop personalized treatments.
• Retail: Track and analyze customer data to personalize marketing campaigns, improve inventory management, and enhance the customer experience (CX).
• Finance: Detect fraud, assess risks, and make informed investment decisions.
• Manufacturing: Optimize supply chain processes, reduce costs, and improve product quality through predictive maintenance.
• Transportation: Optimize routes, improve fleet management, and enhance safety by predicting accidents before they happen.
• Energy: Monitor and analyze energy usage patterns, optimize production, and reduce waste through predictive analytics.
• Telecommunications: Manage network traffic, improve service quality, and reduce downtime through predictive maintenance and outage prediction.
• Government and public sector: Address issues such as preventing crime, improving traffic management, and predicting natural disasters.
• Advertising and marketing: Understand consumer behavior, target specific audiences, and measure the effectiveness of campaigns.
• Education: Personalize learning experiences and monitor student performance.
Big Data Architecture
Fig. Architecture of Big Data
Following are the components of Big Data Architecture:
Data Sources:
• Every source that feeds the data extraction pipeline falls under this heading, so this is the
starting point of the big data pipeline. Data sources, both open and third-party, play a
significant role in the architecture.
• Typical sources include relational databases, data warehouses, cloud-based data
warehouses, SaaS applications, real-time data from company servers and sensors such as
IoT devices, third-party data providers, and static files such as Windows logs.
Data Storage:
• Data for batch operations is stored in distributed file stores that can hold large files in a
variety of formats. Such a repository, which can also hold large numbers of files of
different formats, is often called a data lake.
• Common storage options include HDFS and cloud storage services from Microsoft Azure,
AWS, and others.
Big Data Architecture
Batch Processing:
• Long-running jobs filter, aggregate, and otherwise prepare the data for analysis. These
jobs typically read from source files, process the data, and write the output to new files.
• Multiple approaches to batch processing are employed, including Hive jobs and
MapReduce jobs written in Java, Scala, or other languages such as Python.
Real-Time Message Ingestion:
• Unlike batch processing, this component covers real-time streaming systems that capture
data at the moment it is generated, in a sequential and continuous fashion.
• In the simplest case, incoming messages are dropped into a folder or simple data store for
processing. If the solution requires message-based processing with reliable delivery and
other message-queuing semantics, a message-based ingestion store such as Apache
Kafka, Apache Flume, or Azure Event Hubs must be used instead.
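The core idea behind Kafka-style ingestion stores, an append-only log that many consumers can read independently from an offset, can be sketched in a few lines of Python (this toy class is not a real Kafka client, just an illustration of the model):

```python
class MessageLog:
    """A toy append-only log, imitating the idea behind Kafka-style ingestion stores."""

    def __init__(self):
        self._log = []

    def publish(self, message):
        self._log.append(message)
        return len(self._log) - 1  # offset at which the message was stored

    def consume(self, offset=0):
        # Consumers read from an offset; the log itself is never mutated,
        # so several consumers can process the same stream independently.
        return self._log[offset:]

log = MessageLog()
log.publish({"sensor": "t1", "temp": 21.5})  # invented sensor readings
log.publish({"sensor": "t1", "temp": 22.0})

print(len(log.consume(0)))  # a new consumer sees both messages
print(len(log.consume(1)))  # a caught-up consumer sees only the latest
```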
Stream Processing:
• Real-time message ingestion and stream processing are different stages: ingestion
captures the incoming data, often through a publish-subscribe tool, while stream
processing consumes the ingested data and prepares it for analysis.
• Stream processing handles the streaming data in the form of windows or streams and
writes the results to a sink. Tools include Apache Spark, Flink, Storm, etc.
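Windowed aggregation, the basic operation of stream processing, can be illustrated in plain Python with a tumbling (fixed-size, non-overlapping) window; the timestamps and sensor readings below are invented:

```python
from collections import defaultdict

def tumbling_window_avg(events, window_seconds):
    """Average readings per fixed (tumbling) time window, keyed by window index."""
    buckets = defaultdict(list)
    for timestamp, value in events:
        buckets[timestamp // window_seconds].append(value)
    return {w: sum(vals) / len(vals) for w, vals in sorted(buckets.items())}

# (timestamp_seconds, reading) pairs arriving on a stream.
stream = [(0, 10.0), (3, 14.0), (7, 20.0), (12, 30.0)]
print(tumbling_window_avg(stream, window_seconds=5))
# window 0 covers t in [0, 5): readings 10.0 and 14.0 average to 12.0
```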
Analytical Data Store:
Analytical tools work on already-processed data served from a data store based on HBase
or another NoSQL data warehouse technology. The data can also be presented through a
Hive database, which provides a metadata abstraction over the stored files and supports
interactive queries.
Reporting and Analysis:
The generated insights must be presented to the business, which is accomplished by
reporting and analysis tools that use embedded technology to produce useful graphs,
analyses, and insights. Examples include Cognos, Hyperion, and others.
Orchestration:
Big data solutions consist of repeated data-related tasks, contained in workflow chains
that transform source data, move data between sources and sinks, and load the results
into analytical stores. Orchestration tools such as Sqoop, Oozie, and Azure Data Factory
automate these workflows.
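At its core, an orchestrator runs tasks in dependency order. The toy workflow runner below (task names invented) captures that idea without the scheduling, retries, and monitoring that a real tool like Oozie or Data Factory provides:

```python
def run_workflow(tasks, deps):
    """Run tasks so that every task's dependencies run first (a tiny topological sort).

    Assumes the dependency graph is acyclic; real orchestrators also detect cycles.
    """
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for dep in deps.get(name, []):
            run(dep)           # recurse into dependencies first
        tasks[name]()
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

executed = []
tasks = {
    "extract":   lambda: executed.append("extract"),
    "transform": lambda: executed.append("transform"),
    "load":      lambda: executed.append("load"),
}
deps = {"transform": ["extract"], "load": ["transform"]}
print(run_workflow(tasks, deps))  # ['extract', 'transform', 'load']
```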

Chapter 4 : Introduction to BigData.pptx
