Industrial Revolution – IV
By
Professor Jihad Mohamad ALJA’AM
1
BIG DATA
We live surrounded and submerged by DATA
DIGITAL DATA
2
3
Introduction
Since the invention of computers, people have
used the term data to refer to computer
information.
Example: I have data, I don’t have data
DATA
Data can be text or numbers written on
paper, bytes and bits inside the
memory of electronic devices, or
facts stored in a person’s mind.
DIGITAL DATA Binary 0, 1
4
Data: Text + Video + Audio + Logs from Web
Web Logs File
Contains Information about Visitors
WE CAN FIND DATA EVERYWHERE
ALL PEOPLE CAN GENERATE DATA
MASSIVELY
5
DATA EXPLOSION WITH INTERNET AND SOCIAL MEDIA
6
7
WORLD OF DATA
8
Internet and Data
You are on the Internet almost daily. You check
your email, send replies, maybe browse
websites, and even click on things (image, link).
Every move you make online generates data.
With around 4.66 billion active Internet users
worldwide, the data produced daily surpasses
the imagination.
9
SOCIAL MEDIA GENERATE DATA
A HUGE AMOUNT OF DATA IS GENERATED
10
The Internet & DATA & CORONAVIRUS
The coronavirus pandemic shuttered offices,
schools, restaurants, and other establishments.
It led people to spend more time on the
Internet for work, learning, and entertainment.
1.7 MB is how much data is created every
second per person. (Northeastern
University)
Photos + Videos + Voices + Text
11
2.5 quintillion bytes of data were created
every day. (SG Analytics, 2020)
1 quintillion = 1,000,000,000,000,000,000 = 10^18, so this is 2.5 × 10^18 bytes/day
That is equivalent to 10 million discs, which
when stacked would be as tall as two Eiffel
Towers combined. (Dihuni, 2020)
12
As of August 2020, in one Internet minute,
WhatsApp users sent 41,666,667 messages.
That was the highest usage of any platform
in 2020. (Domo, 2020)
Next came voice and video calls,
which amounted to 1,388,889 per Internet
minute. (Domo, 2020)
13
404,444 users streamed on Netflix every
minute. (Domo, 2020)
Amazon shipped 6,659 packages per
minute. This figure contributed to the
explosive growth of e-commerce. (Domo, 2020)
14
Email users sent 306.4 billion emails per
day in 2020. In contrast, 293.6 billion were
exchanged in 2019. (Radicati Group, 2019;
TechJury, 2020)
People sent 500 million tweets daily.
(TechJury, 2020). That was 5,787 tweets
per second. (e-Learning Infographics, 2020)
3.5 billion searches were made on Google
every day (e-Learning Infographics, 2020),
making it the most visited search engine.
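The tweets-per-second figure above follows from dividing the daily total by the number of seconds in a day. A minimal Python sketch of that arithmetic, where the function name `per_second` is illustrative and not taken from the slides:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in one day


def per_second(per_day: float) -> float:
    """Convert a daily count into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


tweets_per_day = 500_000_000  # daily figure quoted above (TechJury, 2020)
tweets_per_second = int(per_second(tweets_per_day))
print(tweets_per_second)  # 5787
```

The same helper can be applied to any of the other daily totals to compare them on a common per-second scale.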
15
300 hours of video were uploaded on
YouTube per minute. (e-Learning
Infographics, 2020)
A connected car produced 4 TB of data in
one day. (Raconteur, 2020)
16
Smart Transportation
WHAT TO DO WITH DATA?
17
DATA ENGINEERING
1. DATA STORAGE
2. DATA PROCESSING
3. INFORMATION RETRIEVAL
4. SEARCHING DATA
5. ORGANISING DATA
6. DATA CLASSIFICATION
7. DATA CLEANING
8. COMPLETING MISSING DATA
9. REASONING, PREDICTION, PLANNING
18
DATA & EVENTS
Over six million posts were made in one
day to commemorate Supreme Court
Justice Ruth Bader Ginsburg. (Facebook,
2020)
When Kamala Harris was announced as the
Democratic nominee for United States vice
president in August, the announcement
drew over 10 million posts in a day. (Facebook, 2020)
19
Facebook created 4 PB of data in one day.
(Raconteur, 2020)
Users posted 350 million photos in a day on
Facebook. (Raconteur, 2020)
47 million stories with the Support Small
Business Sticker were created on Instagram
in the last quarter of 2020. (Facebook,
2020)
20
Instagram users uploaded 95 million photos
per day over the year. (e-Learning
Infographics, 2020).
The average user stayed on the Instagram
app for 15 minutes. Within those 15
minutes, they comment, like, search, and
scroll, adding more to the data produced.
(e-Learning Infographics, 2020).
Two professionals signed up on LinkedIn
every second in 2020. (e-Learning
Infographics, 2020).
TECHNOLOGY GENERATES DATA
21
DATA GROWTH IN 2021
How much data is created every day in 2021? As
of April 2021, the number of people on the
Internet has grown by 7.6%. This means 60%
of the world’s population is now online.
74 zettabytes – the total data in the world
by the end of 2021, according to expert
predictions. (IDC & Statista, 2020)
There would be a 3% growth of email users
in 2021. (Radicati Group, 2019)
One study shows that 1.145 trillion MB of
data is created every day. (TechJury, 2021)
There could be 2 trillion searches on Google
by the end of 2021. (Internet Live Stats,
2021)
That would be between five and six billion searches
per day over 365 days. (Internet Live Stats, 2021)
3,026,626 emails are sent every second.
(Internet Live Stats, 2021)
Of these, 67% are spam. (Internet Live
Stats, 2021)
Users send 31 million messages every
minute on Facebook. (Strategic
Tech Investor, 2021)
Facebook users view around 2.7 million
videos every minute. (Strategic
Tech Investor, 2021)
Every year, more than 2.5 billion blog posts
go up. (GrowthBadger, 2021)
23
Each month, users publish 70 million blog
posts and post 77 million new comments on
WordPress. (GrowthBadger, 2021)
As more and more people use the internet,
cybersecurity threats also continue to
grow. To date, 230,000 new malware
samples are created every day. (PurpleSec,
2021).
24
WORLD OF ZETTABYTE
25
DATA = KNOWLEDGE
KNOWLEDGE GENERATES MONEY
26
27
A PILE OF DVDs THAT REACHES THE MOON
WHEN STACKED
DIFFICULTIES with DATA
28
Importance of DATA
If you work in human services because you hate
math, terms like “data” and “quantitative analysis”
might sound scary.
Don’t be intimidated! Data does not have to be
complicated.
Data is useful information that you collect
to support organizational decision-making
and strategy.
IMPROVE OUTCOMES
29
Quality
Improving quality is first and foremost among the
reasons why organizations should be using data.
DATA = KNOWLEDGE
MORE DATA = MORE KNOWLEDGE
YOU CAN SEE THE WORLD BETTER WITH DATA
30
MONITORING
Data allows you to monitor the health of your
organization:
Organizations are able to respond to challenges
before they become full-blown crises.
Effective quality monitoring will allow your
organization to be proactive rather than reactive
and will help the organization maintain best
practices over time.
PROACTIVE VERSUS REACTIVE
31
PROACTIVE Versus REACTIVE
Weather Provider Companies
32
Example.
Data: Weather NEXT Week
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
35°C 38°C 45°C 51°C 55°C 60°C 65°C
PROACTIVE:
Inform people now so that they can prepare.
Provide sufficient bottles of water.
Check the air conditioners at home and at work.
Ban outdoor work from 12:00 to 15:00.
Thursday Friday Saturday Sunday
51°C 55°C 60°C 65°C
REACTIVE:
Wait until Thursday, when some people die from
the heat, and only then take ACTION.
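The proactive approach amounts to scanning the forecast ahead of time and flagging dangerous days before they arrive. A minimal Python sketch, using the temperatures from the example and an assumed alert threshold of 50°C (the slides do not define one; `DANGER_THRESHOLD_C` and `days_needing_action` are illustrative names):

```python
# Forecast for next week, taken from the example above (degrees Celsius).
forecast = {
    "Monday": 35, "Tuesday": 38, "Wednesday": 45, "Thursday": 51,
    "Friday": 55, "Saturday": 60, "Sunday": 65,
}

# Hypothetical alert threshold; this value is an assumption, not from the slides.
DANGER_THRESHOLD_C = 50


def days_needing_action(temps: dict, threshold: int) -> list:
    """Return the days whose forecast temperature exceeds the threshold."""
    return [day for day, temp in temps.items() if temp > threshold]


alert_days = days_needing_action(forecast, DANGER_THRESHOLD_C)
print(alert_days)  # ['Thursday', 'Friday', 'Saturday', 'Sunday']
```

Running this check on Monday gives three days of warning, which is what separates the proactive response from the reactive one.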
33
DATA = MONEY
YOU CAN COLLECT AND SELL DATA
COMPANIES MAY BUY DATA
DATA SETS
Data about people
Data about organisations
Data about countries
Data about families
Data about students in universities
Data about products
Data about hotels
34
DATA and Strategy
Data allows organizations to measure the
effectiveness of a given strategy.
When strategies are put into place to overcome a
challenge, collecting data will allow you to
determine how well your solution is performing,
and whether or not your approach needs to be
tweaked or changed over the long-term.
A data strategy is a long-term plan that
defines the technology, processes,
people, and rules required to manage an
organization's information assets.
35
Example
Problem: a large number of students fail the
physics course.
Strategy:
1. Give more homework
2. Use videos to explain theoretical concepts
3. Reduce the weight of the midterm and final exams
4. Ask the students to work in groups
Data Collection:
Collect the data over a period of six months and see whether
this strategy solves the problem:
REDUCE THE NUMBER OF FAILING STUDENTS.
You can adjust or change something in
your strategy based on the analysis of
the data
36
Find Solutions to Problems
Data allows organizations to more effectively
determine the cause of problems.
Data allows organizations to visualize relationships
between what is happening in different locations
and departments.
DATA ENGINEERS
SOFTWARE FOR DATA VISUALISATION
37
Example: Travel Agency
AGA travel agency has 4 offices. Get the sales data
for every office over the year.
Collect DATA
Office      Sales         Consumption
Office-1    5 million     3 million
Office-2    25 million    8 million
Office-3    18 million    8 million
Office-4    1 million     2 million
Office-4: Loses money
DATA ANALYSIS
Action-1: Provide training for staff.
Action-2: Reduce staff in Office-4 or even close it.
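The analysis step in this example can be sketched in a few lines of Python. The sketch assumes that "Consumption" means each office's yearly costs, so that sales minus consumption gives the profit; that reading, and the names `offices` and `profit`, are assumptions rather than something the slide states.

```python
# Sales and consumption per office, in millions, from the example above.
offices = {
    "Office-1": {"sales": 5, "consumption": 3},
    "Office-2": {"sales": 25, "consumption": 8},
    "Office-3": {"sales": 18, "consumption": 8},
    "Office-4": {"sales": 1, "consumption": 2},
}


def profit(figures: dict) -> int:
    """Profit in millions, assuming consumption represents the office's costs."""
    return figures["sales"] - figures["consumption"]


profits = {name: profit(figures) for name, figures in offices.items()}
losing_offices = [name for name, value in profits.items() if value < 0]

print(profits)         # {'Office-1': 2, 'Office-2': 17, 'Office-3': 10, 'Office-4': -1}
print(losing_offices)  # ['Office-4']
```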
38
Systems Advocacy
Data is a key component of systems advocacy.
Utilizing data will help you present a strong
argument for systems change.
Argue why it is important to make changes
in your current systems or software
Whether you are advocating for increased
funding from public or private sources, or
making the case for changes in regulation,
illustrating your argument through the use of
data will allow you to demonstrate why
changes are needed.
1. Change something: Systems/Software
2. Ask for funding from the government
3. Increase/reduce the staff
4. Buy faster computers
DATA HELPS TO STRENGTHEN YOUR ARGUMENT
FOR CHANGES
39
DATA FOR STRATEGIC DECISIONS
Data will help you explain (both good and bad)
decisions to your stakeholders. Whether or not
your strategies and decisions have the outcome
you anticipated, you can be confident that you
developed your approach based not on guesses
but on solid data analysis.
DATA HELPS TO EVALUATE DECISIONS &
ADOPTED STRATEGIES
Strategic Planning
Data allows you to replicate areas of strength
across your organization. Data analysis will help
you identify high-performing programs, service
areas, and people.
Once you identify your high performers, you can
study them in order to develop strategies to assist
programs, service areas, and people that are
underperforming (for example, through training).
40
ALL BUSINESSES NEED BIG DATA TO FLOURISH
BIG DEMAND FOR DATA SCIENTISTS
Top Industries Hiring Data Scientists in 2022
https://www.naukri.com/learning/articles/top-industries-hiring-data-scientists/
41
WE CANNOT GROW BUSINESSES
WITHOUT DATA
The value of the data science market
is slated to reach $16 billion by 2025
42
Top Recruiters of DATA SCIENTISTS
Amazon
Flipkart
Walmart
Aditya Birla Fashion & Retail Ltd.
Future Enterprises Ltd.
Reliance Retail Ltd.
K. Raheja Group (Shoppers’ Stop)
Landmark Group (Lifestyle)
ITC
43
How is Data Stored?
Computers represent data (e.g., text, images,
sound, video) as binary values that use two
digits: 1 and 0.
The smallest unit of data is called a “bit,” and it
represents a single binary value (1 or 0). A byte
is eight bits long.
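To make the idea concrete, the following Python sketch shows the bits behind a short piece of text. It assumes the text is stored in the UTF-8 encoding, and the helper name `to_bits` is illustrative.

```python
def to_bits(text: str) -> list:
    """Encode text as UTF-8 and return each byte as an 8-character bit string."""
    return [format(byte, "08b") for byte in text.encode("utf-8")]


bits = to_bits("Hi")
print(bits)  # ['01001000', '01101001']
```

Each of the two letters occupies one byte, and each byte is written out as eight bits.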
Memory and storage are measured in units
such as megabytes, gigabytes, terabytes,
petabytes, and exabytes.
44
Data
Data Measurement Size
Bit Single Binary Digit (1 or 0)
Byte 8 bits
Kilobyte (KB) 1,024 Bytes
Megabyte (MB) 1,024 Kilobytes
Gigabyte (GB) 1,024 Megabytes
Terabyte (TB) 1,024 Gigabytes
Petabyte (PB) 1,024 Terabytes
Exabyte (EB) 1,024 Petabytes
A zettabyte is storage for 30 Billion
4K movies
45
The Human Brain Capacity is 1.2ZB
A Huge Amount of Data needs to be
Stored, Structured and Searched
46
Current Technology fails to work with
Bigdata
Data Processing Cycle
Data processing is defined as the re-ordering
or re-structuring of data by people or machines
to increase its utility and add value for a
specific function or purpose.
Example
Search tweets on Qatar and World Cup.
TWEETED TEXTS ARE NOT STRUCTURED.
HOW CAN WE STRUCTURE THEM IN ORDER
TO EXTRACT SOME USEFUL
INFORMATION?
“What people think about Qatar”
47
We can address queries to structured data.
This is done with a language called
Structured Query Language or SQL for
short.
For example, if we want to find out which
users tweeted between 10am and 11am (and
so how many did), we could do something like:
A QUERY IN SQL
SELECT Users.UserId, Twitter.Tweet, Twitter.Time
FROM Twitter
INNER JOIN Users ON
Twitter.UserId = Users.UserId
-- assumes Time is stored as 24-hour 'HH:MM' text
WHERE Twitter.Time >= '10:00' AND Twitter.Time <= '11:00';
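A query of this kind can be tried out with Python's standard-library `sqlite3` module. The sketch below is only an illustration: the table layouts, the sample rows, and the choice to store the time as 24-hour `'HH:MM'` text are all assumptions made for the example, and the time condition is written with `AND` so that both bounds apply.

```python
import sqlite3

# In-memory database with hypothetical sample data (not real tweets).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (UserId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Twitter (UserId INTEGER, Tweet TEXT, Time TEXT);
    INSERT INTO Users VALUES (1, 'user_a'), (2, 'user_b'), (3, 'user_c');
    INSERT INTO Twitter VALUES
        (1, 'Kick-off soon', '09:45'),
        (1, 'What a goal', '10:15'),
        (2, 'Great stadium', '10:40'),
        (3, 'Heading home', '11:30');
""")

# Tweets posted between 10:00 and 11:00, joined to the user who wrote them.
rows = conn.execute("""
    SELECT Users.UserId, Twitter.Tweet, Twitter.Time
    FROM Twitter
    INNER JOIN Users ON Twitter.UserId = Users.UserId
    WHERE Twitter.Time >= '10:00' AND Twitter.Time <= '11:00'
    ORDER BY Twitter.Time
""").fetchall()

# Count the distinct users behind those tweets.
user_count = len({user_id for user_id, _tweet, _time in rows})

print(rows)        # [(1, 'What a goal', '10:15'), (2, 'Great stadium', '10:40')]
print(user_count)  # 2
```

Comparing zero-padded `'HH:MM'` strings works here because their alphabetical order matches their chronological order.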
Organising Data
Standard data processing is made up of three basic
steps:
Input, Processing, and Output
Together, these three steps make up the data
processing cycle.
Input: The input data is prepared for
processing in a convenient form that depends
on the machine carrying out the
processing.
HOW TO PROCESS BIG-DATA?
Processing: Next, the input data’s form is
changed to something more useful. For
example, information from timecards
(attendance) is used to calculate
paychecks.
Output: In the final step, the processing
results are collected as output data, with its
final form depending on what it’s being
used for. Using the previous example,
output data becomes the employees’
actual paychecks.
HOW MUCH MONEY SHOULD BE GIVEN
49
Employee Timecards: ATTENDANCE
ANALYSE THE TIMECARDS OVER THE WEEK
AND GENERATE THE PAYCHECK
ACCORDINGLY
$675.80 based on the worked hours
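The timecard example maps directly onto the input, processing, and output steps. In the Python sketch below, the daily hours and the hourly rate are hypothetical values chosen so that the result matches the $675.80 shown on the slide; the slide itself does not give them.

```python
from decimal import Decimal

# INPUT: hypothetical timecard with hours worked on each day of one week.
timecard_hours = {
    "Monday": Decimal("8"),
    "Tuesday": Decimal("7.5"),
    "Wednesday": Decimal("6"),
    "Thursday": Decimal("5.5"),
    "Friday": Decimal("4"),
}
HOURLY_RATE = Decimal("21.80")  # hypothetical hourly rate in dollars


def paycheck(hours: dict, rate: Decimal) -> Decimal:
    """PROCESSING: total the hours and multiply by the rate, rounded to cents."""
    total_hours = sum(hours.values(), Decimal("0"))
    return (total_hours * rate).quantize(Decimal("0.01"))


# OUTPUT: the amount to be paid.
pay = paycheck(timecard_hours, HOURLY_RATE)
print(f"${pay}")  # $675.80
```

`Decimal` is used instead of floating-point numbers so that the money amount is exact to the cent.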
50
Big-Data
Big Data is data of a huge size
ERA OF
ZETTABYTE
'Big Data' is a term used to describe
a collection of data that is huge in size and
still growing exponentially with time.
51
TONS OF DATA
Data that is very large in size, on the scale of
ZETTABYTES, is called Big Data.
Working with Big Data is problematic.
We need much more powerful software and
computers to work with Big Data.
Big Data Needs Storage & Processing
52
WE NEED FAST INTERNET CONNECTIVITY
TO DEAL WITH BIGDATA
Connection Speed Technology
Internet Technology   Data Rate             Bits/second      Bytes/second    Kilobytes/second
28.8K Modem           28.8 Kbps             28,800           3,600           3.5
36.6K Modem           36.6 Kbps             36,600           4,575           4.4
56K Modem             56 Kbps               56,000           7,000           6.8
ISDN                  128 Kbps              128,000          16,000          15
T1                    1.544 Mbps            1,544,000        193,000         188
DSL                   512 Kbps to 8 Mbps    8,000,000        1,000,000       976
Cable Modem           512 Kbps to 52 Mbps   53,000,000       6,625,000       6,469 (6.3 MB/sec)
T3                    44.736 Mbps           44,736,000       5,592,000       5,460 (5.3 MB/sec)
Gigabit Ethernet      1 Gbps                1,000,000,000    125,000,000     122,070 (119 MB/sec)
OC-256                13.271 Gbps           13,271,000,000   1,658,875,000   1,619,995 (1.5 GB/sec)
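The columns of this table are linked by simple conversions: divide bits by 8 to get bytes, then divide bytes by 1,024 to get kilobytes. The Python sketch below reproduces the T1 row and then estimates how long a 1 GB file would take on two of the links. The function names are illustrative, and the transfer times are idealized, assuming the full line rate with no overhead.

```python
def rate_breakdown(bits_per_second: int) -> tuple:
    """Return (bytes per second, kilobytes per second) for a link speed.

    Assumes 8 bits per byte and 1,024 bytes per kilobyte, as in the table.
    """
    bytes_per_second = bits_per_second / 8
    return bytes_per_second, bytes_per_second / 1024


def transfer_seconds(size_bytes: int, bits_per_second: int) -> float:
    """Idealized time to move size_bytes over a link, ignoring all overhead."""
    return size_bytes * 8 / bits_per_second


t1 = rate_breakdown(1_544_000)
print(t1)  # (193000.0, 188.4765625)

ONE_GB = 1024 ** 3  # one gigabyte in bytes
modem_time = round(transfer_seconds(ONE_GB, 56_000))             # 56K modem
ethernet_time = round(transfer_seconds(ONE_GB, 1_000_000_000))   # Gigabit Ethernet
print(modem_time, ethernet_time)  # 153392 9
```

Under these assumptions, the same file that needs under ten seconds on Gigabit Ethernet would need more than a day and a half on a 56K modem.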
SPEED AFFECTS BUSINESSES
WE NEED FAST CONNECTIVITY
TO
WORK WITH BIGDATA
53
The Various Units of Memory
Byte 01011111 01111111 00000000
Kilo Byte 1024 Bytes
Mega Byte 1024 Kilos
Giga Byte 1024 Mega
Tera Byte 1024 Giga
Peta Byte 1024 Tera
Exa Byte 1024 Peta
Zetta Byte 1024 Exa
Yotta Byte 1024 Zetta
54
Name        Equal To           Size (in Bytes)
Bit         1 bit              1/8
Nibble      4 bits             1/2 (rare)
Byte        8 bits             1
Kilobyte    1,024 bytes        1,024
Megabyte    1,024 kilobytes    1,048,576
Gigabyte    1,024 megabytes    1,073,741,824
Terabyte    1,024 gigabytes    1,099,511,627,776
Petabyte    1,024 terabytes    1,125,899,906,842,624
Exabyte     1,024 petabytes    1,152,921,504,606,846,976
Zettabyte   1,024 exabytes     1,180,591,620,717,411,303,424
Yottabyte   1,024 zettabytes   1,208,925,819,614,629,174,706,176
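Every row of this table from the kilobyte upward is the previous row multiplied by 1,024. A short Python sketch that regenerates the sizes that way (the names `UNITS` and `unit_sizes` are illustrative):

```python
# Unit names in increasing order; each is 1,024 times the previous one.
UNITS = ["Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]


def unit_sizes() -> dict:
    """Map each unit name to its size in bytes, using powers of 1,024."""
    return {name: 1024 ** power for power, name in enumerate(UNITS, start=1)}


sizes = unit_sizes()
for name, size in sizes.items():
    print(f"{name:<10} {size:,}")
```

Python integers have no fixed size limit, so even the yottabyte value is computed exactly.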
55
DELUGE OF DATA
DATA SCIENTISTS ARE NEEDED IN ALL
BUSINESSES
56
TONS OF DATA GENERATED
57
Current software fails to deal with Big Data
58
Current systems would be very slow, making it
almost impossible to deal with
BIGDATA
Normally we work on data of size MB
(Word Doc, Excel) or at most GB
(Movies), but data measured in petabytes or
zettabytes, i.e. 10^15 or 10^21 bytes, is called
Big Data and is impossible to handle this way
59
DATA SCIENCE ENGINEERS
GOOGLE WORKS WITH BIGDATA
60
Google processes more
than 20 petabytes of data
every day. This includes
around 3.5 billion search
queries.
“Data is the new oil of
Technology”
61
Volume of DATA
74 Zettabytes (74 trillion GBs) of data
would be generated by the Internet.
In short, such data is so large and
complex that none of the traditional data
management tools can store it or process it
efficiently.
62
The amount of data in the world was
estimated to be 44 zettabytes at the
dawn of 2020.
By 2025, the amount of data generated
each day is expected to reach 463
exabytes globally.
Google, Facebook, Microsoft, and
Amazon store at least 1,200
petabytes of information.
The world spends almost $1
million per minute on commodities on
the Internet based on BIGDATA
By 2025, there would be 75
billion Internet-of-Things (IoT)
devices in the world.
63
By 2030, nine out of every ten people
aged six and above would be digitally
active.
ZETTABYTES
The New York Stock Exchange generates about one
terabyte of new data per day.
64
Social media sites (Facebook, Google, LinkedIn, …)
generate 500 terabytes of new data
every day. This data is mainly generated
through photo and video uploads,
message exchanges, comments, etc.
A single jet engine can generate
10+ terabytes of data in 30 minutes of
flight time.
With many thousands of flights per day, the data
generated reaches many petabytes.
Weather Station: Weather stations
and satellites produce huge amounts of data, which are
stored and processed to forecast the
weather.
65
WEATHER PREDICTION COMPANIES CAN SELL
DATA TO ORGANISATIONS
Telecom company: Telecom companies like Ooredoo
and Vodafone study user trends and
publish their plans accordingly; for this,
they store the data of their millions of users for
analysis.
E-commerce site: Sites like Amazon, Flipkart, and
Alibaba generate huge amounts of logs from
which users' buying trends can be traced.
66
SOFTWARE TO HANDLE BIGDATA
What are you interested in? Science
Fiction, Perfumes, etc.
Software to analyse the log files and detect
user trends.
67
Identification: Your IP address.
Trends: Web pages you visited
Items you are interested in.
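A minimal Python sketch of this kind of log analysis follows. The log format (one line per visit, with the IP address followed by the page) and the sample lines are hypothetical and far simpler than a real web server log; the function names are illustrative.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified web log: "<IP address> <page visited>" per line.
LOG_LINES = """\
10.0.0.1 /books/science-fiction
10.0.0.2 /perfumes
10.0.0.1 /books/science-fiction
10.0.0.1 /perfumes
10.0.0.2 /perfumes
"""


def visits_by_visitor(log_text: str) -> dict:
    """Count how often each IP address requested each page."""
    visits = defaultdict(Counter)
    for line in log_text.splitlines():
        ip_address, page = line.split()
        visits[ip_address][page] += 1
    return dict(visits)


def top_interest(visits: dict) -> dict:
    """Pick the most visited page for each IP address."""
    return {ip: pages.most_common(1)[0][0] for ip, pages in visits.items()}


trends = top_interest(visits_by_visitor(LOG_LINES))
print(trends)  # {'10.0.0.1': '/books/science-fiction', '10.0.0.2': '/perfumes'}
```

Here the IP address plays the role of the visitor's identification, and the most visited page stands in for the trend.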
Processing these massive amounts of
data becomes possible with new
technologies like quantum computing and
Hadoop.
Quantum computers are among these
technologies; for certain kinds of problems they
are expected to work far faster than traditional
computers.
NEW COMPUTERS
QUANTUM COMPUTING TO HANDLE BIGDATA
68
QUANTUM COMPUTING TO WORK WITH BITCOINS
CRYPTO CURRENCY
69
How to process BigData
70
From DATA to KNOWLEDGE
CLEAN DATA, COMPLETE DATA
71
HADOOP FRAMEWORK FOR BIGDATA
Data are Text and Tables
CAN BE PROCESSED IN PARALLEL WITH
SEVERAL COMPUTERS
72
GOOGLE STRATEGY with HADOOP
GOOGLE STORES DATA ON DIFFERENT
COMPUTERS GROUPED INTO CLUSTERS
73
Hadoop Distributed File Systems
STORAGE
STORE DATA ON DIFFERENT COMPUTERS
GROUPED INTO CLUSTERS
74
Distributed Storage into Blocks
EVERY BLOCK CAN STORE A PORTION
OF DATA
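The idea of block storage can be illustrated with a toy Python sketch: cut the data into fixed-size blocks and spread the blocks over several machines. This is not how HDFS is implemented; real HDFS uses far larger blocks and also keeps several copies of each block, both of which are ignored here. The node names and the round-robin placement are assumptions made for the illustration.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def assign_blocks(blocks: list, nodes: list) -> dict:
    """Place blocks on nodes in round-robin order (a simplifying assumption)."""
    placement = {node: [] for node in nodes}
    for index, block in enumerate(blocks):
        placement[nodes[index % len(nodes)]].append(block)
    return placement


data = b"ABCDEFGHIJ"
blocks = split_into_blocks(data, block_size=4)
print(blocks)  # [b'ABCD', b'EFGH', b'IJ']

placement = assign_blocks(blocks, ["node-1", "node-2"])
print(placement)  # {'node-1': [b'ABCD', b'IJ'], 'node-2': [b'EFGH']}
```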
75
PROCESSING BIGDATA IN PARALLEL
FILTERING AND GIVING RESULTS AFTER
PROCESSING
DISPLAYING THE RESULTS
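Parallel processing of this kind can be sketched in Python as a small word count: each block is counted separately, and the partial results are then merged. Threads on one machine stand in for the computers of a cluster here, which is a simplification and not how Hadoop itself runs jobs; the sample blocks and function names are illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(block: str) -> Counter:
    """Process one block on its own: count the words it contains."""
    return Counter(block.lower().split())


def parallel_word_count(blocks: list) -> Counter:
    """Count words in every block in parallel, then merge the partial counts."""
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(count_words, blocks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical blocks of text, as if stored on three different machines.
text_blocks = ["Qatar hosts the World Cup", "the World Cup in Qatar", "fans love Qatar"]
totals = parallel_word_count(text_blocks)
print(totals["qatar"], totals["cup"])  # 3 2
```

Because each block is counted independently, adding more workers lets more blocks be handled at the same time.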
76
HADOOP FOR BIG DATA
STORE DATA IN BLOCKS
PROCESS DATA IN PARALLEL
77
78
DATA TYPES
79
Structured Data
Structured data can be defined as the data that
resides in a fixed field within a record. It is split
into multiple tables. All of the data follows the
same format. Structured data is easy to enter,
query, and analyze.
80
STRUCTURED DATA
81
Semi-Structured Data
To consider what semi-structured data is,
let's start with an analogy -- interviewing.
Let's say you're conducting a semi-
structured interview. This, as the name
implies, falls somewhere in-between a
structured and unstructured interview.
For context, a structured interview is one in
which the questions being asked, as well as the
order in which they are asked, are predetermined
by your HR team and consistent for
each candidate.
82
An unstructured interview, on the other hand,
is one in which the questions, and the order in
which they are asked, are at the discretion of
the interviewer -- and could be entirely different
for each candidate.
When you consider these two extremes, you
can begin to see the benefits of semi-structured
interviews, which are fairly consistent
and quantitative (like a structured interview),
but still provide the interviewer with a window
for building rapport and asking follow-up
questions.
Semi-structured data is similar in nature to
a semi-structured interview -- it's not as
messy and uncontrolled as unstructured
data, but not as rigid and readily
quantifiable as structured data.
83
Semi-structured data is information that does
not reside in a relational database or any other
data table, but nonetheless has some
organizational properties to make it easier to
analyze. A good example of semi-structured
data is HTML code to build web pages.
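The organizational properties of HTML are its tags, and they are what make it possible to pull specific pieces out of a page. A minimal Python sketch using the standard-library `html.parser` module follows; the sample page and the class name `LinkAndHeadingExtractor` are made up for the illustration.

```python
from html.parser import HTMLParser


class LinkAndHeadingExtractor(HTMLParser):
    """Collect the text of <h1> headings and the targets of <a> links."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_heading = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading and data.strip():
            self.headings.append(data.strip())


# Hypothetical page used only for this example.
PAGE = "<html><body><h1>Big Data</h1><p>Read <a href='/hadoop'>more</a>.</p></body></html>"

parser = LinkAndHeadingExtractor()
parser.feed(PAGE)
print(parser.headings, parser.links)  # ['Big Data'] ['/hadoop']
```

The page has no fixed rows or columns, yet the tags give enough structure to find the heading and the link.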
84
Unstructured Data
Data in any format, with no predefined structure.
85
Data Velocity Defined
Data velocity refers to the speed at which data
is generated and collected.
The velocity depends on factors such
as the number of sensors on IoT-enabled
devices and the number of
individuals using the Internet and social
media.
Velocity refers to the speed at which
data is entered into a system and must
be processed.
For example, Amazon captures every
click of the mouse while shoppers are
browsing on its website. This happens
rapidly.
86
It is incredibly important to have real-
time data at any time to make better
business decisions faster.
87
Search for Coco Chanel Perfume
Give the surfer spontaneous offers
Immediately propose some offers to the user
PROPOSE DISCOUNTS
88
89
DATA visualization – 3D DATA
TO DISSEMINATE IDEAS
ACCESS and ORACLE cannot handle Big Data
90
1. STRUCTURED
2. SEMI-STRUCTURED
3. UNSTRUCTURED
91
92
HADOOP IS SOFTWARE THAT HAS
MANY TOOLS TO
WORK WITH BIGDATA
USING CLUSTERS OF COMPUTERS
93
HADOOP IS FREE
94
95