1. Department of CSE- Data Science
Module-1
Introduction to Big Data, Big Data Analytics
2. Department of CSE- Data Science
Contents
Classification of data
Characteristics
Evolution and definition of Big data
What is Big data
Why Big data
Traditional Business Intelligence Vs Big Data
Typical data warehouse and Hadoop environment
Big Data Analytics: What is Big data Analytics
Classification of Analytics
Importance of Big Data Analytics
Terminologies used in Big data Environments
Few Top Analytical Tools, NoSQL, Hadoop.
3. Department of CSE- Data Science
Introduction
Data is present internal to the enterprise and also exists outside the four walls and
firewalls of the enterprise.
Data is present in homogeneous sources as well as in heterogeneous sources.
Data → Information
Information → Insights
5. Department of CSE- Data Science
Structured data
Data which is in an organized form (e.g., rows and columns) and can be
easily used by a computer program.
Relationships exist between entities of data, such as classes and their
objects.
Data stored in databases is an example of structured data.
6. Department of CSE- Data Science
Semi-structured data
Data which does not conform to a data model but has some structure.
It is not in a form which can be used easily by a computer program.
For example, XML and other markup languages such as HTML.
7. Department of CSE- Data Science
Unstructured data
Data which does not conform to a data model or is not in a form which can be
used easily by a computer program.
About 80%-90% of an organization's data is in this format.
For example, memos, chat rooms, PowerPoint presentations, images, videos,
letters, etc.
8. Department of CSE- Data Science
Structured Data
Most of the structured data is held in RDBMS.
An RDBMS conforms to the relational data model wherein the data is stored in
rows/columns.
The number of rows/records/tuples in a relation is called the cardinality of the
relation, and the number of columns is referred to as the degree of the relation.
The first step is the design of a relation/table, the fields/columns to store the data,
the type of data that will be stored [number (integer or real), alphabets, date,
Boolean, etc.].
9. Department of CSE- Data Science
Next we think of the constraints that we would like our data to conform to
(constraints such as UNIQUE values in the column, NOT NULL values in the
column, a business constraint such as the value held in the column should not
drop below 50, the set of permissible values in the column such as the column
should accept only “CS”, “IS”, “MS”, etc., as input).
Example: Let us design a table/relation structure to store the details of the
employees of an enterprise.
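A minimal sketch of such a table design, using Python's built-in sqlite3 module; the table name, column names, and the permissible department codes are illustrative assumptions, not taken from the slides:

    import sqlite3

    conn = sqlite3.connect(":memory:")      # throwaway in-memory database for the example
    conn.execute("""
        CREATE TABLE Employee (
            EmpNo    INTEGER PRIMARY KEY,                         -- unique, not null
            EmpName  TEXT    NOT NULL,
            Salary   REAL    CHECK (Salary >= 50),                -- a business constraint
            DeptNo   TEXT    CHECK (DeptNo IN ('CS', 'IS', 'MS')),-- permissible values
            JoinDate TEXT                                         -- date kept as ISO text
        )
    """)
    conn.execute("INSERT INTO Employee VALUES (1, 'Asha', 75000, 'CS', '2023-06-01')")
    conn.commit()

An insert that violates a NOT NULL or CHECK constraint raises sqlite3.IntegrityError, which is how the RDBMS enforces the constraints described above.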
11. Department of CSE- Data Science
The tables in an RDBMS can also be related. For example, the above “Employee”
table is related to the “Department” table on the basis of the common column,
“DeptNo”.
Fig: Relationship between “Employee” and “Department” tables
12. Department of CSE- Data Science
Sources of Structured Data
RDBMS products from leading vendors (Oracle, IBM DB2, Microsoft SQL Server, EMC
Greenplum, Teradata, and open-source options such as MySQL and PostgreSQL) are
used to hold transaction/operational data generated and collected by day-to-day
business activities.
The data of the On-Line Transaction Processing (OLTP) systems are generally quite
structured.
13. Department of CSE- Data Science
Ease of Working with Structured Data
1. Insert/update/delete: The Data Manipulation Language (DML) operations provide the required ease of data input, storage, access, processing, analysis, etc.
2. Security: Robust encryption and tokenization solutions are available to safeguard information throughout its lifecycle. Organizations can retain control and maintain compliance by ensuring that only authorized individuals are able to decrypt and view sensitive information.
14. Department of CSE- Data Science
3. Indexing: An index is a data structure that speeds up the data retrieval operations (primarily
the SELECT DML statement) at the cost of additional writes and storage space, but the
benefits that ensue in search operation are worth the additional writes and storage space.
4. Scalability: The storage and processing capabilities of a traditional RDBMS can be scaled up by increasing the horsepower of the database server (increasing the primary and secondary or peripheral storage capacity, the processing capacity of the processor, etc.).
5. Transaction processing: RDBMS has support for Atomicity, Consistency, Isolation, and
Durability (ACID) properties of transaction. Given next is a quick explanation of the ACID
properties:
Atomicity: A transaction is atomic, meaning that either it happens in its entirety or not at all.
Consistency: The database moves from one consistent state to another consistent state. In
other words, if the same piece of information is stored at two or more places, they are in
complete agreement.
Isolation: The resource allocation to the transaction happens such that the transaction gets
the impression that it is the only transaction happening in isolation.
Durability: All changes made to the database during a transaction are permanent and that
accounts for the durability of the transaction.
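A small sketch of atomicity in practice, again with Python's sqlite3 module (the Account table and the amounts are invented for the example): both updates inside the transaction either commit together or are rolled back together.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Account (AcctNo INTEGER PRIMARY KEY, Balance REAL NOT NULL)")
    conn.executemany("INSERT INTO Account VALUES (?, ?)", [(1, 500.0), (2, 300.0)])
    conn.commit()

    try:
        with conn:   # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE Account SET Balance = Balance - 100 WHERE AcctNo = 1")
            conn.execute("UPDATE Account SET Balance = Balance + 100 WHERE AcctNo = 2")
            # an exception raised here would undo BOTH updates: all or nothing
    except sqlite3.Error:
        pass         # the database is left in its previous consistent state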
15. Department of CSE- Data Science
Semi-structured Data
Semi-structured data is also referred to as having a self-describing structure.
Features
1. It does not conform to the data models that
one typically associates with relational
databases or any other form of data tables.
2. It uses tags to segregate semantic elements.
3. Tags are also used to enforce hierarchies of
records and fields within data. There is no
separation between the data and the schema.
The amount of structure used is dictated by
the purpose at hand.
4. In semi-structured data, entities belonging to the same class and grouped together need not necessarily have the same set of attributes. Even if they do have the same set of attributes, the order of the attributes may differ, and for all practical purposes this order is not important.
16. Department of CSE- Data Science
Sources of Semi-Structured Data
1. XML: eXtensible Markup Language (XML) has been hugely popularized by web services developed using the Simple Object Access Protocol (SOAP) principles.
2. JSON: JavaScript Object Notation (JSON) is used to transmit data between a server and a web application. JSON has been popularized by web services developed using Representational State Transfer (REST), an architectural style for creating scalable web services. MongoDB (an open-source, distributed, NoSQL, document-oriented database) and Couchbase (originally known as Membase; an open-source, distributed, NoSQL, document-oriented database) store data natively in the JSON format.
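A short sketch of what such self-describing data looks like, using Python's json module; the field names are invented for illustration:

    import json

    # Two records of the same class (employee) with different sets of attributes --
    # perfectly valid JSON, but awkward to force into one fixed relational schema.
    records = [
        '{"name": "Asha", "dept": "CS", "skills": ["Python", "SQL"]}',
        '{"name": "Ravi", "dept": "IS", "manager": "Asha"}',
    ]
    for raw in records:
        doc = json.loads(raw)                    # parse the self-describing document
        print(doc["name"], doc.get("skills", "no skills listed"))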
18. Department of CSE- Data Science
Unstructured Data
Unstructured data does not conform to any pre-defined data model.
The structure is quite unpredictable.
Table :Few examples of disparate unstructured data
20. Department of CSE- Data Science
Issues with Unstructured Data
Although unstructured data is known not to conform to a pre-defined data model or be organized in a pre-defined manner, there are instances wherein the structure of the data can still be implied.
21. Department of CSE- Data Science
Dealing with Unstructured Data
Today, unstructured data constitutes approximately 80% of the data that is
being generated in any enterprise.
The balance is clearly shifting in favor of unstructured data, as shown in the figure below.
It is such a big percentage that it cannot be ignored.
Figure : Unstructured data clearly constitutes a major percentage of
enterprise data.
22. Department of CSE- Data Science
The following techniques are used to find patterns in or interpret unstructured data:
1. Data mining: First, we deal with large data sets. Second, we use methods at the intersection
of artificial intelligence, machine learning, statistics, and database systems to unearth
consistent patterns in large data sets and/or systematic relationships between variables. It is
the analysis step of the “knowledge discovery in databases” process. Popular algorithms are
as follows:
i. Association rule mining: It is also called “market basket analysis” or “affinity analysis”. It is
used to determine "What goes with what?", that is, when you buy a product, which other product are you likely to purchase with it. For example, if you pick up bread at the grocery store, are you likely to pick up eggs or cheese to go with it?
Figure : Dealing with unstructured data
23. Department of CSE- Data Science
ii. Regression analysis: It helps to predict the relationship between two variables. The variable
whose value needs to be predicted is called the dependent variable and the variables which
are used to predict the value are referred to as the independent variables.
iii. Collaborative filtering: It is about predicting a user's preference or preferences based on the
preferences of a group of users.
Table:
We are looking at predicting whether User 4 will prefer to learn using videos or is a textual
learner depending on one or a couple of his or her known preferences.
We analyze the preferences of similar user profiles and on the basis of it, predict that User 4
will also like to learn using videos and is not a textual learner.
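A toy sketch of this idea in Python: score each known user by how many of User 4's known preferences they share, and borrow the most similar user's preference. The preference values below are invented for illustration (1 = likes video learning, 0 = does not).

    prefs = {
        "User1": {"video": 1, "text": 0},
        "User2": {"video": 1, "text": 0},
        "User3": {"video": 0, "text": 1},
    }
    user4 = {"text": 0}      # the only preference we know for User 4

    def agreement(known, other):
        # Count how many of User 4's known preferences the other user shares.
        return sum(1 for k, v in known.items() if other.get(k) == v)

    most_similar = max(prefs, key=lambda u: agreement(user4, prefs[u]))
    predicted_video = prefs[most_similar]["video"]
    print(most_similar, "is most similar; predicted video preference:", predicted_video)

Real collaborative filtering systems use similarity measures such as cosine similarity or Pearson correlation over many users and items, but the borrow-from-similar-users idea is the same.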
24. Department of CSE- Data Science
2. Text analytics or text mining: Compared to the structured data stored in relational
databases, text is largely unstructured, amorphous, and difficult to deal with
algorithmically. Text mining is the process of gleaning high quality and meaningful
information (through devising of patterns and trends by means of statistical pattern
learning) from text. It includes tasks such as text categorization, text clustering,
sentiment analysis, concept/entity extraction, etc.
3. Natural language processing (NLP): It is related to the area of human computer
interaction. It is about enabling computers to understand human or natural
language input.
4. Noisy text analytics: It is the process of extracting structured or semi-structured
information from noisy unstructured data such as chats, blogs, wikis, emails,
message boards, text messages, etc. The noisy unstructured data usually comprises one or more of the following: spelling mistakes, abbreviations, acronyms, non-standard words, missing punctuation, missing letter case, filler words such as "uh", "um", etc.
25. Department of CSE- Data Science
5. Manual tagging with metadata: This is about tagging manually with adequate metadata
to provide the requisite semantics to understand unstructured data.
6. Part-of-speech tagging: It is also called POS or POST or grammatical tagging. It is the
process of reading text and tagging each word in the sentence as belonging to a particular
part of speech such as “noun”, “verb”, “adjective”, etc.
7. Unstructured Information Management Architecture (UIMA): It is an open source
platform from IBM. It is used for real-time content analytics. It is about processing text
and other unstructured data to find latent meaning and relevant relationships buried therein.
26. Department of CSE- Data Science
Properties | Structured data | Semi-structured data | Unstructured data
Technology | Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data
Transaction management | Matured transactions and various concurrency techniques | Transactions adapted from the DBMS, not matured | No transaction management and no concurrency
Version management | Versioning over tuples, rows, and tables | Versioning over tuples or graphs is possible | Versioned as a whole
Flexibility | Schema-dependent and less flexible | More flexible than structured data but less flexible than unstructured data | More flexible; absence of schema
Scalability | Scaling the DB schema is very difficult | Scaling is simpler than for structured data | More scalable
Robustness | Very robust | Newer technology, not very widespread | -
Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible
29. Department of CSE- Data Science
Characteristics of Data
Data has three characteristics:
1. Composition: deals with structure of data, that is, the
sources of data , the granularity, the types, and the nature
of the data as to whether it is static or real-time
streaming.
2. Condition: The condition of data deals with the state of
the data that is “can one use this data as is for analysis?”
or “Does it require cleansing for further enhancement
and enrichment?”
3. Context: deals with “Where has this data been
generated?”, “Why was this data generated?” and so on.
Figure: Characteristics of data
30. Department of CSE- Data Science
EVOLUTION OF BIG DATA
1970s and before was the era of mainframes. The data was essentially primitive and
structured.
Relational databases evolved in 1980s and 1990s. The era was of data intensive applications.
The World Wide Web (WWW) and the Internet of Things (IoT) have led to an onslaught of structured, unstructured, and multimedia data.
Table : The evolution of big data
31. Department of CSE- Data Science
Definition of Big Data
Figure : Definition of big data.
Anything beyond the human and technical
infrastructure needed to support storage,
processing, and analysis.
Terabytes or petabytes or zettabytes of data.
I think it is about 3 Vs.
33. Department of CSE- Data Science
Challenges With Big Data
1. Data today is growing at an exponential rate.
This high tide of data will continue to rise
incessantly. The key questions here are: “Will
all this data be useful for analysis?”, “Do we
work with all this data or a subset of it?”, “How
will we separate the knowledge from the
noise?”, etc.
2. Cloud computing and virtualization are here to
stay. Cloud computing is the answer to
managing infrastructure for big data as far as
cost-efficiency, elasticity, and easy
upgrading/downgrading is concerned. This
further complicates the decision to host big data solutions outside the enterprise.
34. Department of CSE- Data Science
3. The other challenge is to decide on the period of retention of big data. Just how long
should one retain this data? Some data is useful for making long-term decisions, whereas in a few cases the data may quickly become irrelevant and obsolete just a few hours after having been generated.
4. There is a dearth of skilled professionals who possess a high level of proficiency in data
sciences that is vital in implementing big data solutions.
5. Then, of course, there are other challenges with respect to capture, storage,
preparation, search, analysis, transfer, security, and visualization of big data. There is
no explicit definition of how big the dataset should be for it to be considered “big
data." Here we have to deal with data that is just too big, moves way too fast, and does not fit the structures of typical database systems. The data changes are highly dynamic, and therefore there is a need to ingest it as quickly as possible.
6. Data visualization is becoming popular as a separate discipline. We are short by quite a
number, as far as business visualization experts are concerned.
35. Department of CSE- Data Science
WHAT IS BIG DATA?
Big data is data that is big in volume, velocity, and variety.
Fig: Data: Big in volume, variety, and velocity. Fig: Growth of data
36. Department of CSE- Data Science
Volume
We have seen it grow from bits to bytes to petabytes and
exabytes.
Where Does This Data get Generated?
→ There are a multitude of sources for big data.
→ An XLS, a DOC, a PDF, etc. is unstructured data;
→ a video on YouTube, a chat conversation on an Internet messenger, a customer feedback form on an online retail website, CCTV footage, and a weather forecast report are unstructured data too.
Fig: A mountain of data.
37. Department of CSE- Data Science
Figure: Sources of big data.
Typical internal data sources: Data present within an organization’s firewall. It is as
follows:
→ Data storage: File systems, SQL (RDBMSs — Oracle, MS SQL Server, DB2, MySQL,
PostgreSQL, etc.), NoSQL (MongoDB, Cassandra, etc.), and so on.
→ Archives: Archives of scanned documents, paper archives, customer
correspondence records, patients’ health records, students’ admission records,
students’ assessment records, and so on.
38. Department of CSE- Data Science
External data sources: Data residing outside an organization’s firewall. It is as follows:
→ Public Web: Wikipedia, weather, regulatory, compliance, census, etc.
Both (internal+external)
→ Sensor data – Car sensors, smart electric meters, office buildings, air conditioning units, refrigerators, and so on.
→ Machine log data – Event logs, application logs, Business process logs, audit logs,
clickstream data, etc.
→ Social media – Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
→ Business apps – ERP,CRM, HR, Google Docs, and so on.
→ Media – Audio, Video, Image, Podcast, etc.
→ Docs – CSV, Word files, XLS, PPT, and so on.
39. Department of CSE- Data Science
Velocity
We have moved from the days of batch processing to real-time processing.
Variety
Variety deals with a wide range of data types and sources of data.
1. Structured data: From traditional transaction processing systems and RDBMS, etc.
2. Semi-structured data: For example Hyper Text Markup Language (HTML),
eXtensible Markup Language (XML).
3. Unstructured data: For example unstructured text documents, audios, videos,
emails, photos, PDFs, social media, etc.
Batch → Periodic → Near real-time → Real-time processing
41. Department of CSE- Data Science
Traditional Business Intelligence (BI) Versus Big Data
Business Intelligence | Big Data
All the enterprise's data is housed in a central server | In a big data environment, data resides in a distributed file system
Scales vertically | Scales in or out horizontally
Traditional BI is about structured data, and the data is taken to the processing functions | Big data is about variety, and the processing functions are taken to the data
42. Department of CSE- Data Science
A Typical Data Warehouse Environment
Operational or transactional or day-to-day
business data is gathered from Enterprise
Resource Planning (ERP) systems,
Customer Relationship Management
(CRM), legacy systems, and several third
party applications.
The data from these sources may differ in format
Data may come from data sources located in the same geography or different geographies.
This data is then integrated, cleaned up, transformed, and standardized through the process
of Extraction, Transformation, and Loading (ETL).
The transformed data is then loaded into the enterprise data warehouse or data marts.
Business intelligence and analytics tools are then used to enable decision making
Fig: A typical data warehouse environment.
43. Department of CSE- Data Science
A Typical Hadoop Environment
The data sources are quite
disparate from web logs to
images, audios and videos to
social media data to the
various docs, pdfs, etc
Here the data in focus is not just the data within the company's firewall but also data
residing outside the company's firewall. This data is placed in Hadoop Distributed File
System (HDFS).
If need be, this data can be repopulated back to the operational systems or fed to the enterprise data warehouse, data marts, or an Operational Data Store (ODS) to be picked up for further processing and analysis.
Fig: A typical Hadoop environment
44. Department of CSE- Data Science
WHAT IS BIG DATA ANALYTICS?
Big Data Analytics is
1. Technology-enabled analytics: Quite a few data analytics and visualization tools are
available in the market today from leading vendors such as IBM, Tableau, SAS, R Analytics, Statistica, World Programming System (WPS), etc., to help process and analyze your big data.
2. About gaining a meaningful, deeper, and richer insight into your business to steer in the
right direction, understanding the customer’s demographics to cross-sell and up-sell to
them, better leveraging the services of your vendors and suppliers, etc.
45. Department of CSE- Data Science
3. About a competitive edge over your competitors by enabling you with findings that allow
quicker and better decision-making.
4. A tight handshake between three communities: IT, business users, and data scientists.
5. Working with datasets whose volume and variety exceed the current storage and
processing capabilities and infrastructure of your enterprise.
6. About moving code to data. This makes perfect sense as the program for distributed
processing is tiny (just a few KBs) compared to the data (Terabytes or Petabytes today and
likely to be Exabytes or Zettabytes in the near future).
46. Department of CSE- Data Science
Classification Of Analytics
There are basically two schools of thought:
1.Those that classify analytics into basic, operationalized, advanced, and monetized.
2.Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.
First School of Thought
1. Basic analytics: This primarily is slicing and dicing of data to help with basic business insights. This is about reporting on historical data, basic visualization, etc.
2. Operationalized analytics: It is operationalized analytics if it gets woven into the enterprise's business processes.
3. Advanced analytics: This largely is about forecasting for the future by way of predictive and prescriptive modeling.
4. Monetized analytics: This is analytics in use to derive direct business revenue.
47. Department of CSE- Data Science
Second School of Thought
• Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0.
Table : Analytics 1.0, 2.0, and 3.0
50. Department of CSE- Data Science
Importance of Big Data Analytics
Let us study the various approaches to analysis of data and what it leads to.
1. Reactive — Business Intelligence: What does Business Intelligence (BI) help us with? It
allows the businesses to make faster and better decisions by providing the right
information to the right person at the right time in the right format. It is about analysis
of the past or historical data and then displaying the findings of the analysis or reports
in the form of enterprise dashboards, alerts, notifications, etc. It has support for both
pre-specified reports as well as ad hoc querying.
2. Reactive — Big Data Analytics: Here the analysis is done on huge datasets but the
approach is still reactive as it is still based on static data.
51. Department of CSE- Data Science
3. Proactive — Analytics: This is to support futuristic decision making by the use of data
mining, predictive modeling, text mining, and statistical analysis. This analysis is not on big data, as it still uses traditional database management practices and therefore has severe limitations on storage capacity and processing capability.
4. Proactive - Big Data Analytics: This is sieving through terabytes, petabytes, exabytes of
information to filter out the relevant data to analyze. This also includes high performance
analytics to gain rapid insights from big data and the ability to solve complex problems
using more data.
52. Department of CSE- Data Science
Terminologies used in Big data Environments
In-Memory Analytics
Data access from non-volatile storage such as hard disk is a slow process. The more the
data is required to be fetched from hard disk or secondary storage, the slower the process
gets. One way to combat this challenge is to pre-process and store data (cubes, aggregate
tables, query sets, etc.) so that the CPU has to fetch a small subset of records. But this
requires thinking in advance as to what data will be required for analysis.
If there is a need for different or more data, it is back to the initial process of pre-
computing and storing data or fetching it from secondary storage. This problem has been
addressed using in-memory analytics. Here all the relevant data is stored in Random Access
Memory (RAM) or primary storage thus eliminating the need to access the data from hard
disk. The advantage is faster access, rapid deployment, better insights, and minimal IT
involvement.
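A minimal, purely illustrative contrast in Python between the pre-computed-aggregate approach and holding the detail rows in RAM so that new questions can be answered without another trip to disk (the data is invented):

    # Pre-computed approach: only the aggregates chosen in advance are available.
    precomputed = {"sales_by_region": {"North": 1200, "South": 950}}

    # In-memory approach: the detail rows themselves sit in RAM, so new slices
    # (by product, by month, ...) can be computed on demand.
    rows = [
        {"region": "North", "product": "A", "month": "Jan", "sales": 700},
        {"region": "North", "product": "B", "month": "Feb", "sales": 500},
        {"region": "South", "product": "A", "month": "Jan", "sales": 950},
    ]
    sales_by_product = {}
    for r in rows:
        sales_by_product[r["product"]] = sales_by_product.get(r["product"], 0) + r["sales"]
    print(sales_by_product)   # {'A': 1650, 'B': 500} -- a question that was never pre-computed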
53. Department of CSE- Data Science
In-Database Processing
In-database processing is also called in-database analytics. It works by fusing data warehouses with analytical systems.
Typically the data from various enterprise On Line Transaction Processing (OLTP)
systems after cleaning up (de-duplication, scrubbing, etc.) through the process of
ETL is stored in the Enterprise Data Warehouse (EDW) or data marts.
The huge datasets are then exported to analytical programs for complex and
extensive computations.
With in-database processing, the database program itself can run the
computations eliminating the need for export and thereby saving on time.
Leading database vendors are offering this feature to large businesses.
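A small sqlite3 sketch of the difference: exporting every row to the analytical program versus pushing the aggregate into the database engine and returning only the small result set (table and values invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Sales (Region TEXT, Amount REAL)")
    conn.executemany("INSERT INTO Sales VALUES (?, ?)",
                     [("North", 700), ("North", 500), ("South", 950)])

    # Export-then-compute: every row leaves the database before it is analyzed.
    totals = {}
    for region, amount in conn.execute("SELECT Region, Amount FROM Sales"):
        totals[region] = totals.get(region, 0) + amount

    # In-database processing: the database engine runs the computation itself,
    # and only the aggregated result crosses the boundary.
    in_db = conn.execute("SELECT Region, SUM(Amount) FROM Sales GROUP BY Region").fetchall()
    print(totals, in_db)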
54. Department of CSE- Data Science
Symmetric Multiprocessor System (SMP)
• In SMP there is a single common main memory that is shared by two or more
identical processors.
• The processors have full access to all I/O devices and are controlled by a single
operating system instance.
• SMP systems are tightly coupled multiprocessor systems. Each processor has its own high-speed cache memory, and the processors are connected using a system bus.
Figure : Symmetric Multiprocessor
System.
55. Department of CSE- Data Science
Massively Parallel Processing
Massively Parallel Processing (MPP) refers to the coordinated processing of programs by a number of processors working in parallel.
Each processor has its own operating system and dedicated memory.
They work on different parts of the same program.
The MPP processors communicate using some sort of messaging interface. The
MPP systems are more difficult to program as the application must be divided in
such a way that all the executing segments can communicate with each other.
MPP is different from Symmetric Multiprocessing (SMP) in that in SMP the processors share the same operating system and the same memory. SMP is also referred to as tightly-coupled multiprocessing.
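A rough Python sketch of the MPP idea using the standard multiprocessing module: each worker process has its own private memory, works on its own slice of the data, and sends its result back through a message-passing mechanism (the data and the four-way split are invented):

    from multiprocessing import Pool

    def partial_sum(chunk):
        # Runs in a separate worker process with its own dedicated memory.
        return sum(chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        chunks = [data[i::4] for i in range(4)]        # divide the work four ways
        with Pool(processes=4) as pool:
            partials = pool.map(partial_sum, chunks)   # results return as messages
        print(sum(partials))                           # combine the partial results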
56. Department of CSE- Data Science
Difference Between Parallel and Distributed Systems
Parallel Systems
A parallel database system is a tightly coupled system. The processors co-operate
for query processing.
Figure : Parallel
system
57. Department of CSE- Data Science
The user is unaware of the parallelism since he/she has no access to a specific
processor of the system.
Either the processors have access to a common memory or make use of message
passing for communication.
Figure : Parallel system.
58. Department of CSE- Data Science
Distributed database systems
Distributed database systems are known to be loosely coupled and are composed of individual machines.
Each of the machines can run their individual application and serve their own
respective user. The data is usually distributed across several machines,
thereby necessitating quite a number of machines to be accessed to answer a
user query.
Figure : Distributed system.
59. Department of CSE- Data Science
Shared Nothing Architecture
The three most common types of architecture for multiprocessor high-transaction-rate systems are:
1. Shared Memory (SM)
2. Shared Disk (SD).
3. Shared Nothing (SN).
In shared memory architecture, a common central memory is shared by multiple
processors.
In shared disk architecture, multiple processors share a common collection of disks
while having their own private memory
In shared nothing architecture, neither memory nor disk is shared among multiple
processors.
60. Department of CSE- Data Science
Advantages of a “Shared Nothing Architecture”
1. Fault Isolation: A "Shared Nothing Architecture" provides the benefit of fault isolation. A fault in a single node is contained and confined to that node exclusively and exposed only through messages (or the lack of them).
2. Scalability: Assume that the disk is a shared resource. This implies that the controller and the disk bandwidth are also shared. Synchronization would have to be implemented to maintain a consistent shared state, which means that different nodes would have to take turns to access the critical data. This imposes a limit on how many nodes can be added to a distributed shared-disk system, thus compromising scalability.
61. Department of CSE- Data Science
CAP Theorem Explained
The CAP theorem is also called Brewer's theorem.
It states that in a distributed computing environment it is impossible to simultaneously provide all three of the following guarantees; at most two of them can be guaranteed at any time.
1. Consistency
2. Availability
3. Partition tolerance
Consistency implies that every read fetches the last write.
Availability implies that reads and writes always succeed. Each non-failing node will
return a response in a reasonable amount of time.
Partition tolerance implies that the system will continue to function when a network partition occurs.
Figure : Brewer's
CAP
62. Department of CSE- Data Science
Examples of databases that follow one of the possible three combinations
1.Availability and Partition Tolerance (AP)
2.Consistency and Partition Tolerance (CP)
3.Consistency and Availability (CA)
Figure : Databases and
CAP
63. Department of CSE- Data Science
Classroom Activity
Puzzle on CAP Theorem
66. Department of CSE- Data Science
NoSQL (NOT ONLY SQL)
The term NoSQL was first coined by Carlo Strozzi in 1998 to name his
lightweight, open-source, relational database that did not expose the standard
SQL interface.
A few features of NoSQL databases are as follows:
1. They are open source.
2. They are nonrelational
3. They are distributed
4. They are schema less
5. They are cluster friendly
6. They are born out of 21st
century web applications.
67. Department of CSE- Data Science
Where is it Used?
NoSQL databases are widely used in big data and other real-time web applications.
NoSQL databases are used to store log data, which can then be pulled for analysis.
They are also used to store social media data and all such data that cannot be stored and analyzed comfortably in an RDBMS.
Figure : Where to use NoSQL?
68. Department of CSE- Data Science
What is it?
NoSQL stands for Not Only SQL.
These are non-relational, open source, distributed databases.
They are hugely popular today owing to their ability to scale out or scale horizontally and their adeptness at dealing with a rich variety of data: structured, semi-structured, and unstructured.
Figure: What is NoSQL?
69. Department of CSE- Data Science
1. Are non-relational: They do not adhere to the relational data model. In fact, they are either key-value, document-oriented, column-oriented, or graph-based databases.
2. Are distributed: They are distributed meaning the data is distributed across
several nodes in a cluster constituted of low-cost commodity hardware.
3. Offer no support for ACID properties (Atomicity, Consistency, Isolation, and
Durability): They do not offer support for ACID properties of transactions. On the
contrary, they have adherence to Brewer’s CAP (Consistency, Availability, and
Partition tolerance) theorem and are often seen compromising on consistency in
favor of availability and partition tolerance.
4. Provide no fixed table schema: NoSQL databases are becoming increasingly popular owing to the flexibility they offer with respect to the schema. They do not mandate that the data strictly adhere to any schema structure at the time of storage.
70. Department of CSE- Data Science
Types of NoSQL Databases
1. Key-value or the big hash table.
2. Schema-less (document, column-oriented, and graph databases).
Figure : Types of NoSQL databases
71. Department of CSE- Data Science
1. Key-value
It maintains a big hash table of keys and values.
For example, Dynamo, Redis, Riak, etc. Sample Key-Value Pair in Key-Value Database
2. Document
It maintains data in collections constituted of documents.
For example, MongoDB, Apache CouchDB, Couchbase, MarkLogic, etc.
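A tiny, library-free sketch of how the same entity might be represented in a key-value store versus a document store; real systems such as Redis or MongoDB have much richer APIs, and the keys and fields below are invented:

    # Key-value: an opaque value looked up by key, as in one big hash table.
    kv_store = {}
    kv_store["user:1001"] = '{"name": "Asha", "city": "Mysuru"}'
    print(kv_store["user:1001"])

    # Document: the value is a structured, queryable document in a collection.
    users_collection = [
        {"_id": 1001, "name": "Asha", "city": "Mysuru", "skills": ["Python"]},
        {"_id": 1002, "name": "Ravi", "city": "Bengaluru"},   # different fields are fine
    ]
    print([u["name"] for u in users_collection if u.get("city") == "Mysuru"])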
72. Department of CSE- Data Science
3. Column
Each storage block has data from only one column.
For example: Cassandra, HBase, etc.
4. Graph
They are also called network databases. A graph database stores data in nodes.
For example, Neo4j, HyperGraphDB, etc.
74. Department of CSE- Data Science
Why NoSQL?
1. It has a scale-out architecture instead of the monolithic architecture of relational databases.
2. It can house large volumes of structured, semi-structured, and unstructured data.
3. Dynamic schema: NoSQL database allows insertion of data without a pre-defined schema.
In other words, it facilitates application changes in real time, which thus supports faster
development, easy code integration, and requires less database administration.
4. Auto-sharding: It automatically spreads data across an arbitrary number of servers. The
application in question is often not even aware of the composition of the server pool.
It balances the load of data and query on the available servers; and if and when a server
goes down, it is quickly replaced without any major activity disruptions.
5. Replication: It offers good support for replication which in turn guarantees high availability,
fault tolerance, and disaster recovery.
75. Department of CSE- Data Science
Advantages of NoSQL
1. Can easily scale up and down: NoSQL databases support scaling rapidly and elastically, and even allow scaling to the cloud.
a. Cluster scale: It allows distribution of database across 100+ nodes often in
multiple data centers.
b. Performance scale: It sustains over 100,000+ database reads and writes
per second.
c. Data scale: It supports housing of 1 billion+ documents in the database.
76. Department of CSE- Data Science
2. Doesn't require a pre-defined schema: NoSQL does not require any adherence to a pre-defined schema. It is pretty flexible. For example, if we look at MongoDB, the documents in a collection can have different sets of key-value pairs.
3. Cheap, easy to implement: Deploying NoSQL properly allows for all of the benefits of
scale, high availability, fault tolerance, etc. while also lowering operational costs.
4. Relaxes the data consistency requirement: NoSQL databases have adherence to CAP
theorem (Consistency, Availability, and Partition tolerance). Most of the NoSQL
databases compromise on consistency in favor of availability and partition tolerance.
77. Department of CSE- Data Science
5. Data can be replicated to multiple nodes and can be partitioned: There are two
terms that we will discuss here:
a) Sharding: Sharding is when different pieces of data are distributed across multiple
servers. NoSQL databases support auto-sharding; this means that they can natively
and automatically spread data across an arbitrary number of servers, without
requiring the application to even be aware of the composition of the server pool.
Servers can be added or removed from the data layer without application
downtime. This would mean that data and query load are automatically balanced
across servers, and when a server goes down, it can be quickly and transparently
replaced with no application disruption.
b) Replication: Replication is when multiple copies of data are stored across the cluster
and even across data centers. This promises high availability and fault tolerance.
78. Department of CSE- Data Science
What We Miss With NoSQL?
NoSQL does not support joins. However, it compensates for this by allowing embedded documents, as in MongoDB.
It does not have provision for the ACID properties of transactions. However, it obeys Brewer's CAP theorem.
NoSQL does not have a standard SQL interface, but NoSQL databases such as MongoDB and Cassandra have their own rich query languages to compensate for the lack of it.
79. Department of CSE- Data Science
Use of NoSQL in Industry
NoSQL is being put to use in varied industries. They are used to support analysis for
applications such as web user data analysis, log analysis, sensor feed analysis, making
recommendations for upsell and cross-sell etc.
82. Department of CSE- Data Science
NewSQL
We need a database that has the same scalable performance of NoSQL systems for
On Line Transaction Processing (OLTP) while still maintaining the ACID guarantees of a
traditional database. This new modern RDBMS is called NewSQL.
It supports the relational data model and uses SQL as its primary interface.
NewSQL is based on the shared nothing architecture with a SQL interface for
application interaction.
85. Department of CSE- Data Science
HADOOP
Hadoop is an open-source project of the Apache foundation.
It is a framework written in Java, originally developed by Doug Cutting in 2005
who named it after his son's toy elephant. He was working with Yahoo then.
It was created to support distribution for “Nutch”, the text search engine.
Hadoop uses Google’s MapReduce and Google File System technologies as its
foundation.
Hadoop is now a core part of the computing infrastructure for companies such as Yahoo, Facebook, LinkedIn, Twitter, etc.
87. Department of CSE- Data Science
Features of Hadoop
1. It is optimized to handle massive quantities of structured, semi-structured, and
unstructured data, using commodity hardware, that is, relatively inexpensive
computers.
2. Hadoop has a shared nothing architecture.
3. It replicates its data across multiple computers so that if one goes down, the data can
still be processed from another machine that stores its replica.
4. Hadoop is for high throughput rather than low latency. It is a batch operation handling
massive quantities of data; therefore the response time is not immediate.
5. It complements On-Line Transaction Processing (OLTP) and On-Line Analytical
Processing (OLAP). However, it is not a replacement for a relational database
management system.
6. It is NOT good when work cannot be parallelized or when there are dependencies
within the data.
7. It is NOT good for processing small files. It works best with huge data files and
datasets.
89. Department of CSE- Data Science
1. Stores data in its native format: Hadoop’s data storage framework (HDFS — Hadoop
Distributed File System) can store data in its native format. There is no structure that
is imposed while keying in data or storing data. HDFS is pretty much schema-less. It is
only later when the data needs to be processed that structure is imposed on the raw
data.
2. Scalable: Hadoop can store and distribute very large datasets (involving thousands of
terabytes of data) across hundreds of inexpensive servers that operate in parallel.
3. Cost-effective: Owing to its scale-out architecture, Hadoop has a much reduced
cost/terabyte of storage and processing.
90. Department of CSE- Data Science
4. Resilient to failure: Hadoop is fault-tolerant. It practices replication of data diligently,
which means whenever data is sent to any node, the same data also gets replicated to
other nodes in the cluster, thereby ensuring that in the event of a node failure, there will
always be another copy of data available for use.
5. Flexibility: One of the key advantages of Hadoop is its ability to work with all kinds of
data: structured, semi-structured, and unstructured data. It can help derive meaningful
business insights from email conversations, social media data, click-stream data, etc. It
can be put to several purposes such as log analysis, data mining, recommendation
systems, market campaign analysis, etc.
6. Fast: Processing is extremely fast in Hadoop as compared to other conventional systems
owing to the “move code to data” paradigm.
91. Department of CSE- Data Science
Versions of Hadoop
There are two versions of Hadoop available:
1. Hadoop 1.0
2. Hadoop 2.0
92. Department of CSE- Data Science
Hadoop 1.0
It has two main parts:
1. Data storage framework: It is a general-purpose file system called Hadoop
Distributed File System(HDFS). HDFS is schema-less. It simply stores data files.
These data files can be in just about any format. The idea is to store files as close to
their original form as possible. This in turn provides the business units and the organization the much-needed flexibility and agility without being overly worried about what it can implement.
2. Data processing framework: This is a simple functional programming model initially
popularized by Google as MapReduce. It essentially uses two functions: the MAP
and the REDUCE functions to process data. The "Mappers" take in a set of key-value pairs and generate intermediate data (which is another list of key-value pairs). The "Reducers" then act on this input to produce the output data. The two functions seemingly work in isolation from one another, thus enabling the processing to be highly distributed in a highly parallel, fault-tolerant, and scalable way.
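A minimal pure-Python sketch of this model for the classic word-count example; it mimics the map, shuffle, and reduce steps only and does not use Hadoop itself:

    from collections import defaultdict

    def mapper(line):
        # Emit an intermediate (key, value) pair for every word in the line.
        return [(word.lower(), 1) for word in line.split()]

    def reducer(word, counts):
        # Combine all the values that share the same key.
        return word, sum(counts)

    lines = ["big data is big", "data needs processing"]

    # "Shuffle": group the intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)

    results = dict(reducer(w, c) for w, c in grouped.items())
    print(results)   # {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'processing': 1}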
93. Department of CSE- Data Science
Limitations of Hadoop 1.0
1. The first limitation was the requirement for MapReduce programming expertise along
with proficiency required in other programming languages, notably Java.
2. It supported only batch processing, which, although suitable for tasks such as log analysis and large-scale data mining projects, is pretty much unsuitable for other kinds of projects.
3. One major limitation was that Hadoop 1.0 was tightly computationally coupled with
MapReduce, which meant that the established data management vendors were left
with two options: Either rewrite their functionality in MapReduce so that it could be
executed in Hadoop or extract the data from HDFS and process it outside of Hadoop.
Neither option was viable, as both led to process inefficiencies caused by the data being moved in and out of the Hadoop cluster.
94. Department of CSE- Data Science
Hadoop 2.0
HDFS continues to be the data storage framework.
A new and separate resource management framework called Yet Another Resource
Negotiator (YARN) has been added.
Any application capable of dividing itself into parallel tasks is supported by YARN.
YARN coordinates the allocation of subtasks of the submitted application, thereby
further enhancing the flexibility, scalability, and efficiency of the applications.
It works by having an ApplicationMaster which is able to run any application and not
just MapReduce.
It not only supports batch processing but also real-time processing.
95. Department of CSE- Data Science
Overview of Hadoop Ecosystems
There are components available in the Hadoop ecosystem for data ingestion, processing,
and analysis.
Data Ingestion → Data Processing → Data Analysis
96. Department of CSE- Data Science
HDFS
It is the distributed storage unit of Hadoop. It provides streaming access to file system data as well as file permissions and authentication.
It is based on GFS (Google File System).
It is used to scale a single cluster node to hundreds and thousands of nodes.
It handles large datasets running on commodity hardware.
HDFS is highly fault-tolerant. It stores files across multiple machines.
These files are stored in redundant fashion to allow for data recovery in case of
failure.
97. Department of CSE- Data Science
HBase
HBase stores data in HDFS.
It is the first non-batch component of the Hadoop Ecosystem.
It is a database on top of HDFS. It provides a quick random access to the stored data.
It has very low latency compared to HDFS.
It is a NoSQL database, is non-relational and is a column-oriented database.
A table can have thousands of columns.
A table can have multiple rows.
Each row can have several column families.
Each column family can have several columns.
Each column can have several key values. It is based on Google BigTable.
This is widely used by Facebook, Twitter, Yahoo, etc.
98. Department of CSE- Data Science
Difference between HBase and Hadoop/HDFS
1. HDFS is the file system, whereas HBase is a Hadoop database. It is like NTFS and MySQL.
2. HDFS is WORM (Write once and read multiple times or many times). Latest versions support
appending of data but this feature is rarely used. However, HBase supports real-time
random read and write
3. HDFS is based on Google File System (GFS) whereas HBase is based on Google Big Table.
4. HDFS supports only full table scans or partition table scans, whereas HBase supports random small-range scans or table scans.
5. Performance of Hive on HDFS is relatively very good, but for HBase it becomes 4-5 times slower.
6. The access to data is via MapReduce jobs only in HDFS, whereas in HBase the access is via Java APIs, REST, Avro, or Thrift APIs.
7. HDFS does not support dynamic storage owing to its rigid structure whereas HBase
supports dynamic storage.
8. HDFS has high latency operations whereas HBase has low latency operations.
9. HDFS is most suitable for batch analytics whereas HBase is for real-time analytics.
99. Department of CSE- Data Science
Hadoop Ecosystem Components for Data Ingestion
1. Sqoop: Sqoop stands for SQL to Hadoop. Its main functions are
a. Importing data from RDBMS such as MySQL, Oracle, DB2, etc. to Hadoop file system
(HDFS, HBase, Hive).
b. Exporting data from Hadoop File system (HDFS, HBase, Hive) to RDBMS (MySQL,
Oracle, DB2).
Uses of Sqoop
a. It has a connector-based architecture to allow plug-ins to connect to external systems such as MySQL, Oracle, DB2, etc.
b. It can provision the data from an external system on to HDFS and populate tables in Hive and HBase.
c. It integrates with Oozie, allowing you to schedule and automate import and export tasks.
2. Flume: Flume is an important log aggregator (aggregates logs from different machines
and places them in HDFS) component in the Hadoop ecosystem. Flume has been
developed by Cloudera. It is designed for high-volume ingestion of event-based data into Hadoop. The default destination in Flume (called a sink in Flume parlance) is HDFS. However, it can also write to HBase or Solr.
100. Department of CSE- Data Science
1. MapReduce:
It is a programming paradigm that allows distributed and parallel processing of
huge datasets.
It is based on Google MapReduce.
Google released a paper on MapReduce programming paradigm in 2004 and that
became the genesis of Hadoop processing model.
The MapReduce framework gets the input data from HDFS.
Hadoop Ecosystem Components for Data Processing
101. Department of CSE- Data Science
There are two main phases: Map phase and the Reduce phase.
The map phase converts the input data into another set of data (key-value pairs).
This new intermediate dataset then serves as the input to the reduce phase.
The reduce phase acts on the datasets to combine (aggregate and consolidate) and
reduce them to a smaller set of tuples.
The result is then stored back in HDFS.
102. Department of CSE- Data Science
2. Spark:
It is both a programming model as well as a computing model.
It is an open-source big data processing framework. It was originally developed in 2009 at UC Berkeley's AMPLab and became an open-source project in 2010.
It is written in Scala. It provides in-memory computing for Hadoop.
In Spark, workloads execute in memory rather than on disk, owing to which it is much faster (10 to 100 times) than when the workload is executed on disk.
If the datasets are too large to fit into the available system memory, it can perform conventional disk-based processing.
It serves as a potentially faster and more flexible alternative to MapReduce.
It accesses data from HDFS (Spark does not have its own distributed file system) but
bypasses the MapReduce processing.
103. Department of CSE- Data Science
Spark can be used with Hadoop coexisting smoothly with MapReduce (sitting on top
of Hadoop YARN) or used independently of Hadoop (standalone).
As a programming model, it works well with Scala, Python (it has API connectors for using it with Java or Python), or the R programming language.
The following are the Spark libraries:
a. Spark SQL: Spark also has support for SQL. Spark SQL uses SQL to help query data stored in disparate applications.
b. Spark Streaming: It helps to analyze and present data in real time.
c. MLlib: It supports machine learning, such as applying advanced statistical operations on data in a Spark cluster.
d. GraphX: It helps in graph-parallel computation.
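A short PySpark sketch of a word count using Spark's Python API; it assumes a local Spark installation, and the application name and the sample lines are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()

    lines = spark.sparkContext.parallelize(["big data is big", "data needs processing"])
    counts = (lines.flatMap(lambda line: line.lower().split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.collect())
    spark.stop()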
104. Department of CSE- Data Science
Spark and Hadoop are usually used together by several companies.
Hadoop was primarily designed to house unstructured data and run batch processing
operations on it.
Spark is used extensively for its high-speed in-memory computing and its ability to run advanced real-time analytics.
The two together have been giving very good results.
105. Department of CSE- Data Science
Hadoop Ecosystem Components for Data Analysis
1. Pig: It is a high-level scripting language used with Hadoop. It serves as an alternative
to MapReduce. It has two parts:
a. Pig Latin: It is a SQL-like scripting language. Pig Latin scripts are translated into MapReduce jobs which can then run on YARN and process data in the HDFS cluster. There is a "Load" command available to load the data from HDFS into Pig. One can then perform functions such as grouping, filtering, sorting, joining, etc. The processed or computed data can then be either displayed on screen or placed back into HDFS. Pig gives you a platform for building data flows for ETL (Extract, Transform and Load), and for processing and analyzing huge data sets.
b. Pig runtime: It is the runtime environment.
106. Department of CSE- Data Science
2. Hive: Hive is a data warehouse software project built on top of Hadoop. Three main
tasks performed by Hive are summarization, querying and analysis. It supports
queries written in a language called HQL or HiveQL which is a declarative SQL-like
language. It converts the SQL-style queries into MapReduce jobs which are then
executed on the Hadoop platform.
107. Department of CSE- Data Science
Difference between Hive and RDBMS
1. Hive enforces schema on read, whereas an RDBMS enforces schema on write.
In RDBMS, at the time of loading/inserting data, the table’s schema is enforced. If the
data being loaded does not conform to the schema then it is rejected. Thus, the
schema is enforced on write (loading the data into the database). Schema on write
takes longer to load the data into the database; however it makes up for it during
data retrieval with a good query time performance.
Hive does not enforce the schema when the data is being loaded into the data warehouse. It is enforced only when the data is being read/retrieved. This is called schema on read. It definitely makes for a fast initial load, as the data load or insertion operation is just a file copy or move (see the sketch after this list).
108. Department of CSE- Data Science
2. Hive is based on the notion of write once and read many times whereas the RDBMS is
designed for read and write many times.
3. Hadoop is a batch-oriented system. Hive, therefore, is not suitable for OLTP (Online
Transaction Processing) but, although not ideal, seems closer to OLAP (Online Analytical
Processing). The reason being that there is quite a latency between issuing a query and
receiving a reply as the query written in HiveQL will be converted to MapReduce jobs
which are then executed on the Hadoop cluster. RDBMS is suitable for housing day-to-day
transaction data and supports all OLTP operations with frequent insertions, modifications
(updates), deletions of the data.
4. Hive handles static data analysis which is non-real-time data. Hive is the data
warehouse of Hadoop. There are no frequent updates to the data and the query
response time is not fast. RDBMS is suited for handling dynamic data which is real
time.
109. Department of CSE- Data Science
5. Hive can be scaled easily and at a very low cost when compared to an RDBMS. Hive uses HDFS to store data, and thus it cannot be considered the owner of the data, while on the other hand an RDBMS is the owner of the data, responsible for storing, managing, and manipulating it in the database.
6. Hive uses the concept of parallel computing, whereas RDBMS uses serial computing.
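A rough Python sketch of the schema-on-write versus schema-on-read contrast from point 1 above; the file contents and the three-column schema are invented:

    import csv, io

    raw = "101,Asha,CS\n102,Ravi\n"        # the second record is missing a field

    # Schema on write (RDBMS style): validate while loading; bad records are rejected.
    loaded = []
    for row in csv.reader(io.StringIO(raw)):
        if len(row) == 3:
            loaded.append({"id": int(row[0]), "name": row[1], "dept": row[2]})
        else:
            print("rejected at load time:", row)

    # Schema on read (Hive style): store the raw file as-is (a fast file copy/move)
    # and apply the schema only when the data is queried.
    def query_names(raw_text):
        for row in csv.reader(io.StringIO(raw_text)):
            yield row[1] if len(row) > 1 else None

    print(loaded, list(query_names(raw)))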
112. Department of CSE- Data Science
Difference between Hive and HBase
1. Hive is a MapReduce-based SQL engine that runs on top of Hadoop. HBase is a key-value NoSQL database that runs on top of HDFS.
2. Hive is for batch processing of big data. HBase is for real-time data streaming.
Impala
It is a high-performance SQL engine that runs on a Hadoop cluster. It is ideal for interactive analysis. It has very low latency, measured in milliseconds. It supports a dialect of SQL called Impala SQL.
ZooKeeper
It is a coordination service for distributed applications.
Oozie
It is a workflow scheduler system to manage Apache Hadoop jobs.
113. Department of CSE- Data Science
Mahout
It is a scalable machine learning and data mining library.
Chukwa
It is a data collection system for managing large distributed systems.
Ambari
It is a web-based tool for provisioning, managing, and monitoring Apache Hadoop
clusters.
114. Department of CSE- Data Science
Hadoop Distributions
Hadoop is an open-source Apache project.
Anyone can freely download the core aspects of Hadoop.
The core aspects of Hadoop include the following:
1.Hadoop Common
2.Hadoop Distributed File System (HDFS)
3.Hadoop YARN (Yet Another Resource
Negotiator)
4. Hadoop MapReduce
116. Department of CSE- Data Science
Integrated Hadoop Systems Offered by Leading Market Vendors
117. Department of CSE- Data Science
Cloud-Based Hadoop Solutions
Amazon Web Services offers a comprehensive, end-to-end portfolio of cloud computing services to help manage big data, with an emphasis on reducing costs, scaling to meet demand, and accelerating the speed of innovation.
The Google Cloud Storage connector for Hadoop empowers one to perform MapReduce
jobs directly on data in Google Cloud Storage, without the need to copy it to local disk and run it in the Hadoop Distributed File System (HDFS). The connector simplifies Hadoop
deployment, and at the same time reduces cost and provides performance comparable to
HDFS, all this while increasing reliability by eliminating the single point of failure of the
name node.