Department of CSE- Data Science
Module-1
Introduction to Big Data, Big Data Analytics
Department of CSE- Data Science
Contents
 Classification of data
 Characteristics
 Evolution and definition of Big data
 What is Big data
 Why Big data
 Traditional Business Intelligence Vs Big Data
 Typical data warehouse and Hadoop environment
 Big Data Analytics: What is Big data Analytics
 Classification of Analytics
 Importance of Big Data Analytics
 Technologies used in Big data Environments
 Few Top Analytical Tools, NoSQL, Hadoop.
Department of CSE- Data Science
Introduction
 Data is present internal to the enterprise and also exists outside the four walls and
firewalls of the enterprise.
 Data is present in homogeneous sources as well as in heterogeneous sources.
Data → Information
Information → Insights
Department of CSE- Data Science
Classification of Digital data
Department of CSE- Data Science
Structured data
 Data which is in an organized form (e.g., rows and columns) and can be
easily used by a computer program.
 Relationships exist between entities of data, such as classes and their
objects.
 Data stored in databases is an example of structured data.
Department of CSE- Data Science
Semi-structured data
 Data which does not conform to a data model but has some structure.
 It is not in a form which can be used easily by a computer program.
 For example, XML and other markup languages like HTML.
Department of CSE- Data Science
Unstructured data
 Data which does not conform to a data model or is not in a form which can be
used easily by a computer program.
 About 80%-90% of an organization's data is in this format.
 For example, memos, chat rooms, PowerPoint presentations, images, videos,
letters, etc.
Department of CSE- Data Science
Structured Data
 Most of the structured data is held in RDBMS.
 An RDBMS conforms to the relational data model wherein the data is stored in
rows/columns.
 The number of rows/records/tuples in a relation is called the cardinality of a
relation and the number of columns is referred to as the degree of a relation.
 The first step is the design of a relation/table, the fields/columns to store the data,
the type of data that will be stored [number (integer or real), alphabets, date,
Boolean, etc.].
Department of CSE- Data Science
 Next we think of the constraints that we would like our data to conform to
(constraints such as UNIQUE values in the column, NOT NULL values in the
column, a business constraint such as the value held in the column should not
drop below 50, the set of permissible values in the column such as the column
should accept only “CS”, “IS”, “MS”, etc., as input).
 Example: Let us design a table/relation structure to store the details of the
employees of an enterprise.
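To make the idea concrete, here is a minimal sketch using Python's built-in sqlite3 module; the column names, the department codes ("CS", "IS", "MS"), and the salary rule are illustrative assumptions based on the constraints listed above, not the exact table from the slides.

```python
# A minimal sketch of the Employee relation with the kinds of constraints described above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employee (
        EmpNo   INTEGER PRIMARY KEY,                           -- unique identifier
        EmpName TEXT    NOT NULL,                              -- NOT NULL constraint
        DeptNo  TEXT    CHECK (DeptNo IN ('CS', 'IS', 'MS')),  -- permissible values
        Salary  REAL    CHECK (Salary >= 50)                   -- business constraint
    )
""")
conn.execute("INSERT INTO Employee VALUES (1, 'Asha', 'CS', 55000.0)")

# Violating a constraint is rejected by the database:
try:
    conn.execute("INSERT INTO Employee VALUES (2, 'Ravi', 'EE', 60000.0)")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)
```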
Department of CSE- Data Science
 The tables in an RDBMS can also be related. For example, the above “Employee”
table is related to the “Department” table on the basis of the common column,
“DeptNo”.
Fig: Relationship between “Employee” and “Department” tables
Department of CSE- Data Science
Sources of Structured Data
 Databases such as Oracle, IBM DB2, Microsoft SQL Server, EMC Greenplum,
Teradata, MySQL (open source), PostgreSQL (advanced open source),
etc. are used to hold transaction/operational data generated and collected by
day-to-day business activities.
 The data of the On-Line Transaction Processing (OLTP) systems are generally quite
structured.
Department of CSE- Data Science
Ease of Working with Structured Data
1. Insert/update/delete: The Data
Manipulation Language (DML) operations
provide the required ease with data
input, storage, access, process, analysis,
etc.
2. Security: There are available staunch
encryption and tokenization solutions to
warrant the security of information
throughout its lifecycle. Organizations are
able to retain control and maintain
compliance adherence by ensuring that
only authorized individuals are able to
decrypt and view sensitive information.
Department of CSE- Data Science
3. Indexing: An index is a data structure that speeds up the data retrieval operations (primarily
the SELECT DML statement) at the cost of additional writes and storage space, but the
benefits that ensue in search operation are worth the additional writes and storage space.
4. Scalability: The storage and processing capabilities of the traditional RDBMS can be easily
scaled up by increasing the horsepower of the database server (increasing the primary and
secondary or peripheral storage capacity, processing capacity of the processor, etc.).
5. Transaction processing: RDBMS has support for Atomicity, Consistency, Isolation, and
Durability (ACID) properties of transaction. Given next is a quick explanation of the ACID
properties:
 Atomicity: A transaction is atomic, meaning that either it happens in its entirety or not
at all.
 Consistency: The database moves from one consistent state to another consistent state. In
other words, if the same piece of information is stored at two or more places, they are in
complete agreement.
 Isolation: The resource allocation to the transaction happens such that the transaction gets
the impression that it is the only transaction happening in isolation.
 Durability: All changes made to the database during a transaction are permanent and that
accounts for the durability of the transaction.
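A toy sketch (not from the slides) of how atomicity plays out, using Python's sqlite3 module; the account table and the transfer rule are made up for illustration.

```python
# Either the whole transfer happens, or none of it does.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account (AcctNo INTEGER PRIMARY KEY, Balance REAL NOT NULL)")
conn.executemany("INSERT INTO Account VALUES (?, ?)", [(1, 100.0), (2, 100.0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE Account SET Balance = Balance - 150 WHERE AcctNo = 1")
        (bal,) = conn.execute("SELECT Balance FROM Account WHERE AcctNo = 1").fetchone()
        if bal < 0:
            raise ValueError("insufficient funds")  # business rule fails mid-transaction
        conn.execute("UPDATE Account SET Balance = Balance + 150 WHERE AcctNo = 2")
except ValueError:
    pass

# Both balances are still 100.0: the partial debit was rolled back (atomicity),
# so the database stays in a consistent state.
print(conn.execute("SELECT * FROM Account").fetchall())
```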
Department of CSE- Data Science
Semi-structured Data
 Semi-structured data is also referred to as having a self-describing structure.
 Features
1. It does not conform to the data models that
one typically associates with relational
databases or any other form of data tables.
2. It uses tags to segregate semantic elements.
3. Tags are also used to enforce hierarchies of
records and fields within data. There is no
separation between the data and the schema.
The amount of structure used is dictated by
the purpose at hand.
4. In semi-structured data, entities belonging to
the same class and grouped together
need not necessarily have the same set of
attributes. And even if they have the same set
of attributes, the order of the attributes may
differ, which for all practical purposes is
not important.
Department of CSE- Data Science
Sources of Semi-Structured Data
1. XML: eXtensible Markup Language
(XML) is hugely popularized by web
services developed utilizing the
Simple Object Access Protocol (SOAP)
principles.
2. JSON: JavaScript Object Notation
(JSON) is used to transmit data
between a server and a web
application. JSON is popularized by
web services developed utilizing the
Representational State Transfer
(REST) principles, an architecture
style for creating scalable web
services. MongoDB (open-source,
distributed, NoSQL, document-
oriented database) and Couchbase
(originally known as Membase;
open-source, distributed, NoSQL,
document-oriented database) store
data in the JSON format.
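A small sketch (with made-up records) showing why JSON is called self-describing: the keys carry the semantics, and two records of the same class need not share the same set of attributes.

```python
import json

students = [
    '{"name": "Meera", "course": "Data Science", "email": "meera@example.com"}',
    '{"name": "Arun", "course": "Data Science", "phone": "98450-00000", "hosteller": true}',
]

for doc in students:
    record = json.loads(doc)       # parse the self-describing document
    print(sorted(record.keys()))   # the attribute sets differ from record to record
```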
Department of CSE- Data Science
Unstructured Data
 Unstructured data does not conform to any pre-defined data model.
 The structure is quite unpredictable.
Table :Few examples of disparate unstructured data
Department of CSE- Data Science
Sources of Unstructured Data
Department of CSE- Data Science
Issues with Unstructured Data
 Although unstructured data is known NOT to conform to a pre-defined data model or be
organized in a pre-defined manner, there are instances wherein the structure
of the data can still be implied.
Department of CSE- Data Science
Dealing with Unstructured Data
 Today, unstructured data constitutes approximately 80% of the data that is
being generated in any enterprise.
 The balance is clearly shifting in favor of unstructured data, as shown below.
It is such a big percentage that it cannot be ignored.
Figure : Unstructured data clearly constitutes a major percentage of
enterprise data.
Department of CSE- Data Science
The following techniques are used to find patterns in or interpret unstructured data:
1. Data mining: First, we deal with large data sets. Second, we use methods at the intersection
of artificial intelligence, machine learning, statistics, and database systems to unearth
consistent patterns in large data sets and/or systematic relationships between variables. It is
the analysis step of the “knowledge discovery in databases” process. Popular algorithms are
as follows:
i. Association rule mining: It is also called “market basket analysis” or “affinity analysis”. It is
used to determine “what goes with what?”, that is, when you buy a product, which other
product are you likely to purchase with it. For example, if you pick up bread from the
grocery, are you likely to pick up eggs or cheese to go with it?
Figure : Dealing with unstructured data
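A toy market-basket sketch (the transactions are made up) estimating the rule "bread → eggs" by its support and confidence, the two measures commonly used to score such rules.

```python
baskets = [
    {"bread", "eggs", "milk"},
    {"bread", "cheese"},
    {"bread", "eggs"},
    {"milk", "cheese"},
]

n = len(baskets)
bread = sum(1 for b in baskets if "bread" in b)
bread_and_eggs = sum(1 for b in baskets if {"bread", "eggs"} <= b)

support = bread_and_eggs / n          # how often bread and eggs appear together
confidence = bread_and_eggs / bread   # given bread, how often eggs follow
print(f"support={support:.2f}, confidence={confidence:.2f}")  # 0.50, 0.67
```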
Department of CSE- Data Science
ii. Regression analysis: It helps to predict the relationship between two variables. The variable
whose value needs to be predicted is called the dependent variable and the variables which
are used to predict the value are referred to as the independent variables.
iii. Collaborative filtering: It is about predicting a user's preference or preferences based on the
preferences of a group of users.
Table:
 We are looking at predicting whether User 4 will prefer to learn using videos or is a textual
learner depending on one or a couple of his or her known preferences.
 We analyze the preferences of similar user profiles and on the basis of it, predict that User 4
will also like to learn using videos and is not a textual learner.
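A toy sketch of the idea above (the preference data and the similarity measure are assumptions, not the table from the slides): User 4's unknown preference is predicted from the user whose known preferences overlap the most.

```python
prefs = {
    "User 1": {"video": 1, "text": 0, "quiz": 1},
    "User 2": {"video": 1, "text": 0, "quiz": 0},
    "User 3": {"video": 0, "text": 1, "quiz": 0},
    "User 4": {"quiz": 1},            # only one preference is known
}

def similarity(a, b):
    """Count of items on which two users agree (a deliberately crude measure)."""
    common = set(a) & set(b)
    return sum(a[i] == b[i] for i in common)

target = prefs["User 4"]
neighbours = sorted((u for u in prefs if u != "User 4"),
                    key=lambda u: similarity(prefs[u], target), reverse=True)
best = prefs[neighbours[0]]
print("Predicted: video learner" if best["video"] else "Predicted: textual learner")
```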
Department of CSE- Data Science
2. Text analytics or text mining: Compared to the structured data stored in relational
databases, text is largely unstructured, amorphous, and difficult to deal with
algorithmically. Text mining is the process of gleaning high quality and meaningful
information (through devising of patterns and trends by means of statistical pattern
learning) from text. It includes tasks such as text categorization, text clustering,
sentiment analysis, concept/entity extraction, etc.
3. Natural language processing (NLP): It is related to the area of human computer
interaction. It is about enabling computers to understand human or natural
language input.
4. Noisy text analytics: It is the process of extracting structured or semi-structured
information from noisy unstructured data such as chats, blogs, wikis, emails,
message-boards, text messages, etc. The noisy unstructured data usually comprises
one or more of the following: spelling mistakes, abbreviations, acronyms, non-
standard words, missing punctuation, missing letter case, filler words such as “uh”,
“um”, etc.
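A minimal sketch of cleaning such noisy text (the abbreviation list and the sample message are made up): lower-casing, expanding abbreviations, and dropping filler words.

```python
import re

ABBREVIATIONS = {"u": "you", "r": "are", "pls": "please"}
FILLERS = {"uh", "um"}

def clean(message: str) -> str:
    tokens = re.findall(r"[a-zA-Z']+", message.lower())   # strip punctuation, fix case
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]    # expand abbreviations
    tokens = [t for t in tokens if t not in FILLERS]      # drop filler words
    return " ".join(tokens)

print(clean("um pls r u done with the report?"))  # -> "please are you done with the report"
```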
Department of CSE- Data Science
5. Manual tagging with metadata: This is about tagging manually with adequate metadata
to provide the requisite semantics to understand unstructured data.
6. Part-of-speech tagging: It is also called POS or POST or grammatical tagging. It is the
process of reading text and tagging each word in the sentence as belonging to a particular
part of speech such as “noun”, “verb”, “adjective”, etc.
7. Unstructured Information Management Architecture (UIMA): It is an open source
platform from IBM. It is used for real-time content analytics. It is about processing text
and other unstructured data to find latent meaning and relevant relationship buried
therein.
Department of CSE- Data Science
Comparison of structured, semi-structured, and unstructured data
Technology
- Structured data: based on relational database tables.
- Semi-structured data: based on XML/RDF (Resource Description Framework).
- Unstructured data: based on character and binary data.
Transaction management
- Structured data: matured transactions and various concurrency techniques.
- Semi-structured data: transactions adapted from the DBMS; not matured.
- Unstructured data: no transaction management and no concurrency.
Version management
- Structured data: versioning over tuples, rows, and tables.
- Semi-structured data: versioning over tuples or graphs is possible.
- Unstructured data: versioned as a whole.
Flexibility
- Structured data: schema dependent and less flexible.
- Semi-structured data: more flexible than structured data but less flexible than unstructured data.
- Unstructured data: more flexible; absence of schema.
Scalability
- Structured data: very difficult to scale the DB schema.
- Semi-structured data: scaling is simpler than for structured data.
- Unstructured data: more scalable.
Robustness
- Structured data: very robust.
- Semi-structured data: new technology, not very widespread.
- Unstructured data: —
Query performance
- Structured data: structured queries allow complex joins.
- Semi-structured data: queries over anonymous nodes are possible.
- Unstructured data: only textual queries are possible.
Department of CSE- Data Science
Classroom Exercise
Department of CSE- Data Science
Characteristics of Data
Data has three characteristics:
1. Composition: It deals with the structure of the data, that is, the
sources of data, the granularity, the types, and the nature
of the data as to whether it is static or real-time
streaming.
2. Condition: The condition of data deals with the state of
the data that is “can one use this data as is for analysis?”
or “Does it require cleansing for further enhancement
and enrichment?”
3. Context: deals with “Where has this data been
generated?”, “Why was this data generated?” and so on.
Figure: Characteristics of data
Department of CSE- Data Science
EVOLUTION OF BIG DATA
 1970s and before was the era of mainframes. The data was essentially primitive and
structured.
 Relational databases evolved in the 1980s and 1990s. The era was of data-intensive applications.
 The World Wide Web (WWW) and the Internet of Things (IoT) have led to an onslaught of
structured, unstructured, and multimedia data.
Table : The evolution of big data
Department of CSE- Data Science
Definition of Big Data
Figure : Definition of big data.
 Anything beyond the human and technical
infrastructure needed to support storage,
processing, and analysis.
 Terabytes or petabytes or zettabytes of data.
 I think it is about 3 Vs.
Department of CSE- Data Science
Definition of Big Data
Department of CSE- Data Science
Challenges With Big Data
1. Data today is growing at an exponential rate.
This high tide of data will continue to rise
incessantly. The key questions here are: “Will
all this data be useful for analysis?”, “Do we
work with all this data or a subset of it?”, “How
will we separate the knowledge from the
noise?”, etc.
2. Cloud computing and virtualization are here to
stay. Cloud computing is the answer to
managing infrastructure for big data as far as
cost-efficiency, elasticity, and easy
upgrading/downgrading is concerned. This
further complicates the decision to host big
data solutions outside the enterprise.
Department of CSE- Data Science
3. The other challenge is to decide on the period of retention of big data. Just how long
should one retain this data? Some data is useful for making long-term decisions,
whereas in a few cases, the data may quickly become irrelevant and obsolete just a few
hours after having been generated.
4. There is a dearth of skilled professionals who possess a high level of proficiency in data
sciences that is vital in implementing big data solutions.
5. Then, of course, there are other challenges with respect to capture, storage,
preparation, search, analysis, transfer, security, and visualization of big data. There is
no explicit definition of how big the dataset should be for it to be considered “big
data.” Here we are to deal with data that is just too big, moves way to fast, and does
not fit the structures of typical database systems. The data changes are highly dynamic
and therefore there is a need to ingest this as quickly as possible.
6. Data visualization is becoming popular as a separate discipline. We are short of skilled
business visualization experts.
Department of CSE- Data Science
WHAT IS BIG DATA?
 Big data is data that is big in volume, velocity, and variety.
Fig: Data: Big in volume, variety, and velocity. Fig: Growth of data
Department of CSE- Data Science
Volume
 We have seen it grow from bits to bytes to petabytes and
exabytes.
 Where Does This Data get Generated?
→ There are a multitude of sources for big data.
→ An XLS, a DOC, a PDF, etc. is unstructured data
→ a video on YouTube, a chat conversation on Internet
Messenger, a customer feedback form on an online
retail website, a CCTV coverage, a weather forecast
report is unstructured data too.
Fig: A mountain of data.
Department of CSE- Data Science
Figure: Sources of big data.
 Typical internal data sources: Data present within an organization’s firewall. It is as
follows:
→ Data storage: File systems, SQL (RDBMSs — Oracle, MS SQL Server, DB2, MySQL,
PostgreSQL, etc.), NoSQL (MongoDB, Cassandra, etc.), and so on.
→ Archives: Archives of scanned documents, paper archives, customer
correspondence records, patients’ health records, students’ admission records,
students’ assessment records, and so on.
Department of CSE- Data Science
 External data sources: Data residing outside an organization’s firewall. It is as follows:
→ Public Web: Wikipedia, weather, regulatory, compliance, census, etc.
 Both (internal+external)
→ Sensor data – Car sensors, smart electric meters, office buildings, air conditioning units,
refrigerators, and so on.
→ Machine log data – Event logs, application logs, Business process logs, audit logs,
clickstream data, etc.
→ Social media – Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
→ Business apps – ERP,CRM, HR, Google Docs, and so on.
→ Media – Audio, Video, Image, Podcast, etc.
→ Docs – CSV, Word, PDF, XLS, PPT, and so on.
Department of CSE- Data Science
Velocity
 We have moved from the days of batch processing to real-time processing.
Variety
 Variety deals with a wide range of data types and sources of data.
1. Structured data: From traditional transaction processing systems and RDBMS, etc.
2. Semi-structured data: For example Hyper Text Markup Language (HTML),
eXtensible Markup Language (XML).
3. Unstructured data: For example unstructured text documents, audios, videos,
emails, photos, PDFs, social media, etc.
Batch → Periodic → Near real-time → Real-time processing
Department of CSE- Data Science
Why Big Data?
Department of CSE- Data Science
Traditional Business Intelligence (BI) Versus Big Data
1. Business Intelligence: All the enterprise's data is housed in a central server. Big Data: In a big
data environment, data resides in a distributed file system.
2. Business Intelligence: Scales vertically. Big Data: Scales in or out horizontally.
3. Business Intelligence: Traditional BI is about structured data, and it is here that data is taken
to the processing functions. Big Data: Big data is about variety, and here the processing
functions are taken to the data.
Department of CSE- Data Science
A Typical Data Warehouse Environment
 Operational or transactional or day-to-day
business data is gathered from Enterprise
Resource Planning (ERP) systems,
Customer Relationship Management
(CRM), legacy systems, and several third
party applications.
 The data from these sources may differ in format
 Data may come from data sources located in the same geography or different geographies.
 This data is then integrated, cleaned up, transformed, and standardized through the process
of Extraction, Transformation, and Loading (ETL).
 The transformed data is then loaded into the enterprise data warehouse or to data marts.
 Business intelligence and analytics tools are then used to enable decision making
Fig: A typical data warehouse environment.
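A highly simplified ETL sketch of the flow just described; the "source systems", field names, and date formats are assumptions made purely for illustration.

```python
# Extract operational records, transform/standardize them, and load them into a warehouse table.
import csv, io, sqlite3

# Extract: two "source systems" with differing date formats (here, in-memory CSV text).
erp_csv = "order_id,order_date,amount\n101,2024/01/05,250\n"
crm_csv = "order_id,order_date,amount\n102,05-01-2024,120\n"

def extract(text):
    return list(csv.DictReader(io.StringIO(text)))

def transform(row):
    # Standardize the date to ISO format and the amount to a float.
    date = row["order_date"].replace("/", "-")
    if date[2] == "-":                      # DD-MM-YYYY -> YYYY-MM-DD
        d, m, y = date.split("-")
        date = f"{y}-{m}-{d}"
    return (int(row["order_id"]), date, float(row["amount"]))

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (order_id INTEGER, order_date TEXT, amount REAL)")

# Load the cleaned, standardized rows from both sources into one warehouse table.
for source in (erp_csv, crm_csv):
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)",
                          [transform(r) for r in extract(source)])
print(warehouse.execute("SELECT * FROM fact_orders").fetchall())
```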
Department of CSE- Data Science
A Typical Hadoop Environment
 The data sources are quite
disparate from web logs to
images, audios and videos to
social media data to the
various docs, pdfs, etc
 Here the data in focus is not just the data within the company's firewall but also data
residing outside the company's firewall. This data is placed in Hadoop Distributed File
System (HDFS).
 If need be, this can be repopulated back to operational systems or fed to the enterprise data
warehouse or data marts or an Operational Data Store (ODS) to be picked up for further
processing and analysis.
Fig: A typical Hadoop environment
Department of CSE- Data Science
WHAT IS BIG DATA ANALYTICS?
Big Data Analytics is
1. Technology-enabled analytics: Quite a few data analytics and visualization tools are
available in the market today from leading vendors such as IBM, Tableau, SAS, R Analytics,
Statistica, World Programming Systems (WPS), etc. to help process and analyze your big
data.
2. About gaining a meaningful, deeper, and richer insight into your business to steer in the
right direction, understanding the customer’s demographics to cross-sell and up-sell to
them, better leveraging the services of your vendors and suppliers, etc.
Department of CSE- Data Science
3. About a competitive edge over your competitors by enabling you with findings that allow
quicker and better decision-making.
4. A tight handshake between three communities: IT, business users, and data scientists.
5. Working with datasets whose volume and variety exceed the current storage and
processing capabilities and infrastructure of your enterprise.
6. About moving code to data. This makes perfect sense as the program for distributed
processing is tiny (just a few KBs) compared to the data (Terabytes or Petabytes today and
likely to be Exabytes or Zettabytes in the near future).
Department of CSE- Data Science
Classification Of Analytics
 There are basically two schools of thought:
1. Those that classify analytics into basic, operationalized, advanced, and monetized.
2. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.
First School of Thought
1. Basic analytics: This primarily is slicing and dicing of data to help with basic business
insights. This is about reporting on historical data, basic visualization, etc.
2. Operationalized analytics: It is operationalized analytics if it gets woven into the
enterprise’s business processes.
3. Advanced analytics: This largely is about forecasting for the future by way of predictive
and prescriptive modeling.
4. Monetized analytics: This is analytics in use to derive direct business revenue.
Department of CSE- Data Science
Second School of Thought
• Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0.
Table : Analytics 1.0, 2.0, and 3.0
Department of CSE- Data Science
Figure : Analytics 1.0, 2.0, and 3.0.
Department of CSE- Data Science
Importance of Big Data Analytics
Let us study the various approaches to analysis of data and what it leads to.
1. Reactive — Business Intelligence: What does Business Intelligence (BI) help us with? It
allows the businesses to make faster and better decisions by providing the right
information to the right person at the right time in the right format. It is about analysis
of the past or historical data and then displaying the findings of the analysis or reports
in the form of enterprise dashboards, alerts, notifications, etc. It has support for both
pre-specified reports as well as ad hoc querying.
2. Reactive — Big Data Analytics: Here the analysis is done on huge datasets but the
approach is still reactive as it is still based on static data.
Department of CSE- Data Science
3. Proactive — Analytics: This is to support futuristic decision making by the use of data
mining, predictive modeling, text mining, and statistical analysis. This analysis is not on
big data as it still uses traditional database management practices and
therefore has severe limitations on the storage capacity and the processing capability.
4. Proactive - Big Data Analytics: This is sieving through terabytes, petabytes, exabytes of
information to filter out the relevant data to analyze. This also includes high performance
analytics to gain rapid insights from big data and the ability to solve complex problems
using more data.
Department of CSE- Data Science
Terminologies used in Big data Environments
In-Memory Analytics
 Data access from non-volatile storage such as hard disk is a slow process. The more the
data is required to be fetched from hard disk or secondary storage, the slower the process
gets. One way to combat this challenge is to pre-process and store data (cubes, aggregate
tables, query sets, etc.) so that the CPU has to fetch a small subset of records. But this
requires thinking in advance as to what data will be required for analysis.
 If there is a need for different or more data, it is back to the initial process of pre-
computing and storing data or fetching it from secondary storage. This problem has been
addressed using in-memory analytics. Here all the relevant data is stored in Random Access
Memory (RAM) or primary storage thus eliminating the need to access the data from hard
disk. The advantage is faster access, rapid deployment, better insights, and minimal IT
involvement.
Department of CSE- Data Science
In-Database Processing
 In-database processing is also called in-database analytics. It works by fusing
data warehouses with analytical systems.
 Typically the data from various enterprise On Line Transaction Processing (OLTP)
systems after cleaning up (de-duplication, scrubbing, etc.) through the process of
ETL is stored in the Enterprise Data Warehouse (EDW) or data marts.
 The huge datasets are then exported to analytical programs for complex and
extensive computations.
 With in-database processing, the database program itself can run the
computations eliminating the need for export and thereby saving on time.
Leading database vendors are offering this feature to large businesses.
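A toy illustration of the two ideas above, "keep the data in memory" and "run the computation where the data lives", using Python's sqlite3 with an in-memory database; the sales table is made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")   # the entire database is held in RAM (in-memory)
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("South", 120.0), ("South", 80.0), ("North", 200.0)])

# In-database processing: the aggregation runs inside the database engine,
# so only the small result set leaves it, not the raw rows.
for region, total in db.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)
```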
Department of CSE- Data Science
Symmetric Multiprocessor System (SMP)
• In SMP there is a single common main memory that is shared by two or more
identical processors.
• The processors have full access to all I/O devices and are controlled by a single
operating system instance.
• SMP systems are tightly coupled multiprocessor systems. Each processor has its own high-
speed cache memory, and the processors are connected using a system bus.
Figure : Symmetric Multiprocessor
System.
Department of CSE- Data Science
Massively Parallel Processing
 Massively Parallel Processing (MPP) refers to the coordinated processing of
programs by a number of processors working in parallel.
 The processors, each have their own operating systems and dedicated memory.
They work on different parts of the same program.
 The MPP processors communicate using some sort of messaging interface. The
MPP systems are more difficult to program as the application must be divided in
such a way that all the executing segments can communicate with each other.
 MPP is different from Symmetric Multiprocessing (SMP) in that SMP works
with the processors sharing the same operating system and same memory. SMP is
also referred to as tightly-coupled multiprocessing.
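A loose, single-machine analogy (not an actual MPP system) of the idea that several workers, each with its own memory space, process different parts of the same program and the results are combined at the end; it uses Python's multiprocessing module.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker process computes its share independently, in its own memory space.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]        # divide the work four ways
    with Pool(processes=4) as pool:
        print(sum(pool.map(partial_sum, chunks)))  # 499999500000
```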
Department of CSE- Data Science
Difference Between Parallel and Distributed Systems
Parallel Systems
 A parallel database system is a tightly coupled system. The processors co-operate
for query processing.
Figure : Parallel
system
Department of CSE- Data Science
 The user is unaware of the parallelism since he/she has no access to a specific
processor of the system.
 Either the processors have access to a common memory or make use of message
passing for communication.
Figure : Parallel system.
Department of CSE- Data Science
Distributed database systems
 Distributed database systems are known to be loosely coupled and are
composed of individual machines.
 Each of the machines can run their individual application and serve their own
respective user. The data is usually distributed across several machines,
thereby necessitating quite a number of machines to be accessed to answer a
user query.
Figure : Distributed system.
Department of CSE- Data Science
Shared Nothing Architecture
 The three most common types of architecture for multiprocessor high
transaction rate systems are:
1. Shared Memory (SM)
2. Shared Disk (SD).
3. Shared Nothing (SN).
 In shared memory architecture, a common central memory is shared by multiple
processors.
 In shared disk architecture, multiple processors share a common collection of disks
while having their own private memory
 In shared nothing architecture, neither memory nor disk is shared among multiple
processors.
Department of CSE- Data Science
Advantages of a “Shared Nothing Architecture”
1. Fault Isolation: A “Shared Nothing Architecture” provides the benefit of isolating
faults. A fault in a single node is contained and confined to that node exclusively and
exposed only through messages (or the lack of them).
2. Scalability: Assume that the disk is a shared resource. It implies that the controller
and the disk bandwidth are also shared. Synchronization will have to be
implemented to maintain a consistent shared state. This would mean that different
nodes will have to take turns to access the critical data. This imposes a limit on
how many nodes can be added to the distributed shared disk system, thus
compromising on scalability.
Department of CSE- Data Science
CAP Theorem Explained
 The CAP theorem is also called the Brewer’s Theorem.
 It states that in a distributed computing environment, it is impossible to simultaneously provide
all three of the following guarantees:
1. Consistency
2. Availability
3. Partition tolerance
 Consistency implies that every read fetches the last write.
 Availability implies that reads and writes always succeed. Each non-failing node will
return a response in a reasonable amount of time.
 Partition tolerance implies that the system will continue to function when network
partition occurs.
Figure : Brewer's
CAP
Department of CSE- Data Science
Examples of databases that follow one of the possible three combinations
1.Availability and Partition Tolerance (AP)
2.Consistency and Partition Tolerance (CP)
3.Consistency and Availability (CA)
Figure : Databases and
CAP
Department of CSE- Data Science
Classroom Activity
Puzzle on CAP Theorem
Department of CSE- Data Science
Puzzle on architecture
Department of CSE- Data Science
Solutions
Puzzle-1
Puzzle-2
Department of CSE- Data Science
NoSQL (NOT ONLY SQL)
 The term NoSQL was first coined by Carlo Strozzi in 1998 to name his
lightweight, open-source, relational database that did not expose the standard
SQL interface.
 Few features of NoSQL databases are as follows:
1. They are open source.
2. They are non-relational.
3. They are distributed.
4. They are schema-less.
5. They are cluster friendly.
6. They are born out of 21st
century web applications.
Department of CSE- Data Science
Where is it Used?
 NoSQL databases are widely used in big data and other real-time web applications.
 NoSQL databases are used to store log data, which can then be pulled for analysis.
 They are used to store social media data and all such data that cannot be stored and
analyzed comfortably in an RDBMS.
Figure : Where to use NoSQL?
Department of CSE- Data Science
What is it?
 NoSQL stands for Not Only SQL.
 These are non-relational, open source, distributed databases.
 They are hugely popular today owing to their ability to scale out or scale
horizontally and their adeptness at dealing with a rich variety of data: structured,
semi-structured, and unstructured data.
Figure: What is NoSQL?
Department of CSE- Data Science
1. Are non-relational: They do not adhere to the relational data model. In fact, they are
either key-value pairs or document-oriented or column-oriented or graph-based
databases.
2. Are distributed: They are distributed meaning the data is distributed across
several nodes in a cluster constituted of low-cost commodity hardware.
3. Offer no support for ACID properties (Atomicity, Consistency, Isolation, and
Durability): They do not offer support for ACID properties of transactions. On the
contrary, they have adherence to Brewer’s CAP (Consistency, Availability, and
Partition tolerance) theorem and are often seen compromising on consistency in
favor of availability and partition tolerance.
4. Provide no fixed table schema: NoSQL databases are becoming increasingly
popular owing to their support for flexibility of schema. They do not mandate
that the data strictly adhere to any schema structure at the time of storage.
Department of CSE- Data Science
Types of NoSQL Databases
1. Key-value or the big hash table.
2. Schema-less, which comes in a variety of forms (e.g., document, column-oriented, and graph databases).
Figure : Types of NoSQL databases
Department of CSE- Data Science
1. Key-value
 It maintains a big hash table of keys and values.
 For example, Dynamo, Redis, Riak, etc. Sample Key-Value Pair in Key-Value Database
2. Document
 It maintains data in collections constituted of documents.
 For example, MongoDB, Apache CouchDB, Couchbase, MarkLogic, etc.
Department of CSE- Data Science
3. Column
 Each storage block has data from only one column.
 For example: Cassandra, HBase, etc,
4. Graph:
 They are also called network databases. A graph database stores data in nodes.
 For example, Neo4j, HyperGraphDB, etc.
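A short sketch of a document store in action, assuming a MongoDB server is running locally and the pymongo driver is installed; the database and collection names are made up. Note how documents in the same collection carry different sets of key-value pairs.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
books = client["library"]["books"]

# Documents in one collection need not share the same attributes (no fixed schema).
books.insert_many([
    {"title": "Big Data Primer", "year": 2015, "tags": ["hadoop", "nosql"]},
    {"title": "Analytics Basics", "author": "A. Kumar"},   # no year, no tags
])

print(books.find_one({"title": "Big Data Primer"}))
```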
Department of CSE- Data Science
Why NoSQL?
1. It has scale out architecture instead of the monolithic architecture of relational databases.
2. It can house large volumes of structured, semi-structured, and unstructured data.
3. Dynamic schema: NoSQL database allows insertion of data without a pre-defined schema.
In other words, it facilitates application changes in real time, which thus supports faster
development, easy code integration, and requires less database administration.
4. Auto-sharding: It automatically spreads data across an arbitrary number of servers. The
application in question is, more often than not, unaware of the composition of the server pool.
It balances the load of data and queries across the available servers; and if and when a server
goes down, it is quickly replaced without any major activity disruptions (a minimal sketch of
hash-based sharding follows this list).
5. Replication: It offers good support for replication which in turn guarantees high availability,
fault tolerance, and disaster recovery.
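A minimal sketch of hash-based sharding (not any particular database's algorithm): each record's key decides which server in the pool stores it, so the data spreads across servers without the application tracking placement.

```python
servers = ["server-0", "server-1", "server-2"]

def shard_for(key: str) -> str:
    # A stable hash keeps a given key on the same server; real systems use richer
    # schemes (e.g., consistent hashing) to cope with servers joining or leaving.
    return servers[sum(key.encode()) % len(servers)]

for user_id in ("u1001", "u1002", "u1003", "u1004"):
    print(user_id, "->", shard_for(user_id))
```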
Department of CSE- Data Science
Advantages of NoSQL
1. Can easily scale up and down: NoSQL database supports scaling rapidly and
elastically and even allows to scale to the cloud.
a. Cluster scale: It allows distribution of database across 100+ nodes often in
multiple data centers.
b. Performance scale: It sustains over 100,000+ database reads and writes
per second.
c. Data scale: It supports housing of 1 billion+ documents in the database.
Department of CSE- Data Science
2. Doesn't require a pre-defined schema: NoSQL does not require any adherence to pre-
defined schema. It is pretty flexible. For example, if we look at MongoDB, the
documents in a collection can have different sets of key-value pairs.
3. Cheap, easy to implement: Deploying NoSQL properly allows for all of the benefits of
scale, high availability, fault tolerance, etc. while also lowering operational costs.
4. Relaxes the data consistency requirement: NoSQL databases have adherence to CAP
theorem (Consistency, Availability, and Partition tolerance). Most of the NoSQL
databases compromise on consistency in favor of availability and partition tolerance.
Department of CSE- Data Science
5. Data can be replicated to multiple nodes and can be partitioned: There are two
terms that we will discuss here:
a) Sharding: Sharding is when different pieces of data are distributed across multiple
servers. NoSQL databases support auto-sharding; this means that they can natively
and automatically spread data across an arbitrary number of servers, without
requiring the application to even be aware of the composition of the server pool.
Servers can be added or removed from the data layer without application
downtime. This would mean that data and query load are automatically balanced
across servers, and when a server goes down, it can be quickly and transparently
replaced with no application disruption.
b) Replication: Replication is when multiple copies of data are stored across the cluster
and even across data centers. This promises high availability and fault tolerance.
Department of CSE- Data Science
What We Miss With NoSQL?
 NoSQL does not support joins. However, it compensates for this by allowing embedded
documents, as in MongoDB.
 It does not have provision for ACID properties of transactions. However, it obeys the
Brewer’s CAP theorem.
 NoSQL does not have a standard SQL interface, but NoSQL databases such as MongoDB
and Cassandra have their own rich query languages to compensate for the lack of it.
Department of CSE- Data Science
Use of NoSQL in Industry
 NoSQL databases are being put to use in varied industries. They are used to support analysis for
applications such as web user data analysis, log analysis, sensor feed analysis, making
recommendations for upsell and cross-sell, etc.
Department of CSE- Data Science
NoSQL Vendors
Department of CSE- Data Science
SQL versus NoSQL
Department of CSE- Data Science
NewSQL
 We need a database that has the same scalable performance of NoSQL systems for
On Line Transaction Processing (OLTP) while still maintaining the ACID guarantees of a
traditional database. This new modern RDBMS is called NewSQL.
 It supports relational data model and uses SQL as their primary interface.
 NewSQL is based on the shared nothing architecture with a SQL interface for
application interaction.
Department of CSE- Data Science
Characteristics of NewSQL
Department of CSE- Data Science
Comparison of SQL, NoSQL, and NewSQL
Department of CSE- Data Science
HADOOP
 Hadoop is an open-source project of the Apache foundation.
 It is a framework written in Java, originally developed by Doug Cutting in 2005
who named it after his son's toy elephant. He was working with Yahoo then.
 It was created to support distribution for “Nutch”, the text search engine.
Hadoop uses Google’s MapReduce and Google File System technologies as its
foundation.
 Hadoop is now a core part of the computing infrastructure for companies such as
Yahoo, Facebook, LinkedIn, Twitter, etc.
Department of CSE- Data Science
Figure : Hadoop
Department of CSE- Data Science
Features of Hadoop
1. It is optimized to handle massive quantities of structured, semi-structured, and
unstructured data, using commodity hardware, that is, relatively inexpensive
computers.
2. Hadoop has a shared nothing architecture.
3. It replicates its data across multiple computers so that if one goes down, the data can
still be processed from another machine that stores its replica.
4. Hadoop is for high throughput rather than low latency. It is a batch operation handling
massive quantities of data; therefore the response time is not immediate.
5. It complements On-Line Transaction Processing (OLTP) and On-Line Analytical
Processing (OLAP). However, it is not a replacement for a relational database
management system.
6. It is NOT good when work cannot be parallelized or when there are dependencies
within the data.
7. It is NOT good for processing small files. It works best with huge data files and
datasets.
Department of CSE- Data Science
Key Advantages of Hadoop
Department of CSE- Data Science
1. Stores data in its native format: Hadoop’s data storage framework (HDFS — Hadoop
Distributed File System) can store data in its native format. There is no structure that
is imposed while keying in data or storing data. HDFS is pretty much schema-less. It is
only later when the data needs to be processed that structure is imposed on the raw
data.
2. Scalable: Hadoop can store and distribute very large datasets (involving thousands of
terabytes of data) across hundreds of inexpensive servers that operate in parallel.
3. Cost-effective: Owing to its scale-out architecture, Hadoop has a much reduced
cost/terabyte of storage and processing.
Department of CSE- Data Science
4. Resilient to failure: Hadoop is fault-tolerant. It practices replication of data diligently,
which means whenever data is sent to any node, the same data also gets replicated to
other nodes in the cluster, thereby ensuring that in the event of a node failure, there will
always be another copy of data available for use.
5. Flexibility: One of the key advantages of Hadoop is its ability to work with all kinds of
data: structured, semi-structured, and unstructured data. It can help derive meaningful
business insights from email conversations, social media data, click-stream data, etc. It
can be put to several purposes such as log analysis, data mining, recommendation
systems, market campaign analysis, etc.
6. Fast: Processing is extremely fast in Hadoop as compared to other conventional systems
owing to the “move code to data” paradigm.
Department of CSE- Data Science
Versions of Hadoop
There are two versions of Hadoop available:
1. Hadoop 1.0
2. Hadoop 2.0
Department of CSE- Data Science
Hadoop 1.0
 It has two main parts:
1. Data storage framework: It is a general-purpose file system called Hadoop
Distributed File System(HDFS). HDFS is schema-less. It simply stores data files.
These data files can be in just about any format. The idea is to store files as close to
their original form as possible. This in turn provides the business units and the
organization the much needed flexibility and agility without being overly worried by
what it can implement.
2. Data processing framework: This is a simple functional programming model initially
popularized by Google as MapReduce. It essentially uses two functions: the MAP
and the REDUCE functions to process data. The “Mappers” take in a set of key-value
pairs and generate intermediate data (which is another list of key-value pairs). The
“Reducers” then act on this input to produce the output data. The two functions
seemingly work in isolation from one another, thus enabling the processing to be
highly distributed in a highly-parallel, fault-tolerant, and scalable way.
Department of CSE- Data Science
Limitations of Hadoop 1.0
1. The first limitation was the requirement for MapReduce programming expertise along
with proficiency required in other programming languages, notably Java.
2. It supported only batch processing, which is suitable for tasks such as log
analysis and large-scale data mining projects but pretty much unsuitable for other kinds of
projects.
3. One major limitation was that Hadoop 1.0 was tightly computationally coupled with
MapReduce, which meant that the established data management vendors were left
with two options: Either rewrite their functionality in MapReduce so that it could be
executed in Hadoop or extract the data from HDFS and process it outside of Hadoop.
None of the options were viable as it led to process inefficiencies caused by the data
being moved in and out of the Hadoop cluster.
Department of CSE- Data Science
Hadoop 2.0
 HDFS continues to be the data storage framework.
 A new and separate resource management framework called Yet Another Resource
Negotiator (YARN) has been added.
 Any application capable of dividing itself into parallel tasks is supported by YARN.
 YARN coordinates the allocation of subtasks of the submitted application, thereby
further enhancing the flexibility, scalability, and efficiency of the applications.
 It works by having an ApplicationMaster which is able to run any application and not
just MapReduce.
 It not only supports batch processing but also real-time processing.
Department of CSE- Data Science
Overview of Hadoop Ecosystems
 There are components available in the Hadoop ecosystem for data ingestion, processing,
and analysis.
Data Ingestion → Data Processing → Data Analysis
Department of CSE- Data Science
HDFS
 It is the distributed storage unit of Hadoop. It provides streaming access to file
system data as well as file permissions and authentication.
 It is based on GFS (Google File System).
 It is used to scale a single cluster node to hundreds and thousands of nodes.
 It handles large datasets running on commodity hardware.
 HDFS is highly fault-tolerant. It stores files across multiple machines.
 These files are stored in redundant fashion to allow for data recovery in case of
failure.
Department of CSE- Data Science
HBase
 HBase stores data in HDFS.
 It is the first non-batch component of the Hadoop Ecosystem.
 It is a database on top of HDFS. It provides quick random access to the stored data.
 It has very low latency compared to HDFS.
 It is a NoSQL database: it is non-relational and column-oriented.
 A table can have thousands of columns.
 A table can have multiple rows.
 Each row can have several column families.
 Each column family can have several columns.
 Each column can have several key values. It is based on Google BigTable.
 It is widely used by Facebook, Twitter, Yahoo, etc.
Department of CSE- Data Science
Difference between HBase and Hadoop/HDFS
1. HDFS is the file system whereas HBase is a Hadoop database. It is like NTFS and MySQL.
2. HDFS is WORM (write once and read multiple or many times). The latest versions support
appending of data, but this feature is rarely used. However, HBase supports real-time
random read and write.
3. HDFS is based on Google File System (GFS) whereas HBase is based on Google Big Table.
4. HDFS supports only full table scans or partition table scans. HBase supports random small-
range scans or table scans.
5. Performance of Hive on HDFS is relatively very good, but for HBase it becomes 4-5 times
slower.
6. The access to data is via MapReduce jobs only in HDFS, whereas in HBase the access is via
Java APIs, REST, Avro, or Thrift APIs.
7. HDFS does not support dynamic storage owing to its rigid structure whereas HBase
supports dynamic storage.
8. HDFS has high latency operations whereas HBase has low latency operations.
9. HDFS is most suitable for batch analytics whereas HBase is for real-time analytics.
Department of CSE- Data Science
 Hadoop Ecosystem Components for Data Ingestion
1. Sqoop: Sqoop stands for SQL to Hadoop. Its main functions are
a. Importing data from RDBMSs such as MySQL, Oracle, DB2, etc. to the Hadoop file system
(HDFS, HBase, Hive).
b. Exporting data from the Hadoop file system (HDFS, HBase, Hive) to RDBMSs (MySQL,
Oracle, DB2).
Uses of Sqoop
a. It has a connector-based architecture to allow plug-ins to connect to external systems
such as MySQL, Oracle, DB2, etc.
b. It can provision the data from an external system on to HDFS and populate tables in Hive
and HBase.
c. It integrates with Oozie, allowing you to schedule and automate import and export
tasks.
2. Flume: Flume is an important log aggregator (it aggregates logs from different machines
and places them in HDFS) component in the Hadoop ecosystem. Flume has been
developed by Cloudera. It is designed for high-volume ingestion of event-based data into
Hadoop. The default destination in Flume (called a sink in Flume parlance) is HDFS.
However, it can also write to HBase or Solr.
Department of CSE- Data Science
1. MapReduce:
 It is a programming paradigm that allows distributed and parallel processing of
huge datasets.
 It is based on Google MapReduce.
 Google released a paper on MapReduce programming paradigm in 2004 and that
became the genesis of Hadoop processing model.
 The MapReduce framework gets the input data from HDFS.
Hadoop Ecosystem Components for Data Processing
Department of CSE- Data Science
 There are two main phases: Map phase and the Reduce phase.
 The map phase converts the input data into another set of data (key-value pairs).
 This new intermediate dataset then serves as the input to the reduce phase.
 The reduce phase acts on the datasets to combine (aggregate and consolidate) and
reduce them to a smaller set of tuples.
 The result is then stored back in HDFS.
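A pure-Python sketch of the flow just described, using the classic word-count example (the input documents are made up): map emits key-value pairs, the pairs are grouped by key, and reduce aggregates each group. A real Hadoop job distributes these phases across the nodes of the cluster.

```python
from collections import defaultdict

documents = ["big data is big", "hadoop processes big data"]

# Map phase: input records -> intermediate (key, value) pairs.
intermediate = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values belonging to the same key.
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase: combine each group into a smaller set of tuples.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)   # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```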
Department of CSE- Data Science
2. Spark:
 It is both a programming model as well as a computing model.
 It is an open-source big data processing framework. It was originally developed in
2009 at UC Berkeley's AMPLab and became an open-source project in 2010.
 It is written in Scala. It provides in-memory computing for Hadoop.
 In Spark, workloads execute in memory rather than on disk, owing to which it is much
faster (10 to 100 times) than when the workload is executed on disk.
 If the datasets are too large to fit into the available system memory, it can perform
conventional disk-based processing.
 It serves as a potentially faster and more flexible alternative to MapReduce.
 It accesses data from HDFS (Spark does not have its own distributed file system) but
bypasses the MapReduce processing.
Department of CSE- Data Science
 Spark can be used with Hadoop coexisting smoothly with MapReduce (sitting on top
of Hadoop YARN) or used independently of Hadoop (standalone).
 As a programming model, it works well with Scala; it also has API connectors for
using it with Java, Python, or R.
 The following are the Spark libraries:
a.Spark SQL: Spark also has support for SQL. Spark SQL uses SQL to help query
data stored in disparate applications.
b.Spark streaming: It helps to analyze and present data in real time
c.MLlib: It supports machine learning, such as applying advanced statistical
operations on data in a Spark cluster.
d.GraphX: It helps in graph parallel computation.
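A short PySpark sketch of the programming model described above, assuming the pyspark package is installed; the input path is a placeholder. It runs the same word count as before, but as an in-memory Spark job instead of a MapReduce job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/sample.txt")   # or a local path
counts = (lines.flatMap(lambda line: line.split())               # "map": words
               .map(lambda word: (word, 1))                      # key-value pairs
               .reduceByKey(lambda a, b: a + b))                  # "reduce": totals

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```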
Department of CSE- Data Science
 Spark and Hadoop are usually used together by several companies.
 Hadoop was primarily designed to house unstructured data and run batch processing
operations on it.
 Spark is used extensively for its high speed in memory computing and ability to run
advanced real-time analytics.
 The two together have been giving very good results.
Department of CSE- Data Science
Hadoop Ecosystem Components for Data Analysis
1. Pig: It is a high-level scripting language used with Hadoop. It serves as an alternative
to MapReduce. It has two parts:
a. Pig Latin: It is a SQL-like scripting language. Pig Latin scripts are translated into
MapReduce jobs which can then run on YARN and process data in the HDFS cluster.
There is a “Load” command available to load the data from HDFS into Pig. Then one
can perform functions such as grouping, filtering, sorting, joining, etc. The processed
or computed data can then be either displayed on screen or placed back into HDFS. Pig
gives you a platform for building data flows for ETL (Extract, Transform and Load),
processing and analyzing huge data sets.
b. Pig runtime: It is the runtime environment.
Department of CSE- Data Science
2. Hive: Hive is a data warehouse software project built on top of Hadoop. Three main
tasks performed by Hive are summarization, querying and analysis. It supports
queries written in a language called HQL or HiveQL which is a declarative SQL-like
language. It converts the SQL-style queries into MapReduce jobs which are then
executed on the Hadoop platform.
Department of CSE- Data Science
Difference between Hive and RDBMS
1. Hive enforces schema on Read Time whereas RDBMS enforces schema on Write
Time.
 In RDBMS, at the time of loading/inserting data, the table’s schema is enforced. If the
data being loaded does not conform to the schema then it is rejected. Thus, the
schema is enforced on write (loading the data into the database). Schema on write
takes longer to load the data into the database; however it makes up for it during
data retrieval with a good query time performance.
 Hive does not enforce the schema when the data is being loaded into the D/W. It is
enforced only when the data is being read/retrieved. This is called schema on read. It
definitely makes for fast initial load as the data load or insertion operation is just a
file copy or move.
Department of CSE- Data Science
2. Hive is based on the notion of write once and read many times whereas the RDBMS is
designed for read and write many times.
3. Hadoop is a batch-oriented system. Hive, therefore, is not suitable for OLTP (Online
Transaction Processing) but, although not ideal, seems closer to OLAP (Online Analytical
Processing). The reason being that there is quite a latency between issuing a query and
receiving a reply as the query written in HiveQL will be converted to MapReduce jobs
which are then executed on the Hadoop cluster. RDBMS is suitable for housing day-to-day
transaction data and supports all OLTP operations with frequent insertions, modifications
(updates), deletions of the data.
4. Hive handles static data analysis which is non-real-time data. Hive is the data
warehouse of Hadoop. There are no frequent updates to the data and the query
response time is not fast. RDBMS is suited for handling dynamic data which is real
time.
Department of CSE- Data Science
5. Hive can be easily scaled at a very low cost when compared to an RDBMS. Hive uses HDFS
to store data, and thus it cannot be considered the owner of the data, while on the
other hand an RDBMS is the owner of the data, responsible for storing, managing, and
manipulating it in the database.
6. Hive uses the concept of parallel computing, whereas RDBMS uses serial computing.
Department of CSE- Data Science
Difference between Hive and HBase
1. Hive is a MapReduce-based SQL engine that runs on top of Hadoop. HBase is a key-
value NoSQL database that runs on top of HDFS.
2. Hive is for batch processing of big data. HBase is for real-time data streaming.
Impala
 It is a high performance SQL engine that runs on Hadoop cluster. It is ideal for
interactive analysis. It has very low latency measured in milliseconds. It supports a
dialect of SQL called Impala SQL.
ZooKeeper
 It is a coordination service for distributed applications.
Oozie
 It is a workflow scheduler system to manage Apache Hadoop jobs.
Department of CSE- Data Science
Mahout
 It is a scalable machine learning and data mining library.
Chukwa
 It is a data collection system for managing large distributed systems.
Ambari
 It is a web-based tool for provisioning, managing, and monitoring Apache Hadoop
clusters.
Department of CSE- Data Science
Hadoop Distributions
 Hadoop is an open-source Apache project.
Anyone can freely download the core
aspects of Hadoop.
 The core aspects of Hadoop include the
following:
1.Hadoop Common
2.Hadoop Distributed File System (HDFS)
3.Hadoop YARN (Yet Another Resource
Negotiator)
4. Hadoop MapReduce
Department of CSE- Data Science
Hadoop versus SQL
Department of CSE- Data Science
Integrated Hadoop Systems Offered by Leading Market Vendors
Department of CSE- Data Science
Cloud-Based Hadoop Solutions
 Amazon Web Services holds out a comprehensive, end-to-end portfolio of cloud
computing services to help manage big data. The aim is to achieve this and more
along with retaining the emphasis on reducing costs, scaling to meet demand, and
accelerating the speed of innovation.
 The Google Cloud Storage connector for Hadoop empowers one to perform MapReduce
jobs directly on data in Google Cloud Storage, without the need to copy it to the local disk and
run it in the Hadoop Distributed File System (HDFS). The connector simplifies Hadoop
deployment, and at the same time reduces cost and provides performance comparable to
HDFS, all this while increasing reliability by eliminating the single point of failure of the
name node.
