Department of CSE- Data Science
Module-1
Introduction to Big Data, Big Data Analytics
Contents
 Classification of data
 Characteristics
 Evolution and definition of Big data
 What is Big data
 Why Big data
 Traditional Business Intelligence Vs Big Data
 Typical data warehouse and Hadoop environment
 Big Data Analytics: What is Big data Analytics
 Classification of Analytics
 Importance of Big Data Analytics
 Technologies used in Big data Environments
 Few Top Analytical Tools, NoSQL, Hadoop.
Introduction
 Data is present internal to the enterprise and also exists outside the four walls and
firewalls of the enterprise.
 Data is present in homogeneous sources as well as in heterogeneous sources.
Data → Information
Information → Insights
Classification of Digital data
Structured data
 Data which is in an organized form (e.g., rows and columns) and can be
easily used by a computer program.
 Relationships exist between entities of data, such as classes and their
objects.
 Data stored in databases is an example of structured data.
Semi-structured data
 Data which does not conform to a data model but has some structure.
 It is not in a form which can be used easily by a computer program.
 For example, emails, XML, markup languages like HTML, etc.
Unstructured data
 Data which does not conform to a data model or is not in a form which can be
used easily by a computer program.
 About 80%–90% of an organization's data is in this format.
 For example, memos, chat rooms, PowerPoint presentations, images, videos,
letters, etc.
Structured Data
 Most of the structured data is held in RDBMS.
 An RDBMS conforms to the relational data model wherein the data is stored in
rows/columns.
 The number of rows/records/tuples in a relation is called the cardinality of a
relation and the number of columns is referred to as the degree of a relation.
 The first step is the design of a relation/table, the fields/columns to store the data,
and the type of data that will be stored [number (integer or real), alphabets, date,
Boolean, etc.].
 Next we think of the constraints that we would like our data to conform to
(constraints such as UNIQUE values in the column, NOT NULL values in the
column, a business constraint such as the value held in the column should not
drop below 50, the set of permissible values in the column such as the column
should accept only “CS”, “IS”, “MS”, etc., as input).
 Example: Let us design a table/relation structure to store the details of the
employees of an enterprise.
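As an aside, such column constraints can be sketched in a few lines of SQL. The snippet below uses Python's built-in sqlite3 module; the Employee columns and sample values are illustrative, not the exact relation designed in the text.

```python
import sqlite3

# In-memory database for illustration; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employee (
        EmpNo    INTEGER PRIMARY KEY,                          -- UNIQUE + NOT NULL
        EmpName  TEXT NOT NULL,                                -- NOT NULL constraint
        Salary   REAL CHECK (Salary >= 50),                    -- business constraint
        DeptCode TEXT CHECK (DeptCode IN ('CS', 'IS', 'MS'))   -- permissible values
    )
""")
conn.execute("INSERT INTO Employee VALUES (1, 'Asha', 75000.0, 'CS')")

# A row that violates a constraint is rejected by the database itself:
try:
    conn.execute("INSERT INTO Employee VALUES (2, 'Ravi', 80000.0, 'EE')")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)
```

Only the first row is stored; the second fails the CHECK on DeptCode before it ever reaches the table.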
 The tables in an RDBMS can also be related. For example, the above “Employee”
table is related to the “Department” table on the basis of the common column,
“DeptNo”.
Fig: Relationship between “Employee” and “Department” tables
Sources of Structured Data
 RDBMSs [Oracle, IBM DB2, Microsoft SQL Server, EMC Greenplum, Teradata,
MySQL (open source), PostgreSQL (advanced open source), etc.] are used to hold
transaction/operational data generated and collected by day-to-day business
activities.
 The data of the On-Line Transaction Processing (OLTP) systems are generally quite
structured.
Ease of Working with Structured Data
1. Insert/update/delete: The Data
Manipulation Language (DML) operations
provide the required ease with data
input, storage, access, process, analysis,
etc.
2. Security: Robust encryption and tokenization solutions are
available to ensure the security of information
throughout its lifecycle. Organizations are
able to retain control and maintain
compliance by ensuring that
only authorized individuals are able to
decrypt and view sensitive information.
3. Indexing: An index is a data structure that speeds up the data retrieval operations (primarily
the SELECT DML statement) at the cost of additional writes and storage space, but the
benefits that ensue in search operation are worth the additional writes and storage space.
4. Scalability: The storage and processing capabilities of the traditional RDBMS can be easily
scaled up by increasing the horsepower of the database server (increasing the primary and
secondary or peripheral storage capacity, processing capacity of the processor, etc.).
5. Transaction processing: RDBMS has support for Atomicity, Consistency, Isolation, and
Durability (ACID) properties of transaction. Given next is a quick explanation of the ACID
properties:
 Atomicity: A transaction is atomic, meaning that either it happens in its entirety or
not at all.
 Consistency: The database moves from one consistent state to another consistent state. In
other words, if the same piece of information is stored at two or more places, they are in
complete agreement.
 Isolation: The resource allocation to the transaction happens such that the transaction gets
the impression that it is the only transaction happening in isolation.
 Durability: All changes made to the database during a transaction are permanent and that
accounts for the durability of the transaction.
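Atomicity is the easiest of the four to demonstrate: a transfer between two accounts either happens in its entirety or not at all. A minimal sketch using Python's sqlite3; the Account table and the simulated failure are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account (Name TEXT PRIMARY KEY, Balance REAL)")
conn.executemany("INSERT INTO Account VALUES (?, ?)",
                 [("A", 100.0), ("B", 100.0)])
conn.commit()

# Transfer 50 from A to B, but fail midway through the transaction.
try:
    with conn:  # one transaction: commit on success, rollback on error
        conn.execute("UPDATE Account SET Balance = Balance - 50 WHERE Name = 'A'")
        raise RuntimeError("simulated crash mid-transfer")
except RuntimeError:
    pass

# The partial debit was rolled back, so both balances are unchanged.
print(conn.execute("SELECT Balance FROM Account ORDER BY Name").fetchall())
# [(100.0,), (100.0,)]
```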
Semi-structured Data
 Semi-structured data is also referred to as self-describing structure.
 Features
1. It does not conform to the data models that one typically associates with relational
databases or any other form of data tables.
2. It uses tags to segregate semantic elements.
3. Tags are also used to enforce hierarchies of records and fields within data.
4. There is no separation between the data and the schema. The amount of structure used is
dictated by the purpose at hand.
5. In semi-structured data, entities belonging to the same class and grouped together
need not have the same set of attributes. Even if they do have the same set of
attributes, the order of the attributes may differ; for all practical purposes, the
order is not important.
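Feature 5 is easy to see with JSON, a common semi-structured format. The two customer records below (invented for illustration) belong to the same class, yet carry different attributes in different orders:

```python
import json

# Self-describing records: each key acts as a tag travelling with the data.
records = [
    {"name": "Asha", "email": "asha@example.com", "phone": "98450-00000"},
    {"email": "ravi@example.com", "name": "Ravi", "loyalty_tier": "gold"},
]

doc = json.dumps(records)   # schema and data are serialized together
parsed = json.loads(doc)

# Consuming code tolerates missing attributes instead of assuming a fixed schema.
for r in parsed:
    print(r["name"], r.get("phone", "no phone on record"))
```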
Characteristics of semi-structured data
Sources of Semi-Structured Data
Unstructured Data
Sources of Unstructured Data
Issues with Unstructured Data
Dealing with Unstructured Data
Properties | Structured data | Semi-structured data | Unstructured data
Technology | Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data
Transaction management | Matured transactions and various concurrency techniques | Transactions adapted from the DBMS; not matured | No transaction management and no concurrency
Version management | Versioning over tuples, rows, and tables | Versioning over tuples or graphs is possible | Versioned as a whole
Flexibility | Schema dependent and less flexible | More flexible than structured data but less flexible than unstructured data | More flexible; absence of schema
Scalability | Scaling the DB schema is very difficult | Scaling is simpler than for structured data | More scalable
Robustness | Very robust | New technology; not very widespread | -
Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible
Characteristics of Data
Data has three characteristics:
1. Composition: deals with the structure of the data, that is, the sources of the data,
the granularity, the types, and the nature of the data as to whether it is static or
real-time streaming.
2. Condition: The condition of data deals with the state of the data that is “can one
use this data as is for analysis?” or “Does it require cleansing for further
enhancement and enrichment?”
3. Context: deals with “Where has this data been generated?”, “Why was this data
generated?” and so on.
EVOLUTION OF BIG DATA
Definition of Big Data
Challenges With Big Data
1. Data today is growing at an exponential rate. Most of the data that we have
today has been generated in the last 2-3 years. This high tide of data will
continue to rise incessantly. The key questions here are: “Will all this data be
useful for analysis?”, “Do we work with all this data or a subset of it?”, “How will
we separate the knowledge from the noise?”, etc. Cloud computing and
virtualization are here to stay.
2. Cloud computing is the answer to managing infrastructure for big data as far as
cost-efficiency, elasticity, and easy upgrading/downgrading is concerned. This
further complicates the decision to host big data solutions outside the enterprise.
3. The other challenge is to decide on the period of retention of big data. Just how
long should one retain this data? A tricky question indeed as some data is useful
for making long-term decisions, whereas in a few cases, the data may quickly
become irrelevant and obsolete just a few hours after having been generated.
4. There is a dearth of skilled professionals who possess a high level of proficiency in
data sciences that is vital in implementing big data solutions.
5. Then, of course, there are other challenges with respect to capture, storage,
preparation, search, analysis, transfer, security, and visualization of big data. Big data
refers to datasets whose size is typically beyond the storage capacity of traditional
database software tools. There is no explicit definition of how big the dataset should
be for it to be considered “big data.” Here we are to deal with data that is just too big,
moves way too fast, and does not fit the structures of typical database systems. The
data changes are highly dynamic and therefore there is a need to ingest it as quickly
as possible.
6. Data visualization is becoming popular as a separate discipline, and there is a
shortage of business visualization experts.
WHAT IS BIG DATA?
 Big data is data that is big in volume, velocity, and variety.
Volume
1. Typical internal sources:
• Data Storage- File systems, SQL, NoSQL (MongoDB, Cassandra).
• Archives – Archives of scanned documents, paper archives, customer
records, patient health records, etc.
2. External data sources:
• public web - Wikipedia, weather, regulatory, census etc.
3. Both (internal + external)
• Sensor data – Car sensors, smart electric meters, office buildings, etc.
• Machine log data – Event logs, application logs, business process logs, audit
logs, etc.
• Social media – Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
• Business apps – ERP, CRM, HR, Google Docs, and so on.
• Media – Audio, video, image, podcast, etc.
• Docs – CSV, Word documents, PDF, XLS, PPT, and so on.
A Mountain of Data
Sources of Big Data
Velocity
Batch → Periodic → Near real-time → Real-time processing
Variety
 Variety deals with a wide range of data types and sources of data.
1. Structured data: From traditional transaction processing systems and RDBMS, etc.
2. Semi-structured data: For example, Hyper Text Markup Language (HTML),
eXtensible Markup Language (XML).
3. Unstructured data: For example, unstructured text documents, audios, videos,
emails, photos, PDFs, social media, etc.
Why Big Data?
Traditional Business Intelligence (Bi) Versus Big Data
1. In a traditional BI environment, all the enterprise’s data is housed in a central
server, whereas in a big data environment data resides in a distributed file
system. The distributed file system scales horizontally (by adding nodes),
whereas a typical database server scales vertically.
2. In traditional BI, data is generally analyzed in an offline mode whereas in big
data, it is analyzed in both real time as well as in offline mode.
3. Traditional BI is about structured data and it is here that data is taken to
processing functions whereas big data is about variety and here the
processing functions are taken to the data.
A Typical Data Warehouse Environment
A Typical Hadoop Environment
WHAT IS BIG DATA ANALYTICS?
1. Technology-enabled analytics: Quite a few data analytics and visualization tools
are available in the market today from leading vendors such as IBM, Tableau,
SAS, R Analytics, Statistica, World Programming Systems (WPS), etc. to help
process and analyze your big data.
2. About gaining a meaningful, deeper, and richer insight into your business to
steer in the right direction, understanding the customer’s demographics to
cross-sell and up-sell to them, better leveraging the services of your vendors
and suppliers, etc.
Author’s experience: The other day I was pleasantly surprised to get a few
recommendations via email from one of my frequently visited online
retailers. They had recommended a clothing line from my favorite brand, and
the suggested color was also to my liking. How did they arrive at this? In
the recent past, I had been buying clothing of a particular brand, and my
color preference was pastel shades. They had this stored in their database and
pulled it out while making recommendations to me.
3. About a competitive edge over your competitors by enabling you with findings that allow
quicker and better decision-making.
4. A tight handshake between three communities: IT, business users, and data scientists.
5. Working with datasets whose volume and variety exceed the current storage and
processing capabilities and infrastructure of your enterprise.
6. About moving code to data. This makes perfect sense as the program for distributed
processing is tiny (just a few KBs) compared to the data (Terabytes or Petabytes today and
likely to be Exabytes or Zettabytes in the near future).
Classification Of Analytics
 There are basically two schools of thought:
1. Those that classify analytics into basic, operationalized, advanced, and
monetized.
2. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.
First School of Thought
1. Basic analytics: This primarily is slicing and dicing of data to help with basic
business insights. This is about reporting on historical data, basic visualization, etc.
2. Operationalized analytics: It is operationalized analytics if it gets woven into the
enterprise’s business processes.
3. Advanced analytics: This largely is about forecasting for the future by way of
predictive and prescriptive modeling.
4. Monetized analytics: This is analytics in use to derive direct business revenue.
Second School of Thought
• Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0.
Table : Analytics 1.0, 2.0, and 3.0
Figure : Analytics 1.0, 2.0, and 3.0.
Importance of Big Data Analytics
Let us study the various approaches to the analysis of data and what they lead to.
1. Reactive — Business Intelligence: What does Business Intelligence (BI) help us
with? It allows the businesses to make faster and better decisions by providing
the right information to the right person at the right time in the right format. It is
about analysis of the past or historical data and then displaying the findings of the
analysis or reports in the form of enterprise dashboards, alerts, notifications, etc.
It has support for both pre-specified reports as well as ad hoc querying.
2. Reactive — Big Data Analytics: Here the analysis is done on huge datasets but the
approach is still reactive as it is still based on static data.
3. Proactive — Analytics: This is to support future decision making by the use of
data mining, predictive modeling, text mining, and statistical analysis. This
analysis is not performed on big data, as it still uses the traditional database
management practices and therefore has severe limitations on storage capacity
and processing capability.
4. Proactive - Big Data Analytics: This is sieving through terabytes, petabytes,
exabytes of information to filter out the relevant data to analyze. This also
includes high performance analytics to gain rapid insights from big data and the
ability to solve complex problems using more data.
Terminologies used in Big data Environments
In-Memory Analytics
 Data access from non-volatile storage such as hard disk is a slow process. The
more the data is required to be fetched from hard disk or secondary storage, the
slower the process gets. One way to combat this challenge is to pre-process and
store data (cubes, aggregate tables, query sets, etc.) so that the CPU has to fetch
a small subset of records. But this requires thinking in advance as to what data
will be required for analysis.
 If there is a need for different or more data, it is back to the initial process of
pre-computing and storing data or fetching it from secondary storage. This
problem has been addressed using in-memory analytics. Here all the relevant
data is stored in Random Access Memory (RAM) or primary storage thus
eliminating the need to access the data from hard disk. The advantage is faster
access, rapid deployment, better insights, and minimal IT involvement.
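The trade-off can be sketched as follows: a precomputed aggregate table answers only the questions anticipated in advance, while data held in RAM can answer new questions directly. The dataset below is synthetic and purely illustrative.

```python
# Synthetic "sales" data held entirely in primary storage (RAM).
rows = [{"region": region, "amount": amount}
        for region, amount in zip(["North", "South"] * 50_000, range(100_000))]

# Pre-computed aggregate table: fast, but fixed at design time.
precomputed = {}
for row in rows:
    precomputed[row["region"]] = precomputed.get(row["region"], 0) + row["amount"]

# In-memory analytics: an unanticipated query runs directly against the raw
# rows in RAM, with no round trip to secondary storage or re-computation step.
north_total = sum(r["amount"] for r in rows if r["region"] == "North")

assert north_total == precomputed["North"]
```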
In-Database Processing
 In-database processing is also called in-database analytics. It works by fusing
data warehouses with analytical systems.
 Typically the data from various enterprise On Line Transaction Processing (OLTP)
systems after cleaning up (de-duplication, scrubbing, etc.) through the process of
ETL is stored in the Enterprise Data Warehouse (EDW) or data marts.
 The huge datasets are then exported to analytical programs for complex and
extensive computations.
 With in-database processing, the database program itself can run the
computations eliminating the need for export and thereby saving on time.
Leading database vendors are offering this feature to large businesses.
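The saving can be sketched with sqlite3, whose engine can run aggregations itself. The Sales table is hypothetical; the point is that the grouped query returns two summary rows instead of exporting every record for external computation.

```python
import math
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Region TEXT, Amount REAL)")
random.seed(7)
conn.executemany("INSERT INTO Sales VALUES (?, ?)",
                 [(random.choice(["North", "South"]), random.uniform(10, 100))
                  for _ in range(10_000)])

# Export-then-compute: every row crosses the database boundary and the
# analytical program does the aggregation.
totals = {}
for region, amount in conn.execute("SELECT Region, Amount FROM Sales"):
    totals[region] = totals.get(region, 0.0) + amount

# In-database processing: the engine computes the result; only two rows return.
in_db = dict(conn.execute("SELECT Region, SUM(Amount) FROM Sales GROUP BY Region"))

assert all(math.isclose(in_db[r], totals[r]) for r in in_db)
```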
Symmetric Multiprocessor System (SMP)
• In SMP there is a single common main memory that is shared by two or more
identical processors.
• The processors have full access to all I/O devices and are controlled by a single
operating system instance.
• SMPs are tightly coupled multiprocessor systems. Each processor has its own
high-speed cache memory, and the processors are connected using a system bus.
Figure : Symmetric Multiprocessor System.
Massively Parallel Processing
 Massively Parallel Processing (MPP) refers to the coordinated processing of
programs by a number of processors working in parallel.
 The processors, each have their own operating systems and dedicated memory.
They work on different parts of the same program.
 The MPP processors communicate using some sort of messaging interface. The
MPP systems are more difficult to program as the application must be divided in
such a way that all the executing segments can communicate with each other.
 MPP is different from Symmetrically Multiprocessing (SMP) in that SMP works
with the processors sharing the same operating system and same memory. SMP is
also referred to as tightly-coupled multiprocessing.
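The idea of dividing one program across processors that each have their own memory can be sketched with Python's multiprocessing module: each worker process sums its own slice of the data in its own address space, and the partial results come back over an inter-process message channel. This is only an illustrative analogy, not a real MPP system.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Runs in a separate process with its own dedicated memory.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1, 1001))
    # The application must be divided so the segments can work independently.
    chunks = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, chunks)  # results arrive via messaging
    print(sum(partials))  # 500500
```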
Difference Between Parallel and Distributed Systems
Parallel Systems
 A parallel database system is a tightly coupled system. The processors co-operate
for query processing.
Figure : Parallel system.
 The user is unaware of the parallelism since he/she has no access to a specific
processor of the system.
 Either the processors have access to a common memory or make use of message
passing for communication.
Figure : Parallel system.
Distributed database systems
 Distributed database systems are known to be loosely coupled and are
composed of individual machines.
 Each of the machines can run their individual application and serve their own
respective user. The data is usually distributed across several machines,
thereby necessitating quite a number of machines to be accessed to answer a
user query.
Figure : Distributed system.
Shared Nothing Architecture
 Let us look at the three most common types of architecture for multiprocessor
high transaction rate systems.
 They are:
1. Shared Memory (SM)
2. Shared Disk (SD)
3. Shared Nothing (SN)
 In shared memory architecture, a common central memory is shared by multiple
processors.
 In shared disk architecture, multiple processors share a common collection of
disks while having their own private memory
 In shared nothing architecture, neither memory nor disk is shared among
multiple processors.
Advantages of a “Shared Nothing Architecture”
1. Fault Isolation: A “Shared Nothing Architecture” provides the benefit of fault
isolation. A fault in a single node is contained and confined to that node exclusively
and exposed only through messages (or the lack of them).
2. Scalability: Assume that the disk is a shared resource. It implies that the controller
and the disk bandwidth are also shared. Synchronization will have to be
implemented to maintain a consistent shared state. This would mean that different
nodes will have to take turns to access the critical data. This imposes a limit on
how many nodes can be added to the distributed shared disk system, thus
compromising on scalability.
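A minimal sketch of the shared-nothing idea: keys are routed to nodes by a deterministic hash, each node owns its own private storage, and no memory or disk is shared. The four-node "cluster" and the node_for routine below are hypothetical.

```python
import hashlib

# Each "node" owns its own private store; nothing is shared between them.
NODES = [dict() for _ in range(4)]

def node_for(key: str) -> int:
    # Deterministic hash routing: no shared state or synchronization needed,
    # so adding nodes does not create contention on a common disk.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)

def put(key, value):
    NODES[node_for(key)][key] = value

def get(key):
    return NODES[node_for(key)].get(key)

put("user:42", {"name": "Asha"})
put("user:99", {"name": "Ravi"})
print(get("user:42"))  # {'name': 'Asha'}
```

Fault isolation follows directly: if one node fails, only the keys hashed to it are affected.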
CAP Theorem Explained
 The CAP theorem is also called Brewer’s Theorem. It states that in a
distributed computing environment (a collection of interconnected nodes that
share data), it is impossible to provide all three of the following guarantees
simultaneously.
 At best you can have two of the following three; one must be sacrificed.
1. Consistency
2. Availability
3. Partition tolerance
Figure : Brewer's CAP.
 Consistency implies that every read fetches the last write.
 Availability implies that reads and writes always succeed. In other words, each
non-failing node will return a response in a reasonable amount of time.
 Partition tolerance implies that the system will continue to function when network
partition occurs.
NoSQL (NOT ONLY SQL)
 The term NoSQL was first coined by Carlo Strozzi in 1998 to name his
lightweight, open-source, relational database that did not expose the
standard SQL interface.
 A few features of NoSQL databases are as follows:
1. They are open source.
2. They are non-relational.
3. They are distributed.
4. They are schema-less.
5. They are cluster friendly.
6. They are born out of 21st century web applications.
Where is it Used?
 NoSQL databases are widely used in big data and other real-time web
applications.
 NoSQL databases are used to stock log data, which can then be pulled for analysis.
 They are used to store social media data and all such data which cannot be stored and
analyzed comfortably in RDBMS.
Figure : Where to use NoSQL?
What is it?
 NoSQL stands for Not Only SQL. These are non-relational, open-source, distributed
databases. They are hugely popular today owing to their ability to scale out
(scale horizontally) and their adeptness at dealing with a rich variety of data:
structured, semi-structured, and unstructured data.
Figure: What is NoSQL?
1. Are non-relational: They do not adhere to the relational data model. In fact, they are
either key-value pairs or document-oriented or column-oriented or graph-based
databases.
2. Are distributed: They are distributed meaning the data is distributed across
several nodes in a cluster constituted of low-cost commodity hardware.
3. Offer no support for ACID properties (Atomicity, Consistency, Isolation, and
Durability): They do not offer support for ACID properties of transactions. On the
contrary, they have adherence to Brewer’s CAP (Consistency, Availability, and
Partition tolerance) theorem and are often seen compromising on consistency in
favor of availability and partition tolerance.
4. Provide no fixed table schema: NoSQL databases are becoming increasingly
popular owing to their support for flexibility of schema. They do not mandate
that the data strictly adhere to any schema structure at the time of storage.
Types of NoSQL Databases
1. Key-value
2. Schema-less
Key-value
 It maintains a big hash table of keys and values.
 For example, Dynamo, Redis, Riak, etc.
Figure : Sample Key-Value Pair in Key-Value Database
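At its core the key-value model is just one big hash table, which a Python dict can mimic. The keys and values below are invented; real stores such as Dynamo, Redis, or Riak expose similar put/get semantics over a distributed cluster.

```python
# One big hash table: keys are opaque strings, values can be any blob,
# and differently shaped values can coexist (no fixed table schema).
store = {}

store["session:1001"] = "2025-01-15T10:04:00Z"            # a plain string
store["cart:1001"] = ["SKU-17", "SKU-98"]                 # a list
store["user:1001"] = {"name": "Asha", "tier": "gold"}     # a document

# Retrieval is by key only; there are no joins or ad hoc queries over values.
print(store["cart:1001"])      # ['SKU-17', 'SKU-98']
print(store.get("user:9999"))  # None: a missing key simply has no value
```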
Figure : Types of NoSQL databases
Why NoSQL?
Advantages of NoSQL
Use of NoSQL in Industry
HADOOP
 Hadoop is an open-source project of the Apache foundation.
 It is a framework written in Java, originally developed by Doug Cutting in 2005,
who named it after his son's toy elephant. He was working with Yahoo at the time.
 It was created to support distribution for “Nutch”, the text search engine.
Hadoop uses Google’s MapReduce and Google File System technologies as its
foundation.
 Hadoop is now a core part of the computing infrastructure for companies such as
Yahoo, Facebook, LinkedIn, Twitter, etc.
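The MapReduce model mentioned above can be sketched in a few lines: map emits (key, value) pairs, a shuffle step groups the pairs by key, and reduce folds each group to a result. The snippet below is a single-process word-count simulation of that flow, not Hadoop's actual Java API.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    # Group intermediate pairs by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Fold each group of values into a single count per word.
    return key, sum(values)

lines = ["big data is big", "hadoop processes big data"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"], counts["data"])  # 3 2
```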
Figure : Hadoop
Features of Hadoop
Key Advantages of Hadoop
Versions of Hadoop
There are two versions of Hadoop available:
1. Hadoop 1.0
2. Hadoop 2.0
Overview of Hadoop Ecosystems
There are components available in the Hadoop ecosystem for data ingestion, processing, and
analysis.
Data Ingestion → Data Processing → Data Analysis
Hadoop Distributions
 The core aspects of Hadoop include the
following:
1. Hadoop Common
2. Hadoop Distributed File System (HDFS)
3. Hadoop YARN (Yet Another Resource
Negotiator)
4. Hadoop MapReduce
THEORY AND PRACTICE ASSIGNMENT SEMESTER MAY 2025.docx
GIÁO ÁN TIẾNG ANH 7 GLOBAL SUCCESS (CẢ NĂM) THEO CÔNG VĂN 5512 (2 CỘT) NĂM HỌ...
MMW-CHAPTER-1-final.pptx major Elementary Education
EDUCATIONAL ASSESSMENT ASSIGNMENT SEMESTER MAY 2025.docx
hemostasis and its significance, physiology
Health aspects of bilberry: A review on its general benefits
CHALLENGES FACED BY TEACHERS WHEN TEACHING LEARNERS WITH DEVELOPMENTAL DISABI...
Thinking Routines and Learning Engagements.pptx
operating_systems_presentations_delhi_nc
CHROMIUM & Glucose Tolerance Factor.pptx
Math 2 Quarter 2 Week 1 Matatag Curriculum
Horaris_Grups_25-26_Definitiu_15_07_25.pdf

Big data Analytics(BAD601) -module-1 ppt

  • 1. Department of CSE- Data Science Module-1 Introduction to Big Data, Big Data Analytics
  • 2. Department of CSE- Data Science Contents  Classification of data  Characteristics  Evolution and definition of Big data  What is Big data  Why Big data  Traditional Business Intelligence Vs Big Data  Typical data warehouse and Hadoop environment  Big Data Analytics: What is Big data Analytics  Classification of Analytics  Importance of Big Data Analytics  Technologies used in Big data Environments  Few Top Analytical Tools, NoSQL, Hadoop.
  • 3. Department of CSE- Data Science Introduction  Data is present internal to the enterprise and also exists outside the four walls and firewalls of the enterprise.  Data is present in homogeneous sources as well as in heterogeneous sources. Data → Information Information → Insights
  • 4. Department of CSE- Data Science Classification of Digital data
  • 5. Department of CSE- Data Science Structured data  Data which is in an organized form (e.g., rows and columns) and can be easily used by a computer program.  Relationships exist between entities of data, such as classes and their objects.  Data stored in databases is an example of structured data.
  • 6. Department of CSE- Data Science Semi-structured data  Data which does not conform to a data model but has some structure.  It is not in a form which can be used easily by a computer program.  For example: emails, XML, markup languages like HTML, etc.
  • 7. Department of CSE- Data Science Unstructured data  Data which does not conform to a data model or is not in a form which can be used easily by a computer program.  About 80%-90% of an organization's data is in this format.  For example: memos, chat rooms, PowerPoint presentations, images, videos, letters, etc.
  • 8. Department of CSE- Data Science Structured Data  Most of the structured data is held in RDBMS.  An RDBMS conforms to the relational data model wherein the data is stored in rows/columns.  The number of rows/records/tuples in a relation is called the cardinality of the relation, and the number of columns is referred to as the degree of the relation.  The first step is the design of a relation/table: the fields/columns to store the data and the type of data that will be stored [number (integer or real), alphabets, date, Boolean, etc.].
  • 9. Department of CSE- Data Science  Next we think of the constraints that we would like our data to conform to (constraints such as UNIQUE values in the column, NOT NULL values in the column, a business constraint such as the value held in the column should not drop below 50, the set of permissible values in the column such as the column should accept only “CS”, “IS”, “MS”, etc., as input).  Example: Let us design a table/relation structure to store the details of the employees of an enterprise.
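The constraints described above can be sketched in code. The following is a minimal illustration using SQLite from Python's standard library; the table and column names are hypothetical, and the CHECK rules mirror the examples in the slide (a set of permissible department codes and a lower bound on a numeric column).

```python
import sqlite3

# Hypothetical "employee" table illustrating the constraints named above:
# NOT NULL, a business rule (salary >= 50), and a set of permissible
# values for the department code ("CS", "IS", "MS").
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        emp_no INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        dept   TEXT CHECK (dept IN ('CS', 'IS', 'MS')),
        salary REAL CHECK (salary >= 50)
    )
""")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 'CS', 1200.0)")

# A row violating a CHECK constraint is rejected by the database itself,
# not by application code.
try:
    conn.execute("INSERT INTO employee VALUES (2, 'Ravi', 'EE', 900.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The point is that the RDBMS enforces the rules declaratively: every insert and update is checked against the schema, so invalid data never enters the table.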
  • 10. Department of CSE- Data Science
  • 11. Department of CSE- Data Science  The tables in an RDBMS can also be related. For example, the above “Employee” table is related to the “Department” table on the basis of the common column, “DeptNo”. Fig: Relationship between “Employee” and “Department” tables
  • 12. Department of CSE- Data Science Sources of Structured Data  RDBMSs such as Oracle, IBM DB2, Microsoft SQL Server, EMC Greenplum, Teradata, MySQL (open source), PostgreSQL (advanced open source), etc., are used to hold transaction/operational data generated and collected by day-to-day business activities.  The data of On-Line Transaction Processing (OLTP) systems is generally quite structured.
  • 13. Department of CSE- Data Science Ease of Working with Structured Data 1. Insert/update/delete: The Data Manipulation Language (DML) operations provide the required ease with data input, storage, access, processing, analysis, etc. 2. Security: Robust encryption and tokenization solutions are available to warrant the security of information throughout its lifecycle. Organizations are able to retain control and maintain compliance adherence by ensuring that only authorized individuals are able to decrypt and view sensitive information.
  • 14. Department of CSE- Data Science 3. Indexing: An index is a data structure that speeds up data retrieval operations (primarily the SELECT DML statement) at the cost of additional writes and storage space, but the benefits that ensue in search operations are worth the additional writes and storage space. 4. Scalability: The storage and processing capabilities of a traditional RDBMS can be scaled up by increasing the horsepower of the database server (increasing the primary and secondary or peripheral storage capacity, the processing capacity of the processor, etc.). 5. Transaction processing: RDBMS has support for the Atomicity, Consistency, Isolation, and Durability (ACID) properties of transactions. Given next is a quick explanation of the ACID properties:  Atomicity: A transaction is atomic, meaning that either it happens in its entirety or not at all.  Consistency: The database moves from one consistent state to another consistent state. In other words, if the same piece of information is stored at two or more places, they are in complete agreement.  Isolation: Resource allocation to the transaction happens such that the transaction gets the impression that it is the only transaction happening.  Durability: All changes made to the database during a transaction are permanent, which accounts for the durability of the transaction.
  • 15. Department of CSE- Data Science Semi-structured Data  Semi-structured data is also referred to as self-describing structure.  Features 1. It does not conform to the data models that one typically associates with relational databases or any other form of data tables. 2. It uses tags to segregate semantic elements. 3. Tags are also used to enforce hierarchies of records and fields within the data. 4. There is no separation between the data and the schema. The amount of structure used is dictated by the purpose at hand. 5. In semi-structured data, entities belonging to the same class and grouped together need not necessarily have the same set of attributes. Even if they do have the same set of attributes, the order of the attributes may not be the same, and for all practical purposes that is not important.
  • 16. Department of CSE- Data Science Characteristics of semi-structured data
  • 17. Department of CSE- Data Science Sources of Semi-Structured Data
  • 18. Department of CSE- Data Science Unstructured Data Sources of Unstructured Data
  • 19. Department of CSE- Data Science Issues with Unstructured Data
  • 20. Department of CSE- Data Science Dealing with Unstructured Data
  • 21. Department of CSE- Data Science Comparison of structured, semi-structured, and unstructured data:
Technology: Structured - based on relational database tables; Semi-structured - based on XML/RDF (Resource Description Framework); Unstructured - based on character and binary data.
Transaction management: Structured - matured transactions and various concurrency techniques; Semi-structured - transactions adapted from DBMS, not matured; Unstructured - no transaction management and no concurrency.
Version management: Structured - versioning over tuples, rows, tables; Semi-structured - versioning over tuples or graphs is possible; Unstructured - versioned as a whole.
Flexibility: Structured - schema-dependent and less flexible; Semi-structured - more flexible than structured data but less flexible than unstructured data; Unstructured - most flexible, absence of schema.
Scalability: Structured - scaling the DB schema is very difficult; Semi-structured - scaling is simpler than for structured data; Unstructured - most scalable.
Robustness: Structured - very robust; Semi-structured - newer technology, not widely spread; Unstructured - not specified.
Query performance: Structured - structured queries allow complex joins; Semi-structured - queries over anonymous nodes are possible; Unstructured - only textual queries are possible.
  • 22. Department of CSE- Data Science Characteristics of Data Data has three characteristics: 1. Composition: deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of the data as to whether it is static or real-time streaming. 2. Condition: deals with the state of the data, that is, “Can one use this data as is for analysis?” or “Does it require cleansing for further enhancement and enrichment?” 3. Context: deals with “Where has this data been generated?”, “Why was this data generated?”, and so on.
  • 23. Department of CSE- Data Science EVOLUTION OF BIG DATA
  • 24. Department of CSE- Data Science Definition of Big Data
  • 25. Department of CSE- Data Science Challenges With Big Data
  • 26. Department of CSE- Data Science 1. Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last 2-3 years. This high tide of data will continue to rise incessantly. The key questions here are: “Will all this data be useful for analysis?”, “Do we work with all this data or a subset of it?”, “How will we separate the knowledge from the noise?”, etc. 2. Cloud computing and virtualization are here to stay. Cloud computing is the answer to managing infrastructure for big data as far as cost-efficiency, elasticity, and easy upgrading/downgrading are concerned. This further complicates the decision to host big data solutions outside the enterprise. 3. The other challenge is to decide on the period of retention of big data. Just how long should one retain this data? A tricky question indeed, as some data is useful for making long-term decisions, whereas in a few cases the data may become irrelevant and obsolete just a few hours after having been generated.
  • 27. Department of CSE- Data Science 4. There is a dearth of skilled professionals who possess the high level of proficiency in data sciences that is vital in implementing big data solutions. 5. Then, of course, there are other challenges with respect to capture, storage, preparation, search, analysis, transfer, security, and visualization of big data. Big data refers to datasets whose size is typically beyond the storage capacity of traditional database software tools. There is no explicit definition of how big a dataset should be for it to be considered “big data.” Here we are to deal with data that is just too big, moves way too fast, and does not fit the structures of typical database systems. The data changes are highly dynamic and therefore there is a need to ingest it as quickly as possible. 6. Data visualization is becoming popular as a separate discipline. We are short by quite a number as far as business visualization experts are concerned.
  • 28. Department of CSE- Data Science WHAT IS BIG DATA?  Big data is data that is big in volume, velocity, and variety. Volume 1. Typical internal sources: • Data storage – file systems, SQL, NoSQL (MongoDB, Cassandra). • Archives – archives of scanned documents, paper archives, customer records, patient health records, etc. 2. External data sources: • Public web – Wikipedia, weather, regulatory, census, etc.
  • 29. Department of CSE- Data Science 3. Both (internal + external) • Sensor data – car sensors, smart electric meters, office buildings, etc. • Machine log data – event logs, application logs, business process logs, audit logs, etc. • Social media – Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc. • Business apps – ERP, CRM, HR, Google Docs, and so on. • Media – audio, video, images, podcasts, etc. • Docs – CSV, Word documents, PDF, XLS, PPT, and so on.
  • 30. Department of CSE- Data Science A Mountain of Data
  • 31. Department of CSE- Data Science Sources of Big Data
  • 32. Department of CSE- Data Science Velocity  Batch  Periodic  Near real-time  Real-time processing Variety  Variety deals with a wide range of data types and sources of data. 1. Structured data: From traditional transaction processing systems, RDBMS, etc. 2. Semi-structured data: For example, Hyper Text Markup Language (HTML), eXtensible Markup Language (XML). 3. Unstructured data: For example, unstructured text documents, audios, videos, emails, photos, PDFs, social media, etc.
  • 33. Department of CSE- Data Science Why Big Data?
  • 34. Department of CSE- Data Science Traditional Business Intelligence (BI) Versus Big Data 1. In a traditional BI environment, all of the enterprise's data is housed in a central server, whereas in a big data environment data resides in a distributed file system. The distributed file system scales by scaling out horizontally, whereas a typical database server scales vertically. 2. In traditional BI, data is generally analyzed in an offline mode, whereas in big data it is analyzed in both real-time and offline modes. 3. Traditional BI is about structured data, and it is here that data is taken to the processing functions, whereas big data is about variety, and here the processing functions are taken to the data.
  • 35. Department of CSE- Data Science A Typical Data Warehouse Environment
  • 36. Department of CSE- Data Science A Typical Hadoop Environment
  • 37. Department of CSE- Data Science WHAT IS BIG DATA ANALYTICS? 1. Technology-enabled analytics: Quite a few data analytics and visualization tools are available in the market today from leading vendors such as IBM, Tableau, SAS, R Analytics, Statistica, World Programming Systems (WPS), etc., to help process and analyze your big data. 2. About gaining a meaningful, deeper, and richer insight into your business to steer it in the right direction: understanding the customer's demographics to cross-sell and up-sell to them, better leveraging the services of your vendors and suppliers, etc. Author's experience: The other day I was pleasantly surprised to get a few recommendations via email from one of my frequently visited online retailers. They had recommended a clothing line from my favorite brand, and the color suggested was also to my liking. How did they arrive at this? In the recent past, I had been buying a particular brand of clothing, and my color preference was pastel shades. They had it stored in their database and pulled it out while making recommendations to me.
  • 38. Department of CSE- Data Science 3. About a competitive edge over your competitors by enabling you with findings that allow quicker and better decision-making. 4. A tight handshake between three communities: IT, business users, and data scientists. 5. Working with datasets whose volume and variety exceed the current storage and processing capabilities and infrastructure of your enterprise. 6. About moving code to data. This makes perfect sense as the program for distributed processing is tiny (just a few KBs) compared to the data (Terabytes or Petabytes today and likely to be Exabytes or Zettabytes in the near future).
  • 39. Department of CSE- Data Science Classification Of Analytics  There are basically two schools of thought: 1. Those that classify analytics into basic, operationalized, advanced, and monetized. 2. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0. First School of Thought 1. Basic analytics: This primarily is slicing and dicing of data to help with basic business insights. It is about reporting on historical data, basic visualization, etc. 2. Operationalized analytics: It is operationalized analytics if it gets woven into the enterprise's business processes. 3. Advanced analytics: This largely is about forecasting for the future by way of predictive and prescriptive modeling. 4. Monetized analytics: This is analytics in use to derive direct business revenue.
  • 40. Department of CSE- Data Science Second School of Thought • Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0. Table : Analytics 1.0, 2.0, and 3.0
  • 41. Department of CSE- Data Science
  • 42. Department of CSE- Data Science Figure : Analytics 1.0, 2.0, and 3.0.
  • 43. Department of CSE- Data Science Importance of Big Data Analytics Let us study the various approaches to the analysis of data and what they lead to. 1. Reactive — Business Intelligence: What does Business Intelligence (BI) help us with? It allows businesses to make faster and better decisions by providing the right information to the right person at the right time in the right format. It is about analysis of past or historical data and then displaying the findings of the analysis or reports in the form of enterprise dashboards, alerts, notifications, etc. It has support for both pre-specified reports as well as ad hoc querying. 2. Reactive — Big Data Analytics: Here the analysis is done on huge datasets, but the approach is still reactive as it is still based on static data.
  • 44. Department of CSE- Data Science 3. Proactive — Analytics: This is to support futuristic decision making by the use of data mining, predictive modeling, text mining, and statistical analysis. This analysis is not on big data, as it still uses traditional database management practices and therefore has severe limitations on storage capacity and processing capability. 4. Proactive — Big Data Analytics: This is sieving through terabytes, petabytes, and exabytes of information to filter out the relevant data to analyze. This also includes high-performance analytics to gain rapid insights from big data and the ability to solve complex problems using more data.
  • 45. Department of CSE- Data Science Terminologies Used in Big Data Environments In-Memory Analytics  Data access from non-volatile storage such as a hard disk is a slow process. The more data that has to be fetched from the hard disk or secondary storage, the slower the process gets. One way to combat this challenge is to pre-process and store data (cubes, aggregate tables, query sets, etc.) so that the CPU has to fetch only a small subset of records. But this requires thinking in advance about what data will be required for analysis.  If there is a need for different or more data, it is back to the initial process of pre-computing and storing data or fetching it from secondary storage. This problem has been addressed using in-memory analytics. Here all the relevant data is stored in Random Access Memory (RAM) or primary storage, thus eliminating the need to access the data from the hard disk. The advantages are faster access, rapid deployment, better insights, and minimal IT involvement.
  • 46. Department of CSE- Data Science In-Database Processing  In-database processing is also called in-database analytics. It works by fusing data warehouses with analytical systems.  Typically, the data from the various enterprise On-Line Transaction Processing (OLTP) systems, after cleaning up (de-duplication, scrubbing, etc.) through the process of ETL, is stored in the Enterprise Data Warehouse (EDW) or data marts.  The huge datasets are then exported to analytical programs for complex and extensive computations.  With in-database processing, the database program itself can run the computations, eliminating the need for export and thereby saving time. Leading database vendors offer this feature to large businesses.
  • 47. Department of CSE- Data Science Symmetric Multiprocessor System (SMP) • In SMP there is a single common main memory that is shared by two or more identical processors. • The processors have full access to all I/O devices and are controlled by a single operating system instance. • SMP systems are tightly coupled multiprocessor systems. Each processor has its own high-speed memory, called cache memory, and the processors are connected using a system bus. Figure : Symmetric Multiprocessor System.
  • 48. Department of CSE- Data Science Massively Parallel Processing  Massively Parallel Processing (MPP) refers to the coordinated processing of programs by a number of processors working in parallel.  The processors each have their own operating system and dedicated memory. They work on different parts of the same program.  The MPP processors communicate using some sort of messaging interface. MPP systems are more difficult to program, as the application must be divided in such a way that all the executing segments can communicate with each other.  MPP differs from Symmetric Multiprocessing (SMP) in that in SMP the processors share the same operating system and the same memory. SMP is also referred to as tightly-coupled multiprocessing.
  • 49. Department of CSE- Data Science Difference Between Parallel and Distributed Systems Parallel Systems  A parallel database system is a tightly coupled system. The processors co-operate for query processing. Figure : Parallel system
  • 50. Department of CSE- Data Science  The user is unaware of the parallelism since he/she has no access to a specific processor of the system.  Either the processors have access to a common memory or make use of message passing for communication. Figure : Parallel system.
  • 51. Department of CSE- Data Science Distributed database systems  Distributed database systems are known to be loosely coupled and are composed of individual machines.  Each of the machines can run its own application and serve its own respective users. The data is usually distributed across several machines, thereby necessitating that quite a number of machines be accessed to answer a user query. Figure : Distributed system.
  • 52. Department of CSE- Data Science Shared Nothing Architecture  Let us look at the three most common types of architecture for multiprocessor high transaction rate systems.  They are: 1. Shared Memory (SM) 2. Shared Disk (SD). 3. Shared Nothing (SN).  In shared memory architecture, a common central memory is shared by multiple processors.  In shared disk architecture, multiple processors share a common collection of disks while having their own private memory  In shared nothing architecture, neither memory nor disk is shared among multiple processors.
  • 53. Department of CSE- Data Science Advantages of a “Shared Nothing Architecture” 1. Fault Isolation: A “Shared Nothing Architecture” provides the benefit of fault isolation. A fault in a single node is contained and confined to that node exclusively, and is exposed only through messages (or the lack of them). 2. Scalability: Assume that the disk is a shared resource. It implies that the controller and the disk bandwidth are also shared. Synchronization will have to be implemented to maintain a consistent shared state. This means that different nodes will have to take turns to access the critical data. This imposes a limit on how many nodes can be added to the distributed shared-disk system, thus compromising scalability.
  • 54. Department of CSE- Data Science CAP Theorem Explained  The CAP theorem is also called Brewer's Theorem. It states that in a distributed computing environment (a collection of interconnected nodes that share data), it is impossible to provide all three of the following guarantees.  At best you can have two of the following three — one must be sacrificed. 1. Consistency 2. Availability 3. Partition tolerance Figure : Brewer's CAP.
  • 55. Department of CSE- Data Science  Consistency implies that every read fetches the last write.  Availability implies that reads and writes always succeed. In other words, each non-failing node will return a response in a reasonable amount of time.  Partition tolerance implies that the system will continue to function when network partition occurs.
  • 56. Department of CSE- Data Science NoSQL (NOT ONLY SQL)  The term NoSQL was first coined by Carlo Strozzi in 1998 to name his lightweight, open-source, relational database that did not expose the standard SQL interface.  A few features of NoSQL databases are as follows: 1. They are open source 2. They are non-relational 3. They are distributed 4. They are schema-less 5. They are cluster friendly 6. They are born out of 21st century web applications.
  • 57. Department of CSE- Data Science Where is it Used?  NoSQL databases are widely used in big data and other real-time web applications.  NoSQL databases are used to store log data which can then be pulled for analysis.  They are used to store social media data and all such data which cannot be stored and analyzed comfortably in an RDBMS. Figure : Where to use NoSQL?
  • 58. Department of CSE- Data Science What is it?  NoSQL stands for Not Only SQL. These are non-relational, open-source, distributed databases. They are hugely popular today owing to their ability to scale out or scale horizontally, and their adeptness at dealing with a rich variety of data: structured, semi-structured, and unstructured data. Figure: What is NoSQL?
  • 59. Department of CSE- Data Science 1. Are non-relational: They do not adhere to the relational data model. In fact, they are either key-value pairs or document-oriented or column-oriented or graph-based databases. 2. Are distributed: They are distributed, meaning the data is distributed across several nodes in a cluster constituted of low-cost commodity hardware. 3. Offer no support for ACID properties (Atomicity, Consistency, Isolation, and Durability): They do not offer support for the ACID properties of transactions. On the contrary, they adhere to Brewer's CAP (Consistency, Availability, and Partition tolerance) theorem and are often seen compromising on consistency in favor of availability and partition tolerance. 4. Provide no fixed table schema: NoSQL databases are becoming increasingly popular owing to their support for flexibility of the schema. They do not mandate that the data strictly adhere to any schema structure at the time of storage.
  • 60. Department of CSE- Data Science Types of NoSQL Databases 1. Key-value 2. Schema-less Key-value  It maintains a big hash table of keys and values.  For example, Dynamo, Redis, Riak, etc. Sample Key-Value Pair in Key-Value Database
  • 61. Department of CSE- Data Science Figure : Types of NoSQL databases
  • 62. Department of CSE- Data Science
  • 63. Department of CSE- Data Science Why NoSQL?
  • 64. Department of CSE- Data Science Advantages of NoSQL
  • 65. Department of CSE- Data Science
  • 66. Department of CSE- Data Science
  • 67. Department of CSE- Data Science Use of NoSQL in Industry
  • 68. Department of CSE- Data Science HADOOP  Hadoop is an open-source project of the Apache Software Foundation.  It is a framework written in Java, originally developed by Doug Cutting in 2005, who named it after his son's toy elephant. He was working with Yahoo at the time.  It was created to support distribution for “Nutch”, the text search engine. Hadoop uses Google's MapReduce and Google File System technologies as its foundation.  Hadoop is now a core part of the computing infrastructure for companies such as Yahoo, Facebook, LinkedIn, Twitter, etc.
  • 69. Department of CSE- Data Science Figure : Hadoop
  • 70. Department of CSE- Data Science Features of Hadoop
  • 71. Department of CSE- Data Science Key Advantages of Hadoop
  • 72. Department of CSE- Data Science Versions of Hadoop There are two versions of Hadoop available: 1. Hadoop 1.0 2. Hadoop 2.0
  • 73. Department of CSE- Data Science Overview of Hadoop Ecosystems There are components available in the Hadoop ecosystem for data ingestion, processing, and analysis. Data Ingestion → Data Processing → Data Analysis
  • 74. Department of CSE- Data Science Hadoop Distributions  The core aspects of Hadoop include the following: 1. Hadoop Common 2. Hadoop Distributed File System (HDFS) 3. Hadoop YARN (Yet Another Resource Negotiator) 4. Hadoop MapReduce