Department of CSE- Data Science
Module-1
Introduction to Big Data, Big Data Analytics
Contents
 Classification of data
 Characteristics
 Evolution and definition of Big data
 What is Big data
 Why Big data
 Traditional Business Intelligence Vs Big Data
 Typical data warehouse and Hadoop environment
 Big Data Analytics: What is Big data Analytics
 Classification of Analytics
 Importance of Big Data Analytics
 Technologies used in Big data Environments
 Few Top Analytical Tools, NoSQL, Hadoop.
Introduction
 Data is present internal to the enterprise and also exists outside the four walls and
firewalls of the enterprise.
 Data is present in homogeneous sources as well as in heterogeneous sources.
Data → Information
Information → Insights
Classification of Digital data
Structured data
 Data which is in an organized form (e.g., rows and columns) and can be
easily used by a computer program.
 Relationships exist between entities of data, such as classes and their
objects.
 Data stored in databases is an example of structured data.
Semi-structured data
 Data which does not conform to a data model but has some structure.
 It is not in a form which can be used easily by a computer program.
 For example, emails, XML, markup languages like HTML, etc.
Unstructured data
 Data which does not conform to a data model or is not in a form which can be
used easily by a computer program.
 About 80%–90% of an organization's data is in this format.
 For example, memos, chat rooms, PowerPoint presentations, images, videos,
letters, etc.
Structured Data
 Most of the structured data is held in RDBMS.
 An RDBMS conforms to the relational data model wherein the data is stored in
rows/columns.
 The number of rows/records/tuples in a relation is called the cardinality of a
relation and the number of columns is referred to as the degree of a relation.
 The first step is the design of a relation/table, the fields/columns to store the data,
and the type of data that will be stored [number (integer or real), alphabets, date,
Boolean, etc.].
 Next we think of the constraints that we would like our data to conform to
(constraints such as UNIQUE values in the column, NOT NULL values in the
column, a business constraint such as the value held in the column should not
drop below 50, the set of permissible values in the column such as the column
should accept only “CS”, “IS”, “MS”, etc., as input).
 Example: Let us design a table/relation structure to store the details of the
employees of an enterprise.
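As an aside, such column constraints can be sketched in a few lines of SQL. The snippet below uses Python's built-in sqlite3 module; the Employee columns and sample values are illustrative, not the exact relation designed in the text.

```python
import sqlite3

# In-memory database for illustration; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employee (
        EmpNo    INTEGER PRIMARY KEY,                          -- UNIQUE + NOT NULL
        EmpName  TEXT NOT NULL,                                -- NOT NULL constraint
        Salary   REAL CHECK (Salary >= 50),                    -- business constraint
        DeptCode TEXT CHECK (DeptCode IN ('CS', 'IS', 'MS'))   -- permissible values
    )
""")
conn.execute("INSERT INTO Employee VALUES (1, 'Asha', 75000.0, 'CS')")

# A row that violates a constraint is rejected by the database itself:
try:
    conn.execute("INSERT INTO Employee VALUES (2, 'Ravi', 80000.0, 'EE')")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)
```

Only the first row is stored; the second fails the CHECK on DeptCode before it ever reaches the table.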
 The tables in an RDBMS can also be related. For example, the above “Employee”
table is related to the “Department” table on the basis of the common column,
“DeptNo”.
Fig: Relationship between “Employee” and “Department” tables
Sources of Structured Data
 RDBMSs [Oracle, IBM DB2, Microsoft SQL Server, EMC Greenplum, Teradata,
MySQL (open source), PostgreSQL (advanced open source), etc.] are used to hold
transaction/operational data generated and collected by day-to-day business
activities.
 The data of the On-Line Transaction Processing (OLTP) systems are generally quite
structured.
Ease of Working with Structured Data
1. Insert/update/delete: The Data
Manipulation Language (DML) operations
provide the required ease with data
input, storage, access, process, analysis,
etc.
2. Security: Robust encryption and tokenization solutions are
available to ensure the security of information
throughout its lifecycle. Organizations are
able to retain control and maintain
compliance by ensuring that
only authorized individuals are able to
decrypt and view sensitive information.
3. Indexing: An index is a data structure that speeds up the data retrieval operations (primarily
the SELECT DML statement) at the cost of additional writes and storage space, but the
benefits that ensue in search operation are worth the additional writes and storage space.
4. Scalability: The storage and processing capabilities of the traditional RDBMS can be easily
scaled up by increasing the horsepower of the database server (increasing the primary and
secondary or peripheral storage capacity, processing capacity of the processor, etc.).
5. Transaction processing: RDBMS has support for Atomicity, Consistency, Isolation, and
Durability (ACID) properties of transaction. Given next is a quick explanation of the ACID
properties:
 Atomicity: A transaction is atomic, meaning that either it happens in its entirety or
not at all.
 Consistency: The database moves from one consistent state to another consistent state. In
other words, if the same piece of information is stored at two or more places, they are in
complete agreement.
 Isolation: The resource allocation to the transaction happens such that the transaction gets
the impression that it is the only transaction happening in isolation.
 Durability: All changes made to the database during a transaction are permanent and that
accounts for the durability of the transaction.
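Atomicity is the easiest of the four to demonstrate: a transfer between two accounts either happens in its entirety or not at all. A minimal sketch using Python's sqlite3; the Account table and the simulated failure are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account (Name TEXT PRIMARY KEY, Balance REAL)")
conn.executemany("INSERT INTO Account VALUES (?, ?)",
                 [("A", 100.0), ("B", 100.0)])
conn.commit()

# Transfer 50 from A to B, but fail midway through the transaction.
try:
    with conn:  # one transaction: commit on success, rollback on error
        conn.execute("UPDATE Account SET Balance = Balance - 50 WHERE Name = 'A'")
        raise RuntimeError("simulated crash mid-transfer")
except RuntimeError:
    pass

# The partial debit was rolled back, so both balances are unchanged.
print(conn.execute("SELECT Balance FROM Account ORDER BY Name").fetchall())
# [(100.0,), (100.0,)]
```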
Semi-structured Data
 Semi-structured data is also referred to as self-describing structure.
 Features
1. It does not conform to the data models that one typically associates with relational
databases or any other form of data tables.
2. It uses tags to segregate semantic elements.
3. Tags are also used to enforce hierarchies of records and fields within data.
4. There is no separation between the data and the schema. The amount of structure used is
dictated by the purpose at hand.
5. In semi-structured data, entities belonging to the same class and grouped together
need not have the same set of attributes. Even if they do have the same set of
attributes, the order of the attributes may differ; for all practical purposes, the
order is not important.
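Feature 5 is easy to see with JSON, a common semi-structured format. The two customer records below (invented for illustration) belong to the same class, yet carry different attributes in different orders:

```python
import json

# Self-describing records: each key acts as a tag travelling with the data.
records = [
    {"name": "Asha", "email": "asha@example.com", "phone": "98450-00000"},
    {"email": "ravi@example.com", "name": "Ravi", "loyalty_tier": "gold"},
]

doc = json.dumps(records)   # schema and data are serialized together
parsed = json.loads(doc)

# Consuming code tolerates missing attributes instead of assuming a fixed schema.
for r in parsed:
    print(r["name"], r.get("phone", "no phone on record"))
```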
Characteristics of semi-structured data
Sources of Semi-Structured Data
Unstructured Data
Sources of Unstructured Data
Issues with Unstructured Data
Dealing with Unstructured Data
Properties | Structured data | Semi-structured data | Unstructured data
Technology | Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data
Transaction management | Matured transactions and various concurrency techniques | Transactions adapted from the DBMS; not matured | No transaction management and no concurrency
Version management | Versioning over tuples, rows, and tables | Versioning over tuples or graphs is possible | Versioned as a whole
Flexibility | Schema dependent and less flexible | More flexible than structured data but less flexible than unstructured data | More flexible; absence of schema
Scalability | Scaling the DB schema is very difficult | Scaling is simpler than for structured data | More scalable
Robustness | Very robust | New technology; not very widespread | -
Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible
Characteristics of Data
Data has three characteristics:
1. Composition: deals with the structure of the data, that is, the sources of the data,
the granularity, the types, and the nature of the data as to whether it is static or
real-time streaming.
2. Condition: The condition of data deals with the state of the data that is “can one
use this data as is for analysis?” or “Does it require cleansing for further
enhancement and enrichment?”
3. Context: deals with “Where has this data been generated?”, “Why was this data
generated?” and so on.
EVOLUTION OF BIG DATA
Definition of Big Data
Challenges With Big Data
1. Data today is growing at an exponential rate. Most of the data that we have
today has been generated in the last 2-3 years. This high tide of data will
continue to rise incessantly. The key questions here are: “Will all this data be
useful for analysis?”, “Do we work with all this data or a subset of it?”, “How will
we separate the knowledge from the noise?”, etc. Cloud computing and
virtualization are here to stay.
2. Cloud computing is the answer to managing infrastructure for big data as far as
cost-efficiency, elasticity, and easy upgrading/downgrading is concerned. This
further complicates the decision to host big data solutions outside the enterprise.
3. The other challenge is to decide on the period of retention of big data. Just how
long should one retain this data? A tricky question indeed as some data is useful
for making long-term decisions, whereas in a few cases, the data may quickly
become irrelevant and obsolete just a few hours after having been generated.
4. There is a dearth of skilled professionals who possess a high level of proficiency in
data sciences that is vital in implementing big data solutions.
5. Then, of course, there are other challenges with respect to capture, storage,
preparation, search, analysis, transfer, security, and visualization of big data. Big data
refers to datasets whose size is typically beyond the storage capacity of traditional
database software tools. There is no explicit definition of how big the dataset should
be for it to be considered “big data.” Here we are to deal with data that is just too big,
moves way too fast, and does not fit the structures of typical database systems. The
data changes are highly dynamic and therefore there is a need to ingest it as quickly
as possible.
6. Data visualization is becoming popular as a separate discipline, and there is a
shortage of business visualization experts.
WHAT IS BIG DATA?
 Big data is data that is big in volume, velocity, and variety.
Volume
1. Typical internal sources:
• Data Storage- File systems, SQL, NoSQL (MongoDB, Cassandra).
• Archives – Archives of scanned documents, paper archives, customer
records, patient health records, etc.
2. External data sources:
• public web - Wikipedia, weather, regulatory, census etc.
3. Both (internal + external)
• Sensor data – Car sensors, smart electric meters, office buildings, etc.
• Machine log data – Event logs, application logs, business process logs, audit
logs, etc.
• Social media – Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
• Business apps – ERP, CRM, HR, Google Docs, and so on.
• Media – Audio, video, image, podcast, etc.
• Docs – CSV, Word documents, PDF, XLS, PPT, and so on.
A Mountain of Data
Sources of Big Data
Velocity
Batch → Periodic → Near real-time → Real-time processing
Variety
 Variety deals with a wide range of data types and sources of data.
1. Structured data: From traditional transaction processing systems and RDBMS, etc.
2. Semi-structured data: For example, Hyper Text Markup Language (HTML),
eXtensible Markup Language (XML).
3. Unstructured data: For example, unstructured text documents, audios, videos,
emails, photos, PDFs, social media, etc.
Why Big Data?
Traditional Business Intelligence (Bi) Versus Big Data
1. In a traditional BI environment, all the enterprise’s data is housed in a central
server, whereas in a big data environment data resides in a distributed file
system. The distributed file system scales horizontally (by adding nodes),
whereas a typical database server scales vertically.
2. In traditional BI, data is generally analyzed in an offline mode whereas in big
data, it is analyzed in both real time as well as in offline mode.
3. Traditional BI is about structured data and it is here that data is taken to
processing functions whereas big data is about variety and here the
processing functions are taken to the data.
A Typical Data Warehouse Environment
A Typical Hadoop Environment
WHAT IS BIG DATA ANALYTICS?
1. Technology-enabled analytics: Quite a few data analytics and visualization tools
are available in the market today from leading vendors such as IBM, Tableau,
SAS, R Analytics, Statistica, World Programming Systems (WPS), etc. to help
process and analyze your big data.
2. About gaining a meaningful, deeper, and richer insight into your business to
steer in the right direction, understanding the customer’s demographics to
cross-sell and up-sell to them, better leveraging the services of your vendors
and suppliers, etc.
Author’s experience: The other day I was pleasantly surprised to get a few
recommendations via email from one of my frequently visited online
retailers. They had recommended a clothing line from my favorite brand, and
the suggested color was also to my liking. How did they arrive at this? In
the recent past, I had been buying clothing of a particular brand, and my
color preference was pastel shades. They had this stored in their database and
pulled it out while making recommendations to me.
3. About a competitive edge over your competitors by enabling you with findings that allow
quicker and better decision-making.
4. A tight handshake between three communities: IT, business users, and data scientists.
5. Working with datasets whose volume and variety exceed the current storage and
processing capabilities and infrastructure of your enterprise.
6. About moving code to data. This makes perfect sense as the program for distributed
processing is tiny (just a few KBs) compared to the data (Terabytes or Petabytes today and
likely to be Exabytes or Zettabytes in the near future).
Classification Of Analytics
 There are basically two schools of thought:
1. Those that classify analytics into basic, operationalized, advanced, and
monetized.
2. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.
First School of Thought
1. Basic analytics: This primarily is slicing and dicing of data to help with basic
business insights. This is about reporting on historical data, basic visualization, etc.
2. Operationalized analytics: It is operationalized analytics if it gets woven into the
enterprise’s business processes.
3. Advanced analytics: This largely is about forecasting for the future by way of
predictive and prescriptive modeling.
4. Monetized analytics: This is analytics in use to derive direct business revenue.
Second School of Thought
• Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0.
Table : Analytics 1.0, 2.0, and 3.0
Figure : Analytics 1.0, 2.0, and 3.0.
Importance of Big Data Analytics
Let us study the various approaches to the analysis of data and what they lead to.
1. Reactive — Business Intelligence: What does Business Intelligence (BI) help us
with? It allows the businesses to make faster and better decisions by providing
the right information to the right person at the right time in the right format. It is
about analysis of the past or historical data and then displaying the findings of the
analysis or reports in the form of enterprise dashboards, alerts, notifications, etc.
It has support for both pre-specified reports as well as ad hoc querying.
2. Reactive — Big Data Analytics: Here the analysis is done on huge datasets but the
approach is still reactive as it is still based on static data.
3. Proactive — Analytics: This is to support future decision making by the use of
data mining, predictive modeling, text mining, and statistical analysis. This
analysis is not performed on big data, as it still uses the traditional database
management practices and therefore has severe limitations on storage capacity
and processing capability.
4. Proactive - Big Data Analytics: This is sieving through terabytes, petabytes,
exabytes of information to filter out the relevant data to analyze. This also
includes high performance analytics to gain rapid insights from big data and the
ability to solve complex problems using more data.
Terminologies used in Big data Environments
In-Memory Analytics
 Data access from non-volatile storage such as hard disk is a slow process. The
more the data is required to be fetched from hard disk or secondary storage, the
slower the process gets. One way to combat this challenge is to pre-process and
store data (cubes, aggregate tables, query sets, etc.) so that the CPU has to fetch
a small subset of records. But this requires thinking in advance as to what data
will be required for analysis.
 If there is a need for different or more data, it is back to the initial process of
pre-computing and storing data or fetching it from secondary storage. This
problem has been addressed using in-memory analytics. Here all the relevant
data is stored in Random Access Memory (RAM) or primary storage thus
eliminating the need to access the data from hard disk. The advantage is faster
access, rapid deployment, better insights, and minimal IT involvement.
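The trade-off can be sketched as follows: a precomputed aggregate table answers only the questions anticipated in advance, while data held in RAM can answer new questions directly. The dataset below is synthetic and purely illustrative.

```python
# Synthetic "sales" data held entirely in primary storage (RAM).
rows = [{"region": region, "amount": amount}
        for region, amount in zip(["North", "South"] * 50_000, range(100_000))]

# Pre-computed aggregate table: fast, but fixed at design time.
precomputed = {}
for row in rows:
    precomputed[row["region"]] = precomputed.get(row["region"], 0) + row["amount"]

# In-memory analytics: an unanticipated query runs directly against the raw
# rows in RAM, with no round trip to secondary storage or re-computation step.
north_total = sum(r["amount"] for r in rows if r["region"] == "North")

assert north_total == precomputed["North"]
```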
In-Database Processing
 In-database processing is also called in-database analytics. It works by fusing
data warehouses with analytical systems.
 Typically the data from various enterprise On Line Transaction Processing (OLTP)
systems after cleaning up (de-duplication, scrubbing, etc.) through the process of
ETL is stored in the Enterprise Data Warehouse (EDW) or data marts.
 The huge datasets are then exported to analytical programs for complex and
extensive computations.
 With in-database processing, the database program itself can run the
computations eliminating the need for export and thereby saving on time.
Leading database vendors are offering this feature to large businesses.
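The saving can be sketched with sqlite3, whose engine can run aggregations itself. The Sales table is hypothetical; the point is that the grouped query returns two summary rows instead of exporting every record for external computation.

```python
import math
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Region TEXT, Amount REAL)")
random.seed(7)
conn.executemany("INSERT INTO Sales VALUES (?, ?)",
                 [(random.choice(["North", "South"]), random.uniform(10, 100))
                  for _ in range(10_000)])

# Export-then-compute: every row crosses the database boundary and the
# analytical program does the aggregation.
totals = {}
for region, amount in conn.execute("SELECT Region, Amount FROM Sales"):
    totals[region] = totals.get(region, 0.0) + amount

# In-database processing: the engine computes the result; only two rows return.
in_db = dict(conn.execute("SELECT Region, SUM(Amount) FROM Sales GROUP BY Region"))

assert all(math.isclose(in_db[r], totals[r]) for r in in_db)
```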
Symmetric Multiprocessor System (SMP)
• In SMP there is a single common main memory that is shared by two or more
identical processors.
• The processors have full access to all I/O devices and are controlled by a single
operating system instance.
• SMPs are tightly coupled multiprocessor systems. Each processor has its own
high-speed cache memory, and the processors are connected using a system bus.
Figure : Symmetric Multiprocessor System.
Massively Parallel Processing
 Massively Parallel Processing (MPP) refers to the coordinated processing of
programs by a number of processors working in parallel.
 The processors, each have their own operating systems and dedicated memory.
They work on different parts of the same program.
 The MPP processors communicate using some sort of messaging interface. The
MPP systems are more difficult to program as the application must be divided in
such a way that all the executing segments can communicate with each other.
 MPP is different from Symmetrically Multiprocessing (SMP) in that SMP works
with the processors sharing the same operating system and same memory. SMP is
also referred to as tightly-coupled multiprocessing.
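The idea of dividing one program across processors that each have their own memory can be sketched with Python's multiprocessing module: each worker process sums its own slice of the data in its own address space, and the partial results come back over an inter-process message channel. This is only an illustrative analogy, not a real MPP system.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Runs in a separate process with its own dedicated memory.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1, 1001))
    # The application must be divided so the segments can work independently.
    chunks = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, chunks)  # results arrive via messaging
    print(sum(partials))  # 500500
```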
Difference Between Parallel and Distributed Systems
Parallel Systems
 A parallel database system is a tightly coupled system. The processors co-operate
for query processing.
Figure : Parallel system.
 The user is unaware of the parallelism since he/she has no access to a specific
processor of the system.
 Either the processors have access to a common memory or make use of message
passing for communication.
Figure : Parallel system.
Distributed database systems
 Distributed database systems are known to be loosely coupled and are
composed of individual machines.
 Each of the machines can run their individual application and serve their own
respective user. The data is usually distributed across several machines,
thereby necessitating quite a number of machines to be accessed to answer a
user query.
Figure : Distributed system.
Shared Nothing Architecture
 Let us look at the three most common types of architecture for multiprocessor
high transaction rate systems.
 They are:
1. Shared Memory (SM)
2. Shared Disk (SD)
3. Shared Nothing (SN)
 In shared memory architecture, a common central memory is shared by multiple
processors.
 In shared disk architecture, multiple processors share a common collection of
disks while having their own private memory
 In shared nothing architecture, neither memory nor disk is shared among
multiple processors.
Advantages of a “Shared Nothing Architecture”
1. Fault Isolation: A “Shared Nothing Architecture” provides the benefit of fault
isolation. A fault in a single node is contained and confined to that node exclusively
and exposed only through messages (or the lack of them).
2. Scalability: Assume that the disk is a shared resource. It implies that the controller
and the disk bandwidth are also shared. Synchronization will have to be
implemented to maintain a consistent shared state. This would mean that different
nodes will have to take turns to access the critical data. This imposes a limit on
how many nodes can be added to the distributed shared disk system, thus
compromising on scalability.
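A minimal sketch of the shared-nothing idea: keys are routed to nodes by a deterministic hash, each node owns its own private storage, and no memory or disk is shared. The four-node "cluster" and the node_for routine below are hypothetical.

```python
import hashlib

# Each "node" owns its own private store; nothing is shared between them.
NODES = [dict() for _ in range(4)]

def node_for(key: str) -> int:
    # Deterministic hash routing: no shared state or synchronization needed,
    # so adding nodes does not create contention on a common disk.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)

def put(key, value):
    NODES[node_for(key)][key] = value

def get(key):
    return NODES[node_for(key)].get(key)

put("user:42", {"name": "Asha"})
put("user:99", {"name": "Ravi"})
print(get("user:42"))  # {'name': 'Asha'}
```

Fault isolation follows directly: if one node fails, only the keys hashed to it are affected.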
CAP Theorem Explained
 The CAP theorem is also called Brewer’s Theorem. It states that in a
distributed computing environment (a collection of interconnected nodes that
share data), it is impossible to provide all three of the following guarantees
simultaneously.
 At best you can have two of the following three; one must be sacrificed.
1. Consistency
2. Availability
3. Partition tolerance
Figure : Brewer's CAP.
 Consistency implies that every read fetches the last write.
 Availability implies that reads and writes always succeed. In other words, each
non-failing node will return a response in a reasonable amount of time.
 Partition tolerance implies that the system will continue to function when network
partition occurs.
NoSQL (NOT ONLY SQL)
 The term NoSQL was first coined by Carlo Strozzi in 1998 to name his
lightweight, open-source, relational database that did not expose the
standard SQL interface.
 A few features of NoSQL databases are as follows:
1. They are open source.
2. They are non-relational.
3. They are distributed.
4. They are schema-less.
5. They are cluster friendly.
6. They are born out of 21st century web applications.
Where is it Used?
 NoSQL databases are widely used in big data and other real-time web
applications.
 NoSQL databases are used to stock log data, which can then be pulled for analysis.
 They are used to store social media data and all such data which cannot be stored and
analyzed comfortably in RDBMS.
Figure : Where to use NoSQL?
What is it?
 NoSQL stands for Not Only SQL. These are non-relational, open-source, distributed
databases. They are hugely popular today owing to their ability to scale out
(scale horizontally) and their adeptness at dealing with a rich variety of data:
structured, semi-structured, and unstructured data.
Figure: What is NoSQL?
1. Are non-relational: They do not adhere to the relational data model. In fact, they are
either key-value pairs or document-oriented or column-oriented or graph-based
databases.
2. Are distributed: They are distributed meaning the data is distributed across
several nodes in a cluster constituted of low-cost commodity hardware.
3. Offer no support for ACID properties (Atomicity, Consistency, Isolation, and
Durability): They do not offer support for ACID properties of transactions. On the
contrary, they have adherence to Brewer’s CAP (Consistency, Availability, and
Partition tolerance) theorem and are often seen compromising on consistency in
favor of availability and partition tolerance.
4. Provide no fixed table schema: NoSQL databases are becoming increasingly
popular owing to their support for flexibility of schema. They do not mandate
that the data strictly adhere to any schema structure at the time of storage.
Types of NoSQL Databases
1. Key-value
2. Schema-less
Key-value
 It maintains a big hash table of keys and values.
 For example, Dynamo, Redis, Riak, etc.
Figure : Sample Key-Value Pair in Key-Value Database
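At its core the key-value model is just one big hash table, which a Python dict can mimic. The keys and values below are invented; real stores such as Dynamo, Redis, or Riak expose similar put/get semantics over a distributed cluster.

```python
# One big hash table: keys are opaque strings, values can be any blob,
# and differently shaped values can coexist (no fixed table schema).
store = {}

store["session:1001"] = "2025-01-15T10:04:00Z"            # a plain string
store["cart:1001"] = ["SKU-17", "SKU-98"]                 # a list
store["user:1001"] = {"name": "Asha", "tier": "gold"}     # a document

# Retrieval is by key only; there are no joins or ad hoc queries over values.
print(store["cart:1001"])      # ['SKU-17', 'SKU-98']
print(store.get("user:9999"))  # None: a missing key simply has no value
```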
Figure : Types of NoSQL databases
Why NoSQL?
Advantages of NoSQL
Use of NoSQL in Industry
HADOOP
 Hadoop is an open-source project of the Apache foundation.
 It is a framework written in Java, originally developed by Doug Cutting in 2005,
who named it after his son's toy elephant. He was working with Yahoo at the time.
 It was created to support distribution for “Nutch”, the text search engine.
Hadoop uses Google’s MapReduce and Google File System technologies as its
foundation.
 Hadoop is now a core part of the computing infrastructure for companies such as
Yahoo, Facebook, LinkedIn, Twitter, etc.
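The MapReduce model mentioned above can be sketched in a few lines: map emits (key, value) pairs, a shuffle step groups the pairs by key, and reduce folds each group to a result. The snippet below is a single-process word-count simulation of that flow, not Hadoop's actual Java API.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    # Group intermediate pairs by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Fold each group of values into a single count per word.
    return key, sum(values)

lines = ["big data is big", "hadoop processes big data"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"], counts["data"])  # 3 2
```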
Figure : Hadoop
Features of Hadoop
Key Advantages of Hadoop
Versions of Hadoop
There are two versions of Hadoop available:
1. Hadoop 1.0
2. Hadoop 2.0
Overview of Hadoop Ecosystems
There are components available in the Hadoop ecosystem for data ingestion, processing, and
analysis.
Data Ingestion → Data Processing → Data Analysis
Hadoop Distributions
 The core aspects of Hadoop include the
following:
1. Hadoop Common
2. Hadoop Distributed File System (HDFS)
3. Hadoop YARN (Yet Another Resource
Negotiator)
4. Hadoop MapReduce
THEORY AND PRACTICE ASSIGNMENT SEMESTER MAY 2025.docx
GIÁO ÁN TIẾNG ANH 7 GLOBAL SUCCESS (CẢ NĂM) THEO CÔNG VĂN 5512 (2 CỘT) NĂM HỌ...
MMW-CHAPTER-1-final.pptx major Elementary Education
EDUCATIONAL ASSESSMENT ASSIGNMENT SEMESTER MAY 2025.docx
hemostasis and its significance, physiology
Health aspects of bilberry: A review on its general benefits
CHALLENGES FACED BY TEACHERS WHEN TEACHING LEARNERS WITH DEVELOPMENTAL DISABI...
Thinking Routines and Learning Engagements.pptx
operating_systems_presentations_delhi_nc
CHROMIUM & Glucose Tolerance Factor.pptx
Math 2 Quarter 2 Week 1 Matatag Curriculum
Horaris_Grups_25-26_Definitiu_15_07_25.pdf

Big data Analytics(BAD601) -module-1 ppt

  • 1. Department of CSE- Data Science Module-1 Introduction to Big Data, Big Data Analytics
  • 2. Department of CSE- Data Science Contents  Classification of data  Characteristics  Evolution and definition of Big data  What is Big data  Why Big data  Traditional Business Intelligence Vs Big Data  Typical data warehouse and Hadoop environment  Big Data Analytics: What is Big data Analytics  Classification of Analytics  Importance of Big Data Analytics  Technologies used in Big data Environments  Few Top Analytical Tools, NoSQL, Hadoop.
  • 3. Department of CSE- Data Science Introduction  Data is present internal to the enterprise and also exists outside the four walls and firewalls of the enterprise.  Data is present in homogeneous sources as well as in heterogeneous sources. Data → Information Information → Insights
  • 4. Department of CSE- Data Science Classification of Digital data
  • 5. Department of CSE- Data Science Structured data  Data which is in an organized form (e.g., rows and columns) and can be easily used by a computer program.  Relationships exist between entities of data, such as classes and their objects.  Data stored in databases is an example of structured data.
  • 6. Department of CSE- Data Science Semi-structured data  Data which does not conform to a data model but has some structure.  It is not in a form which can be used easily by a computer program.  For example: emails, XML, markup languages like HTML, etc.
  • 7. Department of CSE- Data Science Unstructured data  Data which does not conform to a data model or is not in a form which can be used easily by a computer program.  About 80%-90% of an organization's data is in this format.  For example: memos, chat rooms, PowerPoint presentations, images, videos, letters, etc.
  • 8. Department of CSE- Data Science Structured Data  Most of the structured data is held in RDBMS.  An RDBMS conforms to the relational data model wherein the data is stored in rows/columns.  The number of rows/records/tuples in a relation is called the cardinality of the relation, and the number of columns is referred to as the degree of the relation.  The first step is the design of a relation/table: the fields/columns to store the data and the type of data that will be stored [number (integer or real), alphabets, date, Boolean, etc.].
  • 9. Department of CSE- Data Science  Next we think of the constraints that we would like our data to conform to (constraints such as UNIQUE values in the column, NOT NULL values in the column, a business constraint such as the value held in the column should not drop below 50, the set of permissible values in the column such as the column should accept only “CS”, “IS”, “MS”, etc., as input).  Example: Let us design a table/relation structure to store the details of the employees of an enterprise.
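The constraints described above can be sketched in code. The following is a minimal illustration using SQLite from Python's standard library; the table and column names are hypothetical, and the CHECK rules mirror the examples in the slide (a set of permissible department codes and a lower bound on a numeric column).

```python
import sqlite3

# Hypothetical "employee" table illustrating the constraints named above:
# NOT NULL, a business rule (salary >= 50), and a set of permissible
# values for the department code ("CS", "IS", "MS").
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        emp_no INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        dept   TEXT CHECK (dept IN ('CS', 'IS', 'MS')),
        salary REAL CHECK (salary >= 50)
    )
""")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 'CS', 1200.0)")

# A row violating a CHECK constraint is rejected by the database itself,
# not by application code.
try:
    conn.execute("INSERT INTO employee VALUES (2, 'Ravi', 'EE', 900.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The point is that the RDBMS enforces the rules declaratively: every insert and update is checked against the schema, so invalid data never enters the table.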
  • 10. Department of CSE- Data Science
  • 11. Department of CSE- Data Science  The tables in an RDBMS can also be related. For example, the above “Employee” table is related to the “Department” table on the basis of the common column, “DeptNo”. Fig: Relationship between “Employee” and “Department” tables
  • 12. Department of CSE- Data Science Sources of Structured Data  RDBMSs such as Oracle, IBM DB2, Microsoft SQL Server, EMC Greenplum, Teradata, MySQL (open source), PostgreSQL (advanced open source), etc., are used to hold transaction/operational data generated and collected by day-to-day business activities.  The data of On-Line Transaction Processing (OLTP) systems is generally quite structured.
  • 13. Department of CSE- Data Science Ease of Working with Structured Data 1. Insert/update/delete: The Data Manipulation Language (DML) operations provide the required ease with data input, storage, access, processing, analysis, etc. 2. Security: Robust encryption and tokenization solutions are available to warrant the security of information throughout its lifecycle. Organizations are able to retain control and maintain compliance adherence by ensuring that only authorized individuals are able to decrypt and view sensitive information.
  • 14. Department of CSE- Data Science 3. Indexing: An index is a data structure that speeds up data retrieval operations (primarily the SELECT DML statement) at the cost of additional writes and storage space, but the benefits that ensue in search operations are worth the additional writes and storage space. 4. Scalability: The storage and processing capabilities of a traditional RDBMS can be scaled up by increasing the horsepower of the database server (increasing the primary and secondary or peripheral storage capacity, the processing capacity of the processor, etc.). 5. Transaction processing: RDBMS has support for the Atomicity, Consistency, Isolation, and Durability (ACID) properties of transactions. Given next is a quick explanation of the ACID properties:  Atomicity: A transaction is atomic, meaning that either it happens in its entirety or not at all.  Consistency: The database moves from one consistent state to another consistent state. In other words, if the same piece of information is stored at two or more places, they are in complete agreement.  Isolation: Resource allocation to the transaction happens such that the transaction gets the impression that it is the only transaction happening.  Durability: All changes made to the database during a transaction are permanent, which accounts for the durability of the transaction.
  • 15. Department of CSE- Data Science Semi-structured Data  Semi-structured data is also referred to as self-describing structure.  Features 1. It does not conform to the data models that one typically associates with relational databases or any other form of data tables. 2. It uses tags to segregate semantic elements. 3. Tags are also used to enforce hierarchies of records and fields within the data. 4. There is no separation between the data and the schema. The amount of structure used is dictated by the purpose at hand. 5. In semi-structured data, entities belonging to the same class and grouped together need not necessarily have the same set of attributes. Even if they do have the same set of attributes, the order of the attributes may not be the same, and for all practical purposes that is not important.
  • 16. Department of CSE- Data Science Characteristics of semi-structured data
  • 17. Department of CSE- Data Science Sources of Semi-Structured Data
  • 18. Department of CSE- Data Science Unstructured Data Sources of Unstructured Data
  • 19. Department of CSE- Data Science Issues with Unstructured Data
  • 20. Department of CSE- Data Science Dealing with Unstructured Data
  • 21. Department of CSE- Data Science Comparison of structured, semi-structured, and unstructured data:
Technology: Structured - based on relational database tables; Semi-structured - based on XML/RDF (Resource Description Framework); Unstructured - based on character and binary data.
Transaction management: Structured - matured transactions and various concurrency techniques; Semi-structured - transactions adapted from DBMS, not matured; Unstructured - no transaction management and no concurrency.
Version management: Structured - versioning over tuples, rows, tables; Semi-structured - versioning over tuples or graphs is possible; Unstructured - versioned as a whole.
Flexibility: Structured - schema-dependent and less flexible; Semi-structured - more flexible than structured data but less flexible than unstructured data; Unstructured - most flexible, absence of schema.
Scalability: Structured - scaling the DB schema is very difficult; Semi-structured - scaling is simpler than for structured data; Unstructured - most scalable.
Robustness: Structured - very robust; Semi-structured - newer technology, not widely spread; Unstructured - not specified.
Query performance: Structured - structured queries allow complex joins; Semi-structured - queries over anonymous nodes are possible; Unstructured - only textual queries are possible.
  • 22. Department of CSE- Data Science Characteristics of Data Data has three characteristics: 1. Composition: deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of the data as to whether it is static or real-time streaming. 2. Condition: deals with the state of the data, that is, “Can one use this data as is for analysis?” or “Does it require cleansing for further enhancement and enrichment?” 3. Context: deals with “Where has this data been generated?”, “Why was this data generated?”, and so on.
  • 23. Department of CSE- Data Science EVOLUTION OF BIG DATA
  • 24. Department of CSE- Data Science Definition of Big Data
  • 25. Department of CSE- Data Science Challenges With Big Data
  • 26. Department of CSE- Data Science 1. Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last 2-3 years. This high tide of data will continue to rise incessantly. The key questions here are: “Will all this data be useful for analysis?”, “Do we work with all this data or a subset of it?”, “How will we separate the knowledge from the noise?”, etc. 2. Cloud computing and virtualization are here to stay. Cloud computing is the answer to managing infrastructure for big data as far as cost-efficiency, elasticity, and easy upgrading/downgrading are concerned. This further complicates the decision to host big data solutions outside the enterprise. 3. The other challenge is to decide on the period of retention of big data. Just how long should one retain this data? A tricky question indeed, as some data is useful for making long-term decisions, whereas in a few cases the data may become irrelevant and obsolete just a few hours after having been generated.
  • 27. Department of CSE- Data Science 4. There is a dearth of skilled professionals who possess the high level of proficiency in data sciences that is vital in implementing big data solutions. 5. Then, of course, there are other challenges with respect to capture, storage, preparation, search, analysis, transfer, security, and visualization of big data. Big data refers to datasets whose size is typically beyond the storage capacity of traditional database software tools. There is no explicit definition of how big a dataset should be for it to be considered “big data.” Here we are to deal with data that is just too big, moves way too fast, and does not fit the structures of typical database systems. The data changes are highly dynamic and therefore there is a need to ingest it as quickly as possible. 6. Data visualization is becoming popular as a separate discipline. We are short by quite a number as far as business visualization experts are concerned.
  • 28. Department of CSE- Data Science WHAT IS BIG DATA?  Big data is data that is big in volume, velocity, and variety. Volume 1. Typical internal sources: • Data storage – file systems, SQL, NoSQL (MongoDB, Cassandra). • Archives – archives of scanned documents, paper archives, customer records, patient health records, etc. 2. External data sources: • Public web – Wikipedia, weather, regulatory, census, etc.
  • 29. Department of CSE- Data Science 3. Both (internal + external) • Sensor data – car sensors, smart electric meters, office buildings, etc. • Machine log data – event logs, application logs, business process logs, audit logs, etc. • Social media – Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc. • Business apps – ERP, CRM, HR, Google Docs, and so on. • Media – audio, video, images, podcasts, etc. • Docs – CSV, Word documents, PDF, XLS, PPT, and so on.
  • 30. Department of CSE- Data Science A Mountain of Data
  • 31. Department of CSE- Data Science Sources of Big Data
  • 32. Department of CSE- Data Science Velocity  Batch  Periodic  Near real-time  Real-time processing Variety  Variety deals with a wide range of data types and sources of data. 1. Structured data: From traditional transaction processing systems, RDBMS, etc. 2. Semi-structured data: For example, Hyper Text Markup Language (HTML), eXtensible Markup Language (XML). 3. Unstructured data: For example, unstructured text documents, audios, videos, emails, photos, PDFs, social media, etc.
  • 33. Department of CSE- Data Science Why Big Data?
  • 34. Department of CSE- Data Science Traditional Business Intelligence (BI) Versus Big Data 1. In a traditional BI environment, all of the enterprise's data is housed in a central server, whereas in a big data environment data resides in a distributed file system. The distributed file system scales by scaling out horizontally, whereas a typical database server scales vertically. 2. In traditional BI, data is generally analyzed in an offline mode, whereas in big data it is analyzed in both real-time and offline modes. 3. Traditional BI is about structured data, and it is here that data is taken to the processing functions, whereas big data is about variety, and here the processing functions are taken to the data.
  • 35. Department of CSE- Data Science A Typical Data Warehouse Environment
  • 36. Department of CSE- Data Science A Typical Hadoop Environment
  • 37. Department of CSE- Data Science WHAT IS BIG DATA ANALYTICS? 1. Technology-enabled analytics: Quite a few data analytics and visualization tools are available in the market today from leading vendors such as IBM, Tableau, SAS, R Analytics, Statistica, World Programming Systems (WPS), etc., to help process and analyze your big data. 2. About gaining a meaningful, deeper, and richer insight into your business to steer it in the right direction: understanding the customer's demographics to cross-sell and up-sell to them, better leveraging the services of your vendors and suppliers, etc. Author's experience: The other day I was pleasantly surprised to get a few recommendations via email from one of my frequently visited online retailers. They had recommended a clothing line from my favorite brand, and the color suggested was also to my liking. How did they arrive at this? In the recent past, I had been buying a particular brand of clothing, and my color preference was pastel shades. They had it stored in their database and pulled it out while making recommendations to me.
  • 38. Department of CSE- Data Science 3. About a competitive edge over your competitors by enabling you with findings that allow quicker and better decision-making. 4. A tight handshake between three communities: IT, business users, and data scientists. 5. Working with datasets whose volume and variety exceed the current storage and processing capabilities and infrastructure of your enterprise. 6. About moving code to data. This makes perfect sense as the program for distributed processing is tiny (just a few KBs) compared to the data (Terabytes or Petabytes today and likely to be Exabytes or Zettabytes in the near future).
  • 39. Department of CSE- Data Science Classification Of Analytics  There are basically two schools of thought: 1. Those that classify analytics into basic, operationalized, advanced, and monetized. 2. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0. First School of Thought 1. Basic analytics: This primarily is slicing and dicing of data to help with basic business insights. It is about reporting on historical data, basic visualization, etc. 2. Operationalized analytics: It is operationalized analytics if it gets woven into the enterprise's business processes. 3. Advanced analytics: This largely is about forecasting for the future by way of predictive and prescriptive modeling. 4. Monetized analytics: This is analytics in use to derive direct business revenue.
  • 40. Department of CSE- Data Science Second School of Thought • Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0. Table : Analytics 1.0, 2.0, and 3.0
  • 41. Department of CSE- Data Science
  • 42. Department of CSE- Data Science Figure : Analytics 1.0, 2.0, and 3.0.
  • 43. Department of CSE- Data Science Importance of Big Data Analytics Let us study the various approaches to the analysis of data and what they lead to. 1. Reactive — Business Intelligence: What does Business Intelligence (BI) help us with? It allows businesses to make faster and better decisions by providing the right information to the right person at the right time in the right format. It is about analysis of past or historical data and then displaying the findings of the analysis or reports in the form of enterprise dashboards, alerts, notifications, etc. It has support for both pre-specified reports as well as ad hoc querying. 2. Reactive — Big Data Analytics: Here the analysis is done on huge datasets, but the approach is still reactive as it is still based on static data.
  • 44. Department of CSE- Data Science 3. Proactive — Analytics: This is to support futuristic decision making by the use of data mining, predictive modeling, text mining, and statistical analysis. This analysis is not on big data, as it still uses traditional database management practices and therefore has severe limitations on storage capacity and processing capability. 4. Proactive — Big Data Analytics: This is sieving through terabytes, petabytes, and exabytes of information to filter out the relevant data to analyze. This also includes high-performance analytics to gain rapid insights from big data and the ability to solve complex problems using more data.
  • 45. Department of CSE- Data Science Terminologies Used in Big Data Environments In-Memory Analytics  Data access from non-volatile storage such as a hard disk is a slow process. The more data that has to be fetched from the hard disk or secondary storage, the slower the process gets. One way to combat this challenge is to pre-process and store data (cubes, aggregate tables, query sets, etc.) so that the CPU has to fetch only a small subset of records. But this requires thinking in advance about what data will be required for analysis.  If there is a need for different or more data, it is back to the initial process of pre-computing and storing data or fetching it from secondary storage. This problem has been addressed using in-memory analytics. Here all the relevant data is stored in Random Access Memory (RAM) or primary storage, thus eliminating the need to access the data from the hard disk. The advantages are faster access, rapid deployment, better insights, and minimal IT involvement.
  • 46. Department of CSE- Data Science In-Database Processing  In-database processing is also called in-database analytics. It works by fusing data warehouses with analytical systems.  Typically, the data from the various enterprise On-Line Transaction Processing (OLTP) systems, after cleaning up (de-duplication, scrubbing, etc.) through the process of ETL, is stored in the Enterprise Data Warehouse (EDW) or data marts.  The huge datasets are then exported to analytical programs for complex and extensive computations.  With in-database processing, the database program itself can run the computations, eliminating the need for export and thereby saving time. Leading database vendors offer this feature to large businesses.
  • 47. Department of CSE- Data Science Symmetric Multiprocessor System (SMP) • In SMP there is a single common main memory that is shared by two or more identical processors. • The processors have full access to all I/O devices and are controlled by a single operating system instance. • SMP systems are tightly coupled multiprocessor systems. Each processor has its own high-speed memory, called cache memory, and the processors are connected using a system bus. Figure : Symmetric Multiprocessor System.
  • 48. Department of CSE- Data Science Massively Parallel Processing  Massively Parallel Processing (MPP) refers to the coordinated processing of programs by a number of processors working in parallel.  The processors each have their own operating system and dedicated memory. They work on different parts of the same program.  The MPP processors communicate using some sort of messaging interface. MPP systems are more difficult to program, as the application must be divided in such a way that all the executing segments can communicate with each other.  MPP differs from Symmetric Multiprocessing (SMP) in that in SMP the processors share the same operating system and the same memory. SMP is also referred to as tightly-coupled multiprocessing.
  • 49. Department of CSE- Data Science Difference Between Parallel and Distributed Systems Parallel Systems  A parallel database system is a tightly coupled system. The processors co-operate for query processing. Figure : Parallel system
  • 50. Department of CSE- Data Science  The user is unaware of the parallelism since he/she has no access to a specific processor of the system.  Either the processors have access to a common memory or make use of message passing for communication. Figure : Parallel system.
  • 51. Department of CSE- Data Science Distributed database systems  Distributed database systems are known to be loosely coupled and are composed of individual machines.  Each of the machines can run its own application and serve its own respective users. The data is usually distributed across several machines, thereby necessitating that quite a number of machines be accessed to answer a user query. Figure : Distributed system.
  • 52. Department of CSE- Data Science Shared Nothing Architecture  Let us look at the three most common types of architecture for multiprocessor high transaction rate systems.  They are: 1. Shared Memory (SM) 2. Shared Disk (SD). 3. Shared Nothing (SN).  In shared memory architecture, a common central memory is shared by multiple processors.  In shared disk architecture, multiple processors share a common collection of disks while having their own private memory  In shared nothing architecture, neither memory nor disk is shared among multiple processors.
  • 53. Department of CSE- Data Science Advantages of a “Shared Nothing Architecture” 1. Fault Isolation: A “Shared Nothing Architecture” provides the benefit of fault isolation. A fault in a single node is contained and confined to that node exclusively, and is exposed only through messages (or the lack of them). 2. Scalability: Assume that the disk is a shared resource. It implies that the controller and the disk bandwidth are also shared. Synchronization will have to be implemented to maintain a consistent shared state. This means that different nodes will have to take turns to access the critical data. This imposes a limit on how many nodes can be added to the distributed shared-disk system, thus compromising scalability.
  • 54. Department of CSE- Data Science CAP Theorem Explained  The CAP theorem is also called Brewer's Theorem. It states that in a distributed computing environment (a collection of interconnected nodes that share data), it is impossible to provide all three of the following guarantees.  At best you can have two of the following three — one must be sacrificed. 1. Consistency 2. Availability 3. Partition tolerance Figure : Brewer's CAP.
  • 55. Department of CSE- Data Science  Consistency implies that every read fetches the last write.  Availability implies that reads and writes always succeed. In other words, each non-failing node will return a response in a reasonable amount of time.  Partition tolerance implies that the system will continue to function when network partition occurs.
  • 56. Department of CSE- Data Science NoSQL (NOT ONLY SQL)  The term NoSQL was first coined by Carlo Strozzi in 1998 to name his lightweight, open-source, relational database that did not expose the standard SQL interface.  A few features of NoSQL databases are as follows: 1. They are open source 2. They are non-relational 3. They are distributed 4. They are schema-less 5. They are cluster friendly 6. They are born out of 21st century web applications.
  • 57. Department of CSE- Data Science Where is it Used?  NoSQL databases are widely used in big data and other real-time web applications.  NoSQL databases are used to store log data which can then be pulled for analysis.  They are used to store social media data and all such data which cannot be stored and analyzed comfortably in an RDBMS. Figure : Where to use NoSQL?
  • 58. Department of CSE- Data Science What is it?  NoSQL stands for Not Only SQL. These are non-relational, open-source, distributed databases. They are hugely popular today owing to their ability to scale out or scale horizontally, and their adeptness at dealing with a rich variety of data: structured, semi-structured, and unstructured data. Figure: What is NoSQL?
  • 59. Department of CSE- Data Science 1. Are non-relational: They do not adhere to the relational data model. In fact, they are either key-value pairs or document-oriented or column-oriented or graph-based databases. 2. Are distributed: They are distributed, meaning the data is distributed across several nodes in a cluster constituted of low-cost commodity hardware. 3. Offer no support for ACID properties (Atomicity, Consistency, Isolation, and Durability): They do not offer support for the ACID properties of transactions. On the contrary, they adhere to Brewer's CAP (Consistency, Availability, and Partition tolerance) theorem and are often seen compromising on consistency in favor of availability and partition tolerance. 4. Provide no fixed table schema: NoSQL databases are becoming increasingly popular owing to their support for flexibility of the schema. They do not mandate that the data strictly adhere to any schema structure at the time of storage.
  • 60. Department of CSE- Data Science Types of NoSQL Databases 1. Key-value 2. Schema-less Key-value  It maintains a big hash table of keys and values.  For example, Dynamo, Redis, Riak, etc. Sample Key-Value Pair in Key-Value Database
  • 61. Department of CSE- Data Science Figure : Types of NoSQL databases
  • 62. Department of CSE- Data Science
  • 63. Department of CSE- Data Science Why NoSQL?
  • 64. Department of CSE- Data Science Advantages of NoSQL
  • 65. Department of CSE- Data Science
  • 66. Department of CSE- Data Science
  • 67. Department of CSE- Data Science Use of NoSQL in Industry
  • 68. Department of CSE- Data Science HADOOP  Hadoop is an open-source project of the Apache Software Foundation.  It is a framework written in Java, originally developed by Doug Cutting in 2005, who named it after his son's toy elephant. He was working with Yahoo at the time.  It was created to support distribution for “Nutch”, the text search engine. Hadoop uses Google's MapReduce and Google File System technologies as its foundation.  Hadoop is now a core part of the computing infrastructure for companies such as Yahoo, Facebook, LinkedIn, Twitter, etc.
  • 69. Department of CSE- Data Science Figure : Hadoop
  • 70. Department of CSE- Data Science Features of Hadoop
  • 71. Department of CSE- Data Science Key Advantages of Hadoop
  • 72. Department of CSE- Data Science Versions of Hadoop There are two versions of Hadoop available: 1. Hadoop 1.0 2. Hadoop 2.0
  • 73. Department of CSE- Data Science Overview of Hadoop Ecosystems There are components available in the Hadoop ecosystem for data ingestion, processing, and analysis. Data Ingestion → Data Processing → Data Analysis
  • 74. Department of CSE- Data Science Hadoop Distributions  The core aspects of Hadoop include the following: 1. Hadoop Common 2. Hadoop Distributed File System (HDFS) 3. Hadoop YARN (Yet Another Resource Negotiator) 4. Hadoop MapReduce