BIG DATA
Big data is a collection of massive and complex data sets whose volume includes huge quantities of data, data management capabilities, social media analytics and real-time data. Big data analytics is the process of examining such large amounts of data. There exist large amounts of heterogeneous digital data. Big data is about data volume and large data sets measured in terms of terabytes or petabytes. The sheer volume, variety, velocity and veracity of such data is termed "Big Data". Big data is structured, unstructured, semi-structured or heterogeneous in nature. Traditional data management, warehousing and analysis systems fail to analyze this type of data. Due to its complexity, big data is stored in a distributed-architecture file system. The process of capturing or collecting big data is known as "datafication"; big data is "datafied" so that it can be used productively.
Uses of Big Data:
1.Consumer product companies and retail organizations are observing data on social media websites such as Facebook and Twitter. These sites help them to analyze customer behavior, preferences and perception.
2.Advertising and marketing agencies are tracking social media to understand responsiveness to campaigns, promotions and other advertising mediums.
3.The government is making data public at the national, state and city level for users to develop new applications that can generate public goods, for example weather data.
4.Sports teams are using data for tracking ticket sales and even for tracking team strategies.
Types and Sources of Data:
Big Data is a pool of huge amounts of data of all types, shapes and formats collected from various
sources.
Machine data: Refers to the information generated from RFID (radio frequency identification) chips, barcode scanners and sensors. Examples: RFID chip readings, GPS (global positioning system) results.
Transactional data: Refers to the information generated from online shopping sites, retailers and business-to-business (B2B) transactions. Examples: retail websites like eBay and Amazon.
EVOLUTION OF BIG DATA: Big data is a new stage of data evolution driven by the enormous velocity, variety and volume of data.
Variety refers to the various forms of data such as structured, semi-structured and unstructured data.
1.In the early 60s, technology witnessed problems with velocity, i.e. real-time data assimilation. This need inspired the evolution of databases.
2.In the 90s, technology witnessed issues with the variety of data (e-mails, documents, videos), leading to the emergence of NoSQL stores.
3.Today, technology is facing issues relating to huge volumes of data, leading to new storage and processing solutions.
Structuring of data is arranging the available data in such a manner that it becomes easy to study, analyze and derive conclusions from it. Structuring of data can be done with the help of information processing systems. These systems can analyze and structure large amounts of data on the basis of what you searched and what you looked at, and present customised information.
When a user regularly visits or purchases from online shopping sites such as eBay or Amazon, each time the user logs in the system presents a recommended list of products on the basis of earlier purchases or searches, thus presenting a specially customized recommendation set for every user. This is the power of big data analytics.
Structuring data helps in understanding user behavior, requirements and preferences. Various sources generate a variety of data such as images, text, audio etc. All such types of data can be structured only if they are stored and organized in some logical pattern. Thus the process of structuring data requires an understanding of the various types of data available today.
Types of Data:
Concepts of Big Data:
1.Data storage
2.Analysis
3.Distributed systems
4.Data Science
5.Artificial Intelligence
6.Data Mining
7.Parallel Processing
Big data is categorised into three types:
1.Structured data
2.Unstructured data
3.Semi-structured data
In real-world scenarios, unstructured data is larger in volume than structured and semi-structured data.
1.Structured Data: Data that can be stored, accessed and processed in the form of a fixed format is termed "structured data".
Structured data can be defined as data that has defined, repeating patterns, which makes it easier for any program to sort, read and process the data.
Structured data:
Is data that consists of fixed fields within a record or file.
Example: data stored in the rows and columns of relational database tables or spreadsheets.
Unstructured Data:
Comprises inconsistent data such as data obtained from social media websites, satellites etc.
Examples: text documents, images, audio and video files, social media posts.
Semi-structured Data: Semi-structured data contains both forms of data, i.e., semi-structured data refers to a form of structured data that contains tags or markup elements in order to separate elements and generate hierarchies of records and fields in the given data. Such data does not follow the proper structure of data models as in relational databases, i.e., the data is not stored consistently in rows and columns of a database.
Example: XML and JSON documents.
According to Gartner, data is growing at the rate of 59% every year. This growth can be described in terms of the following four V's:
1.Volume
2.Velocity
3.Variety
4.Veracity
Big data refers to the large amounts of information produced by different sources and processed by different systems, characterized by its volume, velocity, variety and veracity. This information can have a numeric structure that systems can process and store easily, or it can have unstructured and dissimilar formats that require a flexible framework.
Ex: A single message in a social media account may contain audio files, images, video files, text, numbers, hyperlinks, locations etc.
Volume: Volume is the amount of data generated by organizations or individuals. The volume of data in most organizations is approaching exabytes. A large volume of data demands better technology to collect, process, store and analyze it. Organizations are doing their best to handle this increasing volume of data. For example, according to IBM, over 2.7 zettabytes of data is present in the digital universe today. Every minute, over 571 new websites are being created.
The internet alone generates huge amounts of data and has around 672 exabytes of accessible data. The internet has around 14.3 trillion live webpages, with 48 billion webpages indexed by Google and 14 billion webpages indexed by Microsoft Bing. The total data stored on the internet, including images, videos and audio, has crossed 1 yottabyte.
Velocity: It describes the rate at which data is generated, captured and shared. Enterprises can capitalize on data only if it is captured and shared in real time. Information processing systems such as CRM and ERP face problems associated with data which keeps adding up but cannot be processed quickly.
Example: eBay analyzes around 5 million transactions per day in real time to detect and prevent fraud arising from the use of PayPal.
The sources of high velocity data include the following:
1.Social media, including Facebook posts, tweets and other activities, creates huge amounts of data which must be analyzed instantly and at fast speed because their value degrades quickly with time.
2.Portable devices, including mobiles, PDAs etc., also generate data at high speed.
Variety: The data is generated from different types of sources, such as internal, external and social, and comes in different formats such as images, text, videos etc. Every single source can generate data in varied formats.
Ex: GPS and social networking sites such as Facebook produce data of all types, including text, images, videos etc.
Veracity: Veracity refers to the uncertainty of data, i.e., whether the obtained data is correct or consistent. Out of the huge amounts of data that are generated in almost every process, only the data that is correct and consistent can be used for further analysis. Data, when processed, becomes information. Big data, especially in unstructured and semi-structured form, is messy in nature and requires a huge amount of time and expertise to clean and make suitable for analysis.
Big Data Analytics changes the way business is conducted in many ways, improving decision making and business process management. Business analytics uses data and other techniques like IT, statistics and quantitative methods to provide results. There are three main types of business analytics:
1.Descriptive Analytics.
2.Predictive Analytics.
3.Prescriptive Analytics.
1.Descriptive Analytics: It is the most common form of analytics and serves as a base for advanced analytics. It answers the question "What happened in the business?". It analyses a database to provide information on the trends of past or current business events that help managers and leaders develop future actions. It performs an in-depth analysis of data to reveal details such as frequency of events, operation costs and underlying reasons for failures. It helps in identifying the root causes of problems.
2.Predictive Analytics: It is about understanding and predicting the future and answers the question "What could happen?" by using statistical models and different forecast techniques. It predicts future probabilities and trends and helps in "what-if" analysis. In predictive analytics we use statistics and machine learning to analyze the future.
3.Prescriptive Analytics: It answers "What should we do?" on the basis of the data obtained from descriptive and predictive analytics. By using optimization techniques, prescriptive analytics determines the best alternatives to minimize or maximize an objective, for example in finance and marketing. For instance, if we have to find the best way of shipping goods from a factory to a destination while minimizing cost, we use prescriptive analytics.
[Figure: Prescriptive analytics - data (numbers, text, sounds, videos and images) is combined with rules, and the cycle is repeated.]
Data which is available in abundance can be streamlined for growth and expansion in technology as well as business. When data is analysed successfully, it can become the answer to the most important question: how can businesses acquire more customers and gain business insights? The key to this problem lies in being able to source, link, understand and analyze data.
The following are the common analytical approaches that businesses apply to use Big Data:
The right analysis of the available data can improve major business processes in various ways. For example, in a manufacturing unit, data analytics can improve the functioning of the following processes:
Procurement: To find out which suppliers are more efficient and cost-effective in delivering products on time.
Product development: To draw insights on innovative product and service formats and designs for enhancing the development process and coming up with in-demand products.
Distribution: To enhance supply chain activities and standardize the optimal inventory level based on various external factors such as weather, holidays, the economy etc.
Marketing: To identify which marketing techniques will be most effective in driving and engaging customers.
Merchandising: To improve merchandise breakdown on the basis of current buying patterns and increase inventory levels and product interest on the basis of customer behavior.
Sales: To optimize the assignment of sales resources and accounts and other operations.
Store Operations: To adjust inventory levels on the basis of predicted buying patterns, study of weather, key events and other factors.
Human Resources: To find the characteristics and behaviors of successful and effective employees.
Transportation: Big Data has greatly improved transportation services. The data containing traffic information is analyzed to identify traffic jam areas. Based on this analysis, suitable steps can be taken to keep the traffic moving in such areas. Distributed sensors installed in handheld devices, on roads and on vehicles provide real-time traffic information.
Education: Big Data has transformed education processes through innovative approaches such as e-learning, enabling teachers to analyze each student's ability to comprehend and thus provide education effectively in accordance with each student's needs. The analysis is done by studying responses to questions, recording the time consumed in attempting those questions and analyzing other behavioural signals of the students.
Travel: The travel industry also uses Big Data to conduct business. It maintains complete details of all customer records, which are then analyzed to determine certain behavioral patterns in customers.
For example, in the airline industry Big Data is analyzed to identify personal preferences or spot which passengers like to have window or aisle seats on flights. This helps airlines offer similar seats to customers when they make a fresh booking with the airline.
Big Data also helps airlines track customers who regularly fly between the same routes so that they can make the right cross-sell and up-sell offers.
Government: Big Data plays an important role in almost all the undertakings and processes of the government. Analysis of Big Data promotes clarity and transparency in various government processes and helps in:
1.Identifying loopholes in processes and taking preventive measures on time.
2.Assessing areas of improvement in various sectors such as education, health, defence and research.
3.Using budgets more judiciously and reducing unnecessary wastage and cost.
Healthcare: In healthcare, pharmacy and medical device companies use Big Data to improve their research and development practices. Health insurance companies use Big Data to determine patient-specific treatment and therapy modes for the best results.
Telecom: The mobile revolution and internet usage on mobile phones have led to a tremendous increase in the amount of data generated in the telecom sector. Managing this huge amount of data has become a challenge for the telecom industry. Big data analytics allows telecom companies to utilise this data to extract meaningful information and gain crucial business insights that help them enhance performance, improve customer services and identify business opportunities.
In distributed computing, multiple computing resources are connected in a network and computing tasks are distributed across these resources. This sharing of tasks increases the speed and efficiency of the system. Distributed computing is suitable for processing huge amounts of data in a limited time.
The following figure shows the processing of a large dataset in a distributed computing environment:
[Figure: a control server distributing tasks to grid nodes]
The nodes are arranged within a system and are the elements that form the core of the computing resources. These resources include CPU, memory, disks, etc. Big Data systems usually have higher scaling requirements, so these nodes are very beneficial for adding scalability to the big data environment. A system with added scalability can accommodate growing amounts of data more efficiently and flexibly. The distributed computing technique also makes use of virtualization and load balancing features.
The sharing of workload across various systems throughout the network to manage the load is known as "load balancing". The virtualization feature creates a virtual environment in which hardware platforms, storage devices and operating systems are included.
Parallel computing for Big Data: One way to improve the processing capability of a computer system is to add additional computational resources to it. This helps in dividing complex computations into subtasks, which can be handled individually by processing units running in parallel; such systems are called "parallel systems".
In parallel systems, multiple computing resources are involved in carrying out calculations simultaneously. Parallel computing technology uses a number of techniques to process and manage huge amounts of data at high speed. The following are some of the techniques:
Cluster or Grid computing (used in Hadoop): Cluster or grid computing is based on a connection of multiple servers in a network. This network is known as a "cluster", in which the servers share the workload among themselves. A cluster can be either homogeneous (consisting of the same type of hardware) or heterogeneous (consisting of different types of hardware). A cluster can be created by using hardware components that provide cost-effective storage options; the overall cost may be very high in cluster computing.
Massively parallel processing (MPP) (used in data warehouses): A single machine working as a grid is used in the MPP platform, which is capable of handling the activities of storage, memory and computing. MPP platforms such as EMC Greenplum and ParAccel are most suited for high-value use cases.
Distributed system: Independent systems connected in a network for performing specific tasks; loose coupling of computers connected in a network, providing access to data and remotely located resources.
Parallel system: A computer system with several processing units attached to it; tight coupling of processing resources that are used for solving a single, complex problem.
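To make the parallel-systems idea above concrete, the following is a minimal, framework-free Java sketch (an assumed illustration, not part of these notes): a large computation is divided into subtasks that several processing units handle simultaneously, and the partial results are then combined.

```java
// A minimal sketch of the parallel-systems idea: divide a computation into subtasks,
// run them simultaneously on several processing units, then combine partial results.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSumExample {
    public static void main(String[] args) throws Exception {
        long[] data = new long[10_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;   // stand-in for a large dataset

        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Divide the work into one chunk per worker (the subtask-creation step).
        int chunk = (data.length + workers - 1) / workers;
        List<Future<Long>> partials = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int start = w * chunk;
            final int end = Math.min(start + chunk, data.length);
            partials.add(pool.submit((Callable<Long>) () -> {
                long sum = 0;
                for (int i = start; i < end; i++) sum += data[i];
                return sum;
            }));
        }

        // Combine the partial results into the final answer.
        long total = 0;
        for (Future<Long> partial : partials) total += partial.get();
        pool.shutdown();

        System.out.println("total = " + total);
    }
}
```

Big data frameworks such as Hadoop apply the same divide-process-combine pattern, but across many machines rather than the cores of a single one.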
Cloud Computing and Big Data: One of the major issues that organizations face with the storage and management of Big Data is the huge amount of investment needed to get the required hardware setup and software packages. Some of these resources may be over-utilized as requirements vary over time. We can overcome these challenges by providing a set of computing resources that can be shared through cloud computing.
In cloud-based platforms, applications can easily obtain the resources to perform computing tasks. The costs of acquiring these resources are paid as they are used. Cloud computing allows organizations to dynamically regulate the use of computing resources and access them as per their needs, while paying only for those resources that are used.
Cloud computing uses data centers to collect data and ensures that data backup and recovery are automatically performed according to the requirements of the business.
Both cloud computing and Big Data analytics use the distributed computing model in a similar manner.
Features of Cloud Computing: The following are some features of cloud computing that can be used to
handle Big Data:
1.Scalability
2.Elasticity.
3.Resource pooling.
4.Self service.
5.Low cost.
6.Fault Tolerance.
1.Scalability: It means the addition of new resources to an existing infrastructure. The increase in the amount of data being collected and analyzed requires organizations to improve the processing ability of their hardware components. Sometimes organizations replace the existing hardware with a new set of hardware components in order to improve data management and processing activities, but the new hardware may not provide complete support to the software. We can solve such issues by using cloud services that employ the distributed computing technique to provide scalability to the architecture.
2.Elasticity: Elasticity in the cloud means hiring certain resources when required and paying only for the resources that have been used. No extra payment is required for acquiring specific cloud services.
For example, a business expecting the use of more data during in-store promotions could hire more resources to provide high processing speed.
3.Resource pooling: Resource pooling is an important aspect of cloud services for big data analytics. With resource pooling, multiple organizations which use similar kinds of resources to carry out computing practices do not need to individually hire all the resources. The sharing of resources is allowed in a cloud, which facilitates cost cutting through resource pooling.
4.Self service: Cloud computing involves a user interface that helps customers directly access the cloud services they want. The process of selecting the needed services requires no human involvement and can be performed automatically.
5.Low cost: Careful planning, use, management and control of resources help organizations reduce the cost of acquiring hardware. The cloud offers customized solutions, especially to organizations that cannot afford too much initial investment in purchasing the resources used in big data analytics. The cloud provides a pay-as-you-use option in which organizations sign up only for those resources that are essential.
6.Fault tolerance: Cloud computing provides fault tolerance by offering uninterrupted services to customers, especially in cases of component failure. The responsibility of handling the workload is shifted to other components of the cloud.
CLOUD DEPLOYMENT MODELS: The cloud is a multi-purpose platform that helps in handling big data analytics operations and also in performing various tasks including data storage, data backup and customer service. Depending upon the architecture used in the network, the services and applications used, and the target consumers, cloud services are offered in the form of various deployment models. The following are the most commonly used cloud deployment models:
1.Public Cloud
2.Private Cloud
3.Community Cloud
4.Hybrid Cloud
1.Public Cloud (End-user level cloud): A cloud that is owned and managed by a company other than the one (which can be either an individual user or a company) using it is known as a public cloud. In this cloud, there is no need for the organizations to control or manage the resources; they are administered by a third party.
An example of a public cloud provider is Amazon Web Services (AWS). In the case of a public cloud, the resources are owned or hosted by the cloud service provider (a company) and the services are sold to other companies. The process of computing becomes very flexible and scalable through customized hardware resources.
For example, a cloud can be used specially for video storage that can be streamed live on YouTube. Businesses can obtain cloud storage solutions in a public cloud, which provides an efficient mechanism for handling complex data.
Private Cloud: A private cloud combines all the processes, systems, rules, policies, compliance checks etc. of the organization in one place. Users can provide firewall protection to the cloud.
A private cloud can be either on-premises or hosted externally. In the case of on-premises private clouds, the service is exclusively used and hosted by a single organization. Private clouds that are hosted externally are also used by a single organization and are not shared with other organizations. The main objective of a private cloud is not to sell cloud services (IaaS/PaaS/SaaS) to external organizations.
Community Cloud:
Community cloud is a type of cloud that is shared among various organizations with a common tie.This cloud is
generally managed by a third party offering the cloud service and can be made available on or off premises.
For example, in any state or country a community cloud can be provided so that almost all government organizations of that state can share the resources available on the cloud. Because of the sharing of cloud resources on the community cloud, the data of all citizens of that state can be easily managed by the government organizations.
Hybrid Cloud: The cloud environment in which various internal or external service providers offer services to many organizations is known as a "hybrid cloud". An organization hosts applications which require high-level security and are critical on the private cloud, while applications that are not so important or confidential can be hosted on the public cloud.
In hybrid clouds, an organization can use both types of clouds, i.e., public and private, together. The organization using the hybrid cloud can manage an internal private cloud for general use and migrate part or all of an application to the public cloud.
The cloud environment provides computational resources in the form of hardware, software and platforms, which are used as services.
1.Infrastructure as a service (IaaS): In IaaS, the basic computing infrastructure is provided as a service. Virtual machines, load balancers and network-attached storage are some examples of IaaS.
2.Platform as a service (PaaS): User applications are provided a platform for writing code and executing it through the PaaS cloud. In cloud terms, PaaS is a platform that combines software development and deployment tools along with middleware services.
Windows Azure and Google App Engine (GAE) are examples of PaaS clouds. In organizations having a private PaaS, programmers can create and deploy applications for their requirements.
3.Software as a service (SaaS): SaaS provides software applications that are accessible from wherever the user is. Customers do not need to purchase the software or install it on their own devices; they can use it directly from the cloud. These applications are hired through annual contracts and are operational only if IaaS and PaaS clouds are already being used.
Cloud services are associated with various models that are used for delivery and deployment. Cloud services
are infrastructure,platforms or software that are hosted by third party providers and made available to users
through the internet.There are three basic types of cloud services.
1.IaaS (Infrastructure as a service): The huge storage and computational requirements for big data are fulfilled by the limitless storage space and computing ability offered by the IaaS cloud.
Examples of IaaS are Amazon Web Services (AWS), Microsoft Azure and Google Compute Engine. These providers maintain all the storage servers and networking hardware and may also offer load balancing and application firewalls.
2.PaaS (Platform as a service): PaaS serves as a web-based environment where developers can build cloud apps. PaaS provides the database, operating system and programming language that organizations can use to develop cloud-based software.
Big Data cloud providers render services that are relevant to big data analytics and data visualization. The following are such providers in the big data market:
1.Amazon: Amazon is one of the largest cloud service providers and offers its cloud services as Amazon Web Services (AWS). AWS includes some of the most popular cloud services, such as:
i.EC2 (Elastic Compute Cloud): A web service that employs a large set of computing resources to perform business operations.
ii.Elastic MapReduce: A web service that uses Amazon EC2 computation and Amazon S3 storage for storing and processing large amounts of data, so that the cost of processing and storage is reduced significantly.
iii.DynamoDB: A NoSQL database system in which data storage is done on solid state drives (SSDs). DynamoDB allows data replication for high availability and durability.
iv.Amazon S3 (Amazon Simple Storage Service): A web interface that allows data storage over the internet and makes web computing possible.
v.High performance computing (HPC): A network that provides the high bandwidth, low latency and high computational abilities required for processing big data and solving problems in educational and business domains.
vi.RedShift: A data warehouse service that is used to analyze data with the help of existing business intelligence tools. Amazon RedShift can handle data up to petabytes.
2.Google: The cloud services provided by Google for handling Big Data include the following:
i.Google Compute Engine: a secure and flexible computing environment based on virtual machines.
ii.Google BigQuery: a Data as a Service (DaaS) offering used for searching huge amounts of data at a faster pace on the basis of SQL-format queries.
iii.Google Prediction API: used for identifying patterns in data, storing these patterns and improving them with successive utilization.
3.Windows Azure: Microsoft offers a PaaS cloud that is based on Windows and SQL and consists of a set of development tools, virtual machine support, media services and mobile device services. Windows Azure PaaS is easy to adopt for people who are well versed in .NET, SQL Server and Windows. Microsoft Azure allows organizations to work with large datasets and to carry out all their operations in the cloud.
4.Oracle Cloud: Oracle is a database platform available to businesses through its Oracle Cloud service, offering flexible, scalable storage along with analytics and data processing services. The service is highly rated for its strong security features, including real-time encryption of all data sent to the platform.
UNIT II
INTRODUCING HADOOP:
Traditional technologies like relational databases and data warehouses are incapable of handling the huge amounts of data generated in organizations or of fulfilling the processing requirements of such data, because the data generated today is semi-structured or unstructured, whereas traditional systems are designed to handle only structured data in the form of rows and columns.
One of the technologies designed to process big data which is a combination of both structured and
unstructured data available in huge volumes is known as "HADOOP".
1.Hadoop is an open source platform that supports the processing of large data sets in a distributed computing
environment
2.A Hadoop cluster consists of a single MasterNode and multiple WorkerNodes. The MasterNode contains a NameNode and a JobTracker, and a Slave or WorkerNode acts as both a DataNode and a TaskTracker.
3.Hadoop consists of MapReduce, HDFS and several related projects such as Apache Hive, HBase etc. MapReduce and HDFS are the main components of Hadoop.
4.Hadoop requires Java Runtime Environment (JRE) 1.6 or higher versions for storing and managing big data.
5.Hadoop runs applications on large clusters and processes thousands of terabytes of data on thousands of nodes.
ADVANTAGES OF HADOOP:
1.Robust and scalable: We can add new nodes as well as modify them.
2.Cost effective: We do not need any special hardware for running Hadoop; we can just use commodity servers.
3.Fault tolerant: When a node fails, the Hadoop framework automatically assigns its work to another node.
1.Hadoop MapReduce: Hadoop MapReduce is a method which splits a larger data problem into smaller chunks and distributes them to many different servers. Each server has its own set of resources and processes its chunk locally. Once the servers have processed the data, they send it back collectively to the main server.
2.Hadoop Distributed File System (HDFS): HDFS is a virtual file system. When we move a file onto HDFS, it is automatically split into many small pieces. These small chunks of the file are replicated and stored on other servers for fault tolerance.
3.NameNode: The NameNode is the heart of the Hadoop system. The NameNode manages the file system namespace. It stores the metadata information of the data blocks. If the NameNode crashes, the entire Hadoop system goes down.
4.Secondary NameNode: The Secondary NameNode is used to copy and merge the namespace image and edit log. In case the NameNode crashes, the namespace image stored in the Secondary NameNode can be used to restart the NameNode.
5.DataNode: DataNodes run on the slave machines; they store the actual data blocks and serve read and write requests from clients.
6.JobTracker: The JobTracker is responsible for scheduling the clients' jobs. The JobTracker creates map and reduce tasks and schedules them to run on the data nodes. It also checks for any failed tasks and re-schedules them on other data nodes.
7.TaskTracker: The TaskTracker runs on the data nodes and is responsible for running the map or reduce tasks assigned by the JobTracker and reporting the status of the tasks to the JobTracker.
8.Blocks in HDFS: A block is the smallest unit of storage on a computer storage system. In Hadoop, the default block size is 128 MB, and it can be configured to larger values such as 256 MB.
Features of Hadoop:
1.Suitable for big data analysis: Hadoop clusters are suitable for the analysis of big data.
2.Scalability: Hadoop clusters can easily be scaled by adding additional cluster nodes, thus allowing for the growth of big data.
3.Fault tolerance: The Hadoop ecosystem replicates the input data onto other cluster nodes; if a cluster node fails, data processing can still proceed by using the data stored on another cluster node.
4.Great computational ability: The distributed computational model enables fast processing of big data with multiple nodes running in parallel.
5.Cost effective: Open source technologies are free of cost and hence require a smaller amount of money to implement.
Hadoop Distributed File System: Hadoop includes a fault-tolerant storage system called the "Hadoop Distributed File System (HDFS)". It stores large files, from terabytes to petabytes. HDFS provides reliability by replicating the data over multiple hosts.
The default replication value is 3, i.e., data is replicated on three nodes: two on the same rack and one on a different rack. Files in HDFS are split into large blocks (64 MB by default in early Hadoop versions, 128 MB in later versions) and each block of a file is independently replicated at multiple data nodes. The NameNode actively monitors the number of replicas of a block (by default 3). When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block.
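As a concrete illustration of how a client application interacts with HDFS, here is a minimal Java sketch using Hadoop's FileSystem API (an assumed example, not from these notes); the NameNode URI hdfs://namenode:9000 and the file path are hypothetical placeholders.

```java
// Minimal HDFS client sketch: write a file into HDFS and read it back.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // hypothetical NameNode URI

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/sample.txt");   // hypothetical path

            // Write: the client asks the NameNode for metadata, then streams the
            // data to DataNodes, which replicate each block (3 copies by default).
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: blocks are fetched from whichever DataNodes hold replicas.
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}
```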
Features of Hadoop:
1.Hadoop performs well with several nodes without requiring shared memory or disks.
2.Hadoop follows a client-server (master-slave) architecture in which the server works as the master and is responsible for data distribution among the clients, which work as slaves to carry out all the computational tasks. The master node also performs the tasks of job control, disk management and work allocation.
3.The data stored across various nodes can be tracked through the Hadoop NameNode, which helps in accessing and retrieving data as and when required.
4.Hadoop keeps multiple copies of data to improve resilience, which helps in maintaining consistency, especially in case of server failure.
5.Hadoop improves data processing by running computing tasks on all available processors working in parallel.
MapReduce:
MapReduce is a framework that helps developers write programs to process large volumes of unstructured data in parallel over a distributed or standalone architecture.
1.JobTracker
2.TaskTrackers
3.JobHistoryServer
4.Input Data
5.Output Data
1.JobTracker: The master node that manages all jobs and resources in a cluster of computers.
2.TaskTrackers: Agents deployed on each machine in the cluster to run the map and reduce tasks.
3.JobHistoryServer: A component that keeps track of completed jobs and their history.
4.Input Data: The data set that is fed to MapReduce for processing.
5.Output Data: The final result produced once all map and reduce tasks are complete.
We can write MapReduce programs in several languages like C, C++, Java and Python. Programmers use MapReduce libraries to build tasks without communication or coordination between the nodes. Each node periodically reports its status to the master node; if a node does not respond as expected, the master node reassigns that job to another available node in the cluster, so MapReduce is also fault tolerant.
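The classic WordCount program illustrates the Mapper and Reducer pair at the heart of this framework; the sketch below uses the standard Hadoop Java API (the driver that configures and submits the job is shown later, in the Working of MapReduce section).

```java
// WordCount sketch: the Mapper emits (word, 1) pairs; the Reducer sums them per word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: each input line is split into words, emitted as (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for the same word are summed into one result.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```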
The following diagram shows the MapReduce architecture:
[Figure: MapReduce architecture on top of HDFS]
1.Hadoop Distributed File System: It is a clustered storage solution that is highly reliable and efficient and provides facilities to manage files containing related data across machines.
2.Hadoop MapReduce: It is the computational framework used in Hadoop to perform all the mathematical computations. It is based on a parallel and distributed implementation of the MapReduce algorithm that provides high performance.
Hadoop facilitates the processing of large amounts of data present in both structured and unstructured forms.
Hadoop clusters are created from racks of commodity machines. Tasks are distributed across these machines
known as nodes, which are allowed to work independently and provide their responses to the starting node.
Hadoop divides the computing tasks into subtasks that are handled by individual nodes with the help of the MapReduce model, which consists of two functions, namely the "Mapper" and the "Reducer".
1.The "Mapper" function is used for mapping the computational subtasks to different nodes, and the "Reducer" function is used for reducing the responses to a single result. The MapReduce model implements a MapReduce algorithm for breaking the data into subtasks, processing that data on the distributed cluster and making that data available for additional processing or user requirements.
2. In the MapReduce algorithm, the process of distributing the task across various systems , handling the task
placements for load balancing and managing the failure recovery are done by a "Map component" or "Mapper"
function. The "Reduce component" or "Reducer" function is used to consolidate all the elements together after
the completion of distributed computation.
When an indexing job is provided to Hadoop, the organizational data needs to be loaded first. Next, the data is divided into various parts and each part is forwarded to a different individual server. Each server has a job code with the part of the data it is required to process. The job code helps Hadoop to track the current state of data processing. Once a server completes the operations on the data provided to it, the response is forwarded with the job code appended to the result. At the end, the results from all nodes are integrated by the Hadoop software and provided to the user, as shown in the following figure.
[Figure: an indexing job is split by the Hadoop software into parts (input data + job code 1, 2, 3) sent to different servers, and the responses are integrated into a single result]
HADOOP ECOSYSTEM:
[Figure: Hadoop ecosystem stack - MapReduce (cluster management) and YARN (cluster and resource management) running over HDFS and HBase]
1.HDFS (Hadoop Distributed File System): HDFS is the storage component of Hadoop that stores data in the form of files. Each file is divided into blocks of 128 MB. It works with a master-slave architecture with two main components: the "NameNode" and the "DataNode".
NameNode: also known as the "master node", it stores metadata, i.e., the number of blocks, their locations, and on which rack and which DataNode the data is stored. It keeps track of files and directories.
DataNode: also known as the "slave node", it stores the blocks of data. DataNodes perform read and write operations as per the requests of the clients.
2.MapReduce: MapReduce divides a single task into multiple tasks and processes them on different machines. It has two phases: Map and Reduce. The "Map" phase filters, groups and sorts the data. Input data is divided into multiple splits; each map task works on a split of data in parallel on different machines and outputs key-value pairs. The Reduce phase aggregates the data, summarises the result and stores it on HDFS.
3.YARN (Yet Another Resource Negotiator): YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and for scheduling tasks to be executed on different cluster nodes. YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads.
4.HBase: HBase is a distributed, column-oriented, non-relational database built on top of the Hadoop file system. HBase provides real-time access for reading and writing data in HDFS.
5.Pig: Apache Pig is a high-level language platform for analyzing and querying huge datasets that are stored in HDFS. Pig, as a component of the Hadoop ecosystem, uses the Pig Latin language, which is similar to SQL. It loads the data, applies the required filters and dumps the data in the required format.
6.Hive: Hive is an open source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hive performs three main functions:
i.Data summarization
ii.Query
iii.Analysis
7.Sqoop: Sqoop imports data from external sources into related Hadoop ecosystem components like HDFS, HBase or Hive. It also exports data from Hadoop to other external sources. Sqoop works with relational databases such as Oracle and MySQL.
8.Mahout: Mahout is an open source framework for creating scalable machine learning algorithms and a data mining library. Once data is stored in Hadoop HDFS, Mahout provides data science tools to automatically find meaningful patterns in those big datasets.
9.Flume: Flume efficiently collects, aggregates and moves large amounts of data from its origin and sends it back to HDFS. This component allows data to flow from the source into the Hadoop environment. It uses a simple, extensible data model that allows for online analytic applications. Using Flume, we can get data from multiple servers immediately into Hadoop.
10.Oozie: Oozie is a workflow scheduler system for managing Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
11.Chukwa: Chukwa is an open source data collection system for monitoring large distributed systems.
12.Avro: Avro is an open source project that provides data serialization and data exchange services for Hadoop.
HDFS is the primary data storage system under Hadoop applications.It is a distributed file system and
provides high throughput access to application data.It manages the large amounts of structured and
unstructured data.HDFS distributes the processing of large data sets over clusters of inexpensive
computers.
Benefits of HDFS:
1.Large dataset storage: HDFS stores a variety of data of any size, from gigabytes to petabytes, and in any format, including structured and unstructured data.
2.Streaming data access: HDFS is built for high data throughput, which is best for access to streaming data.
3.Fast recovery from hardware failures: HDFS is designed to detect failures or faults and automatically recover on its own.
4.Reliable: The file system stores multiple copies of data on separate systems to ensure it is always available.
5.Cost effectiveness: HDFS is open source software that comes with no licensing fee or support cost. The data nodes that store the data rely on inexpensive hardware, which cuts storage costs.
6.Portability: HDFS is portable across hardware platforms and is compatible with many underlying operating systems such as Windows, Linux and Mac OS.
Features of HDFS:
1.HDFS is fault tolerant, i.e., as data is stored across multiple nodes, if any of the machines in the cluster fails, the data will still be available from another node.
2.HDFS splits large files into small units known as blocks, which contain a certain amount of data that can be read or written. The default block size is 128 MB.
3.HDFS also provides data replication, data reliability and data availability features.
HDFS Applications:
1.Financial services
2.Retail industry
3.Telecommunications
4.Insurance
5.Marketing
6.Research
HDFS Architecture:
HDFS follows a master-slave architecture. It consists of a NameNode and a number of DataNodes. The NameNode is the master that manages the various DataNodes, as shown in the figure.
NameNode: The NameNode manages the HDFS cluster metadata, whereas the DataNodes store the data. Records and directories are presented by clients to the NameNode and are managed on the NameNode. Operations such as modification, opening and closing are performed by the NameNode.
DataNode: Internally, a file is divided into one or more blocks, which are stored in a group of DataNodes. DataNodes serve read and write requests from the clients. DataNodes can also execute operations like creation, deletion and replication of blocks, depending on the instructions from the NameNode.
[Figure: HDFS architecture - clients issue metadata operations (e.g., name=/home/foo/data, replicas=3) to the NameNode and read or write blocks directly on the DataNodes]
HDFS divides files into blocks and stores each block on a DataNode. Multiple DataNodes are linked to the master node in the cluster, the NameNode. The master node distributes replicas of these data blocks across the cluster. It also tells the client where to locate the wanted information. Before the NameNode can store and manage the data, the file first needs to be partitioned into smaller data blocks. This process is called "data block splitting".
HDFS stores data in terms of blocks. Files in HDFS are broken into block-sized chunks called data blocks, which are stored as independent units, and the Hadoop framework is responsible for distributing these blocks across multiple nodes. The default block size was 64 MB in early Hadoop versions and is 128 MB by default in Hadoop 2 and later.
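The block splitting and replication described above can also be inspected programmatically; the following sketch (an assumed example, with a hypothetical file path) uses the FileStatus and BlockLocation APIs to print a file's block size, replication factor and the DataNodes holding each block.

```java
// Inspect how HDFS has split and replicated a file: block size, replication, hosts.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            FileStatus status = fs.getFileStatus(new Path("/user/demo/big-input.log"));

            System.out.println("Block size  : " + status.getBlockSize());    // e.g. 134217728 (128 MB)
            System.out.println("Replication : " + status.getReplication());  // e.g. 3

            // Each BlockLocation lists the DataNodes holding a replica of that block.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("Block at offset " + block.getOffset()
                        + " stored on " + String.join(", ", block.getHosts()));
            }
        }
    }
}
```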
HDFS performance is achieved through the distribution of data and through fault tolerance, by detecting faults and quickly recovering the data. This recovery is accomplished through replication and results in a reliable file system capable of storing huge files.
1.Monitoring: The DataNodes and the NameNode communicate through continuous signals (heartbeats). If a signal is not heard by either of the two, the node is considered to have failed and is no longer available. The failed node is replaced by a replica, and the replication plan is adjusted accordingly.
2.Rebalancing: Blocks are shifted from one location to another wherever free space is available, or when there is an increased demand for the data or an increased demand for replication due to frequent node failures.
3.Metadata Replication: The metadata files are prone to failures; however, replicas of the corresponding files are maintained on the same HDFS.
1.NameNode: The NameNode deals with the file system namespace. It stores the metadata for all the documents and indexes in the file system. This metadata is stored on the local disk as two files: i) the file system image (fsimage) and ii) the edit log.
2.DataNodes: DataNodes are the workhorses of the file system. They store and retrieve blocks when asked to by clients or the NameNode, and they report back to the NameNode with lists of the blocks they are storing.
Without the NameNode, the file system cannot be used. In fact, if the machine running the NameNode crashes, all files on the file system would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the DataNodes. This is why it is important to make the NameNode robust against failures, and Hadoop provides two ways of doing this.
The first way is to back up the files that make up the file system metadata. Hadoop can be configured so that the NameNode writes its state to multiple file systems.
The other way is to run a Secondary NameNode, which does not operate like a normal NameNode. The Secondary NameNode periodically reads the file system edit log and applies the changes to the fsimage file. When the NameNode is down, the Secondary NameNode can be brought online, but it only has read access to the fsimage and edit log files.
Features of HDFS:
The Hadoop Distributed File System is the primary data storage system used by Hadoop applications. Hadoop uses a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters. Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications.
1.Data replication
2.Fault tolerance
3.High availability
4. Scalability
5. Distributed storage
6. Data locality
1.Data replication: This is used to ensure that the data is always available and prevents data loss. For example, when a node crashes or there is a hardware failure, replicated data can be pulled from elsewhere within the cluster, so processing continues while the data is recovered.
2.Fault tolerance: Hadoop is highly fault tolerant: the Hadoop framework divides the data into blocks and then creates multiple copies of the blocks on different machines in the cluster. So, when any machine in the cluster goes down, a client can easily access that data from another machine which contains the same copy of the data block.
3.High availability: Hadoop HDFS is a highly available file system; because of replication across multiple nodes, data is available even if a NameNode or DataNode fails.
4.Scalability: HDFS stores data on various nodes in a cluster; as requirements increase, a cluster can be scaled to hundreds of nodes.
5.Distributed storage: Since HDFS stores data in a distributed manner, the data can be processed in parallel on a cluster of nodes. This, plus data locality, reduces the processing time and enables high throughput.
6.Data locality: With HDFS, computation happens on the DataNode where the data resides, rather than having the data move to where the computational unit is. By minimizing the distance between the data and the computing process, this approach decreases network congestion and boosts the system's overall throughput.
MapReduce: MapReduce is a parallel programming framework for processing large amounts of data stored across different systems. The process is initiated when a user request is received to execute the MapReduce program and is terminated once the results are written back to HDFS.
MapReduce facilitates the processing and analysis of both unstructured and semi-structured data collected from different sources. MapReduce enables computational processing of data stored in a file system without the requirement of loading the data into a database. It primarily supports two operations:
1.Map
2.Reduce
1.Map: The Map job processes the input data, which is in the form of files and directories stored in HDFS. The Mapper processes the data and creates several small chunks of data.
2.Reduce: The Reducer job processes the data that comes from the Mapper. After processing, it produces a new set of output, which is stored in HDFS.
These two operations execute in parallel on a set of worker nodes. MapReduce works on a master/worker approach in which the master process controls and directs the entire activity, such as collecting, segregating and delegating the data among the different workers.
Working of MapReduce:
[Figure: input data is split into partitions, each partition is processed by a map task producing intermediate results, and reduce tasks aggregate the intermediate results into the final results]
1.A MapReduce worker receives data from the master, processes it, and sends back the generated result to
the master.
2.MapReduce workers run the same code on the received data; however they are not aware of other co-
workers and do not communicate or interact with each other.
3.The master receives the results from each worker process, integrates and processes them, and generates
the final output.
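A driver program ties this master/worker flow together: it configures the job, points it at the input and output paths, and waits for the cluster to finish all map and reduce tasks. The sketch below (an assumed example; paths are hypothetical) submits the WordCount Mapper and Reducer shown earlier.

```java
// Driver sketch: configure and submit the WordCount job, then wait for completion.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // hypothetical input
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // must not exist yet

        // Block until the cluster finishes all map and reduce tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```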
Introducing HBase:
HBase is a column-oriented distributed database built on top of HDFS. HBase is used for real-time, continuous read/write access to huge datasets. HBase is defined as an open source, distributed, NoSQL, scalable database system written in Java. It was developed by the Apache Software Foundation. HBase can host very large tables consisting of billions of rows and columns. HBase facilitates reading and writing big data randomly and efficiently in real time. It is highly configurable, allows efficient management of huge amounts of data and helps in dealing with Big Data challenges in many ways.
It stores data in tables with rows and columns as in an RDBMS; the intersection of a row and a column is called a "cell".
HBase tables have one key feature, called versioning, which differentiates them from RDBMS tables. Each cell in an HBase table has an associated attribute termed its "version", which provides a timestamp to uniquely identify the cell. Versioning helps in keeping track of the changes made in a cell and allows the retrieval of previous versions of the cell contents if required.
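The following sketch shows versioning in practice through the HBase 2.x Java client API (an assumed example, not from these notes): it creates a table whose column family keeps up to three versions, writes the same cell twice and reads both timestamped versions back. The table, column family and qualifier names are hypothetical, and an hbase-site.xml pointing at the cluster's ZooKeeper is assumed to be on the classpath.

```java
// HBase client sketch: create a versioned table, write a cell twice, read back versions.
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseVersioningExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            TableName name = TableName.valueOf("customers");                 // hypothetical table
            admin.createTable(TableDescriptorBuilder.newBuilder(name)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("info"))
                            .setMaxVersions(3)        // keep up to 3 timestamped versions per cell
                            .build())
                    .build());

            try (Table table = conn.getTable(name)) {
                byte[] row = Bytes.toBytes("cust-001");
                byte[] family = Bytes.toBytes("info");
                byte[] qualifier = Bytes.toBytes("city");

                // Two writes to the same cell produce two versions with different timestamps.
                table.put(new Put(row).addColumn(family, qualifier, Bytes.toBytes("Delhi")));
                table.put(new Put(row).addColumn(family, qualifier, Bytes.toBytes("Mumbai")));

                // Ask for all stored versions; the newest value is returned first.
                Result result = table.get(new Get(row).readVersions(3));
                for (Cell cell : result.getColumnCells(family, qualifier)) {
                    System.out.println(cell.getTimestamp() + " -> "
                            + Bytes.toString(CellUtil.cloneValue(cell)));
                }
            }
        }
    }
}
```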
HBase Architecture:
HBase is a column-oriented NoSQL database in which the data is stored in tables. The HBase table schema defines only column families. An HBase table contains multiple families, and each family can have an unlimited number of columns. The column values are stored in a sequential manner on disk. Every cell of the table has its own timestamp.
Applications store information in labeled tables, and the values stored in table cells get updated from time to time. A cell value is an uninterpreted array of bytes.
Table rows are sorted by the row key, which is the table's primary key. Rows and columns are clustered into column families. All members of a family share a common prefix. Column families must be specified up front as part of the table schema, but new column family members (columns) can be added on a need basis.
For example, a new column address:pincode can be provided by a client as part of an update, and its value persists as long as the column family exists on the targeted table. All the family members are saved together in the file system, which is why HBase is described as a column-family-oriented store.
HBase tables are like the ones in an RDBMS, but with versioned cells, sorted rows and on-the-fly addition of columns as per the client's requirements.
[Figure: HBase architecture - clients communicate with the HMaster and with region servers, each hosting several regions]
The HBase architecture consists of three major components.
1.Hmaster
2.Region server
3.Zookeeper
1.HMaster: It assigns regions to the region servers. The HMaster manages the region servers and the Hadoop cluster to handle operations such as creating and deleting tables.
2.Region server: Region servers are the end nodes that handle all user requests. Several regions are combined within a region server. These regions contain all the rows between specified keys.
3.ZooKeeper: ZooKeeper acts as a bridge for communication across the HBase architecture. It is responsible for keeping track of all the region servers and the regions within them. ZooKeeper monitors which region servers and HMasters are active and which have failed.
Regions of HBase:
Tables are automatically partitioned horizontally into regions by HBase. Each region consists of a subset of the rows of a table. Initially a table comprises a single region, but as the size of the region grows and crosses a configurable limit, it splits at a row boundary into two new regions of almost equal size, as shown in the figure.
[Figure: regions of several tables (e.g., table A regions 1-4, table G region 1070, table L region 25, table F regions 160 and 776) spread across region servers 7, 86 and 367]
Regions are the units that get spread over a cluster in HBase. Hence a table too big for any single server can be carried by a cluster of servers, with each node hosting a subset of the total regions of a table. This is also the medium by which the load on a table gets spread. At a given time, the online set of sorted regions comprises the table's total content.
HBase is designed according to Google Bigtable (a compressed, high-performance, proprietary data storage system built on the Google File System) and is capable of providing extensive tables (billions of rows and columns) on Hadoop clusters of commodity hardware.
Each cell value carries a version property, which is just a timestamp uniquely distinguishing the cell. Versioning tracks changes in the cell and makes it possible to retrieve any version of the contents if required. HBase stores cell versions in descending timestamp order, so a read finds the newer values first.
HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS).
[Figure: the Hadoop framework - Hive, MapReduce and HBase layered over HDFS]
HBase provides random access storage and retrieval of data which HDFS cannot.
HBase is structured similarly to HDFS (NameNode and DataNodes) and MapReduce (JobTracker and TaskTrackers): in HBase, a master node manages the cluster, and region servers store portions of the tables and perform tasks on the data.
The data model of HBase is similar to Google's Bigtable design, and HBase provides fault tolerance in a way similar to HDFS.
HBase is part of the Hadoop ecosystem that provides read and write access for real time data in Hadoop file
system.
HBase is a distributed database that uses ZooKeeper to manage clusters and HDFS as the underlying storage. At the architecture level, it consists of an HMaster and multiple HRegionServers.
HBase MapReduce:
The relationship between table and region in HBase is similar to the relationship between file and block on
HDFS.
HBase provides APIs for interacting with MapReduce, such as TableInputFormat and TableOutputFormat. HBase data tables can be used directly as the input and output of Hadoop MapReduce, which facilitates the development of MapReduce applications without requiring extra preprocessing of the HBase data.
By default HBase stores its data in HDFS, and it is possible to run HBase over other distributed file systems.
HBase uses HFile as the format for storing tables on HDFS. HFile is a block-indexed file format in which data is stored in a sequence of blocks, and a separate index is maintained at the end of the file to locate the blocks. When a read request comes, the index is searched for the block location and then the data is read from that block.
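As a sketch of the TableInputFormat integration mentioned above (an assumed example, not from these notes), the following map-only job counts the rows of a hypothetical "customers" table using the TableMapReduceUtil helper from the HBase MapReduce module.

```java
// Map-only MapReduce job reading an HBase table through TableInputFormat.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HBaseRowCounter {

    // The mapper receives (row key, Result) pairs supplied by TableInputFormat.
    public static class CountMapper extends TableMapper<Text, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
                throws IOException, InterruptedException {
            context.getCounter("hbase", "rows").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase row counter");
        job.setJarByClass(HBaseRowCounter.class);

        Scan scan = new Scan();
        scan.setCaching(500);          // read rows in batches for efficiency
        scan.setCacheBlocks(false);    // avoid polluting the region server block cache

        TableMapReduceUtil.initTableMapperJob(
                "customers", scan, CountMapper.class, Text.class, LongWritable.class, job);

        job.setNumReduceTasks(0);                         // map-only job
        job.setOutputFormatClass(NullOutputFormat.class); // results come from counters only
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```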
Features of HBase:
1.Consistency
2.Sharding
3.High availability
4.Client API
5.Support for IT operations
1.Consistency: HBase supports consistent read and write operations, which makes HBase suitable for high-speed requirements.
2.Sharding: HBase allows distribution of data using an underlying file system and supports transparent and automatic splitting and redistribution of content.
3.High availability: HBase implements region servers to ensure the recovery of LAN and WAN operations in case of a failure. The master server monitors the region servers and manages all the metadata for the cluster.
4.Client API: HBase offers a Java client API for programmatic access to the data stored in its tables.
5.Support for IT operations: HBase provides a set of built-in web pages for viewing detailed operational insights about the system.
Hive:
Hive is a datawarehouse system which is used to analyse structured data.It is built on the top of Hadoop.It was
developed by Facebook.Hive provides the functionality of reading,writing,and managing large datasets residing
in distributed storage.It uses a query language called"Hive query language" similar to SQL which internally
gets converted to Map Reduce jobs.
Features of Hive:
4.It supports user-defined functions (UDFs), through which users can plug in their own functionality.
Limitations of Hive:
Hive Architecture:
Figure: Hive architecture, with Hive clients and Hive services running on top of MapReduce and HDFS.
The architecture of Hive is built on HDFS and MapReduce and consists of two main components:
1.Hive client
2.Hive services
1.Hive client: Hive allows writing applications in various languages such as Java, Python, and C++.
2.Hive services: Hive provides services such as the Hive CLI (command-line interface), the Hive web user interface, the Hive metastore, the Hive server, and the Hive compiler.
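As a small, hedged example, a client application can submit HiveQL to a HiveServer2 instance through the standard Hive JDBC driver; the host, port, credentials, and the employees table below are placeholders, not part of the material above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port, and database are illustrative
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement()) {

            // HiveQL looks like SQL but is compiled into MapReduce jobs under the hood
            ResultSet rs = stmt.executeQuery(
                "SELECT dept, COUNT(*) FROM employees GROUP BY dept");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```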
Pig:
Hadoop defines Pig as a platform for analyzing large datasets; it consists of a high-level language for writing data analysis programs and the infrastructure for executing those programs.
Pig provides an interactive and script-based execution environment for non-developers through its language, called Pig Latin.
Pig Latin loads and processes input data using a series of operations and transforms that data to produce the desired output.
Features of Pig:
1. Ease of programming
2. Extensibility (Where user can write their logic to execute over the data set)
4.Built-in operations(sorting,filtering,joins)
Pig Architecture:
Figure: Pig architecture, with Pig Latin scripts submitted to Apache Pig (the Pig server), which runs over MapReduce and HDFS in Hadoop.
Pig architecture relies on underlying technologies like HDFS and MapReduce and consists of two components:
1. Apache Pig
2. Pig Latin scripts
1. Apache Pig: Apache Pig consists of several components, such as a parser (for checking the syntax of the script), a compiler, and an execution engine.
2. Pig Latin scripts: Pig scripts are submitted to the Pig execution environment to produce the desired results. Users can execute Pig scripts from a script file or as embedded scripts.
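Embedded scripts can be illustrated, for example, with Pig's PigServer Java class; the sketch below is an assumption-laden example (local execution mode, illustrative file names), not a prescribed setup.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // LOCAL mode runs against the local file system; MAPREDUCE mode runs on a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin: load, filter, and store -- the file names are illustrative
        pig.registerQuery("logs = LOAD 'access_log.txt' AS (line:chararray);");
        pig.registerQuery("errors = FILTER logs BY line MATCHES '.*ERROR.*';");
        pig.store("errors", "error_lines");   // writes the 'errors' relation to disk
    }
}
```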
Sqoop:
1.Sqoop is a tool used for transferring data between Hadoop and relational databases.
2.Sqoop is also a command-line interpreter, which sequentially executes Sqoop commands.
3.Sqoop can be used effectively by non-programmers and relies on underlying technologies like HDFS and MapReduce.
Features of Sqoop:
1.Sqoop can load the results of SQL queries into HDFS.
2.Sqoop can load the processed data directly into Hive or HBase.
3.Sqoop secures data transfers with the help of Kerberos authentication.
Working of Sqoop:
1.Import
2.Export
1.Sqoop import: The Sqoop import command imports a table from an RDBMS into Hadoop. Each row of the source table is imported as a separate record, and the records are stored as text files (or other supported formats) in HDFS.
2.Sqoop export: The Sqoop export command transfers data from the Hadoop file system back to an RDBMS. The data to be exported is processed into records before the operation is completed.
The export of data is done in two steps: the first is to examine the database for metadata, and the second involves the migration of the data.
Sqoop Processing: Sqoop processing takes place step by step as shown below.
3.It uses mappers to slice the incoming data into multiple chunks and loads the data into HDFS.
4.It exports data back into the RDBMS while ensuring that the schema of the data in the database is maintained.
Figure: Sqoop import and export, with map tasks moving data between an RDBMS (MySQL, Oracle, DB2) and HDFS storage.
ZooKeeper:
ZooKeeper is an open-source Apache project that provides a centralized service for configuration information, naming, synchronization, and group services over large clusters in distributed systems. ZooKeeper implements the necessary protocols on the cluster so that applications do not have to implement them on their own. It provides a single coherent view of multiple machines.
2.Managing the cluster: In ZooKeeper, the status of each node is maintained in real time.
3.Automatic failure recovery: ZooKeeper locks the data while it is being modified; if a failure occurs, this locking helps the cluster recover automatically.
Zookeeper Architecture: Zookeeper follows client/server architecture .All systems store a copy of data.
Leaders are elected at startup. The Architecture of Zookeeper consists of following components.
Figure: the ZooKeeper service, with a leader and follower servers handling client requests.
2.Client: A client is one of the nodes in the distributed application cluster. It accesses information from the server. Every client sends a message to the server at regular intervals, which helps the server know that the client is alive.
3.Leader: One of the servers is designated the leader. It gives information to the clients as well as an acknowledgement that the server is alive, and it performs automatic recovery if any of the connected nodes fail.
Client read requests are handled by the corresponding connected ZooKeeper server.
6.ZooKeeper WebUI: To work with ZooKeeper resource management, you can use the WebUI. It allows working with ZooKeeper through a web user interface instead of the command line and offers fast and effective communication with the ZooKeeper application.
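A minimal sketch of storing and reading a shared piece of configuration through the ZooKeeper Java client is shown below; the connection string, znode path, and data are illustrative assumptions.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (host/port are illustrative)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) {
                System.out.println("Event: " + event.getType());
            }
        });

        // Store a small piece of configuration data in a znode
        String path = "/config_demo";
        if (zk.exists(path, false) == null) {
            zk.create(path, "replicas=3".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any client in the cluster can now read the same coherent view
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```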
FLUME
Apache Flume aids in transferring large amounts of data from distributed resources to a single centralized
repository.Apache flume is a reliable and distributed system for collecting,aggregating and moving massive
quantities of log data.Apache Flume is used to collect log data present in log files from webservers and
aggregating it into HDFS for analysis.
Features of Flume:
1.Open source: Apache Flume is an open-source distributed system available free of cost.
3.Fault-Tolerant:It is fault tolerant and robust with multiple failovers and recovery mechanisms.
4.Dataflow: Flume carries data between sources and sinks . This gathering of data can be either scheduled or
event driven. Flume is used to transport event data including but not limited to network traffic data , data
generated by social media websites and e-mail messages.
Flume Architecture:
A Flume agent is a JVM(java virtual machine) process which has three components .
1.Flume source: It reads the data ,translates the events and handles the failure situations.
2.Flume channel : Channels are the communications bridges between sources and sinks within an agent.
3.Flume sink: The sink removes events from the channel and puts them into an external repository like HDFS.
Figure: data flow in Flume, from web servers through Flume sources, the Flume data channel, and Flume sinks into HDFS.
1.In the above diagram, the events generated by an external source (the web server) are consumed by the Flume data source. The external source sends events to the Flume source in a format that is recognized by the target source.
2.The Flume source receives an event and stores it into one or more channels. The channel acts as a store that keeps the event until it is consumed by the Flume sink. The channel may use a local file system in order to store these events.
3.The Flume sink removes the event from the channel and stores it into an external repository like HDFS. There could be multiple Flume agents, in which case the Flume sink forwards the event to the Flume source of the next Flume agent in the flow.
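This source-channel-sink decoupling can be pictured with a plain Java producer/consumer sketch. This is not Flume's API; it is only a conceptual illustration, under assumed names, of how a channel buffers events between a source and a sink.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SourceChannelSinkDemo {
    public static void main(String[] args) throws Exception {
        // The "channel": a bounded buffer that decouples the source from the sink
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(100);

        // The "source": produces events (here, fake log lines)
        Thread source = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    channel.put("log event " + i);   // blocks if the channel is full
                }
                channel.put("EOF");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // The "sink": drains events; a real sink would write them to HDFS
        Thread sink = new Thread(() -> {
            try {
                String event;
                while (!(event = channel.take()).equals("EOF")) {
                    System.out.println("sink stored: " + event);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        source.start();
        sink.start();
        source.join();
        sink.join();
    }
}
```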
OOZIE:
Apache Oozie is a workflow engine. It is a Java web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. Its main purpose is to manage the different types of jobs being processed in a Hadoop system, and it is used to run workflow jobs such as Hadoop MapReduce and Pig jobs.
Features of Oozie:
1.Oozie is scalable
2.It can manage the timely execution of thousands of workflow in Hadoop cluster.
5.Client can submit workflow definitions for immediate or for later execution.
Figure: Oozie workflows, defined in hPDL, running over Hadoop and HDFS.
Oozie client: An Apache Oozie client is a command-line utility that interacts with the Oozie server using the Oozie command-line tool or the Oozie Java client API.
Oozie server: The Apache Oozie server is a Java web application that runs in a Java servlet container.
1.Workflow engine: stores and runs workflows ,composed of different types of Hadoop jobs.
2.Coordinator engine:Runs workflow jobs based on predefined schedules and data availability.
Workflow is a DAG(directed acyclic graph) of action nodes and control flow nodes.
Action node: Performs a workflow task such as moving files in HDFS ,running a MapReduce , streaming ,
Pig/Hive job.
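Below is a rough sketch of submitting such a workflow programmatically with the Oozie Java client API; the server URL, HDFS paths, and the nameNode/jobTracker property names are placeholders for illustration, not values taken from this material.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the Oozie server (URL is illustrative)
        OozieClient client = new OozieClient("http://localhost:11000/oozie");

        // Job properties: the workflow definition (workflow.xml) lives in HDFS
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:9000/user/demo/workflow");
        conf.setProperty("nameNode", "hdfs://localhost:9000");
        conf.setProperty("jobTracker", "localhost:8032");

        // Submit and start the workflow job, then check its status
        String jobId = client.run(conf);
        System.out.println("Submitted workflow: " + jobId);
        System.out.println("Status: " + client.getJobInfo(jobId).getStatus());
    }
}
```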
MapReduce:
MapReduce is a programming model within the Hadoop framework that is used to access Big Data stored in the Hadoop file system (HDFS). MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks and processing them in parallel on Hadoop servers. At the end, it aggregates data from multiple servers to return a consolidated output back to the application.
MapReduce is a framework using which users can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
MapReduce is a processing technique and a program model for distributed computing based on java. The
MapReduce algorithm contains two important tasks namely,
1.Map
2.Reduce
1.Map: Map takes a set of data and converts it into another set of data where individual elements are broken
down into tuples(key/value pairs) .
2.Reduce: Reduce task takes the output from a Map as an input and combines those data tuples into a smaller
set of tuples.
MapReduce Algorithm:
1.Map stage
2.Reduce stage
1.Map stage: The Map or mapper's job is to process the input data. Generally, the input data is in the form of a file or a directory and is stored in HDFS. The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
2.Reduce stage: This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
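The classic word-count job illustrates these two stages. A sketch using the standard Hadoop Java MapReduce API is shown below; the input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: break each input line into (word, 1) key-value pairs
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce stage: sum the counts for each word after shuffle and sort
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```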
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result,and
sends it back to the Hadoop server.
Figure: the input is split across several Map() tasks, whose outputs are merged by Reduce() tasks to produce the final output.
MapReduce keeps all the processing operations separate for parallel execution. Problems that are extremely
large in size are divided into subtasks, which are chunks of data separated in manageable blocks .The
subtasks are executed independently from each other and then the results from all independent executions are
combined to provide a complete output.
1.Scheduling: MapReduce involves two operations, Map and Reduce, which are executed by dividing large problems into smaller chunks; these chunks are run in parallel by different computing resources. The operation of breaking tasks into subtasks and running these subtasks independently in parallel is called mapping. The mapping operation requires task prioritization based on the number of nodes in the cluster: if there are fewer nodes than tasks, the tasks are executed on a priority basis. The reduction operation cannot be performed until the entire mapping operation is completed. The reduction operation then merges the independent results on the basis of priority. Hence, the MapReduce programming model requires scheduling of tasks.
3.Data Locality: The effectiveness of data processing mechanism depends on the location of the code and the
data required for the code to execute. The best result is obtained when both the code and data reside on the
same machine i.e, the colocation of the code and data produces the most effective processing.
4.Handling of errors/faults: MapReduce engines usually provide a high level of fault tolerance and robustness in handling errors. The reason for building robustness into these engines is their high exposure to failures: there is a high chance of failures in the cluster nodes on which different parts of the program are running. Therefore, the MapReduce engine must have the capability of recognizing a fault and taking the required action to rectify it. The MapReduce engine design involves the ability to find out which tasks are incomplete and assign them to different nodes.
5.Scale-out architecture: MapReduce engines are built in such a way that they can accommodate more
machines as and when required. This possibility of introducing more computing resources to the architecture
makes the MapReduce programming model more suited to the higher computational demands of Bigdata.
Working of MapReduce:
3. Extract some interesting patterns to prepare an output list by using the Map function
4.Arrange the output list properly to enable optimisation for further processing
The MapReduce approach of analyzing data can be used by programmers for implementing various kinds of
applications .This algorithm can also work with extremely large datasets ranging from Gigabytes to Terabytes.
Figure: a MapReduce cluster with a master (the job tracker) and three slaves (task trackers).
The MapReduce framework shown in the above figure is a combination of a master and three slaves. The master monitors the entire job assigned to the MapReduce algorithm and is called the job tracker. The slaves are responsible for keeping track of individual tasks and are called task trackers. First, the given job is divided into a number of tasks by the master, i.e. the job tracker, which then distributes these tasks to the slaves.
It is the responsibility of the job tracker to keep an eye on the processing activities and to re-execute failed tasks. The slaves coordinate with the master by executing the tasks.
The job tracker receives jobs from client applications for processing large amounts of information. These jobs are assigned in the form of individual tasks to the various task trackers. The task distribution operation is handled by the job tracker. The data, after being processed by the task trackers, is transmitted to the reduce function so that the final, integrated output, which is an aggregate of the data processed by the map function, can be provided.
The logical flow of data in the MapReduce programming framework is shown in the following figure.
Input → Split → Map → Combine → Shuffle & Sort → Reduce → Output
Operations performed in the MapReduce model according to the data flow are as follows:
1.The input is provided from large data files in the form of key-value pair(KVP) which is the standard input
format in a Hadoop MapReduce programming model.
2.The input data is divided into small pieces, and master and slave nodes are created. The master node usually executes on the machines where the data is present, and the slaves are made to work remotely on the data.
3.The Map operation is performed on all the data pieces. The Map function extracts the relevant data and generates KVPs for it.
4.The output list generated by the Map function is passed to the Reduce function. The Reduce function sorts the data on the basis of the KVP list. The process of collecting the map output list from the map function and then sorting it according to the keys is known as shuffling.
5. The output is finally generated by the reduce function and the control is handed over to the user program
by the master.
An analysis of MapReduce program execution shows that it involves a series of steps, each with its own set of resource requirements. Aside from optimizing the actual application code, users can apply some optimization techniques to improve the reliability and performance of MapReduce jobs. They fall into three categories:
1.Hardware/Network Topology
2.Synchronization
3.File System
1.Hardware/Network Topology:MapReduce makes it possible for hardware to run the MapReduce tasks on
inexpensive clusters of computers. These computers can be connected through standard network.
The performance and fault tolerance required for Big Data operations are also influenced by the physical
location of servers.
The data center arranges the hardware in racks.The performance offered by hardware systems that are
located in the same rack where the data is stored will be higher than the performance of hardware systems
that are located in a different rack than the one containing the data.
The reason for the low performance of the hardware that is located away from the data is the requirement to
move the data and/or application code.
When the program is being executed we can configure the MapReduce engine to exploit the advantages of
its closeness to the data.
The best way to optimize the performance of MapReduce engine is to keep the application code and data
together.
Thus, latency in MapReduce processing is minimized by keeping the hardware elements close to each other.
2.Synchronization:
The completion of map processing enables the reduce function to combine the various outputs for providing
the final result.
However the performance will degrade if the results of mapping are contained within the same nodes where
the data processing began.
In order to improve the performance ,we should copy the results from mapping nodes to the reducing nodes
which will start their processing tasks immediately.
The synchronization mechanism copies the mapping results to the reducing nodes immediately after they have completed, so that the processing can begin right away.
All the values from the same key are sent to the same reducer,again ensuring higher performance and better
efficiency.
The reduction outputs are written directly to the file system so it must be designed and tuned for best
results.
3.File System: A Distributed file system is used to support the implementation of the MapReduce operation.
Distributed file systems are different from local file systems in capability of storing and arranging data.
The Big Data world has immense information to be processed and requires data to be distributed among a number of systems or nodes in a cluster so that the data can be handled efficiently.
The distribution model followed in implementing the MapReduce programming approach is to use the
master and slave model.
All the metadata and access rights, apart from the mapping, block, and file locations, are stored with the master.
The data on which the application code will run is kept with the slaves.
The master node receives all the requests,which are forwarded to appropriate slaves for performing the
required actions.
The following points should be considered while designing a file system that supports a MapReduce implementation:
1.Keep it warm: The master handles various operations, which may lead to it being overworked. In case of the failure of the master node, you will be unable to access the entire file system unless the master becomes active again. To optimize the file system, you can maintain a standby master node that remains warm and takes over the responsibility whenever the primary master fails.
2.The Bigger the Better:In Big Data environment files with a size less than 100MB are not preferred so you
need to avoid using them.The best results are obtained when the distributed file systems that are used with
MapReduce are loaded with a small number of large-sized files.
3.The long view: MapReduce handles the workload by processing large jobs in small data batches. Hence, MapReduce needs network bandwidth that remains available for a long time rather than quick execution times for mappers and reducers. To use the network optimally, the application code should send a long stream of data when it is reading from or writing to the file system.
4.Right degree of security: Increasing the number of security layers degrades the performance of distributed file systems. The permissions associated with file systems are meant to protect files from unauthorized access and damage. It is advisable to allow only authorized users to access the data and to protect the distributed file system without adding unnecessary layers.
Uses of MapReduce:
MapReduce is used to process various types of data obtained from various sectors.Some of the fields
benefited from MapReduce are :
1.Entertainment: Hadoop MapReduce assists end users in finding the most popular movies based on their preferences and previous viewing history. Various OTT services, including Netflix, regularly release many web series and movies; recommendation engines built on such analysis suggest which series or films a viewer is likely to watch next.
2.E-Commerce: Several e-commerce companies, including Amazon, Flipkart, and eBay, use MapReduce to evaluate consumer buying patterns based on customers' interests and historical purchasing behavior. Many e-commerce vendors use the MapReduce programming model to identify popular products based on customer preferences or purchasing behavior.
3.Social Media: Nearly 500 million tweets, or roughly 6,000 per second, are sent daily on the Twitter platform. MapReduce processes Twitter data, performing operations such as tokenization, filtering, counting, and aggregating counters.
4.Data Warehouse: Data warehouses are used to store huge volumes of information.Using MapReduce , we
can build a specialized business logic for data insights while analyzing huge data volumes in data warehouses.
5.Fraud Detection: Hadoop system is well suited for handling large volumes of data needed to create fraud
detection algorithms. Financial businesses including Banks,Insurance companies and payment locations use
Hadoop and MapReduce for fraud detection,pattern recognition evidence and business analytics through
transaction analysis.
HBase is an open-source, sorted map datastore built on Hadoop. It is column-oriented and horizontally scalable.
1.HBase is a distributed, column-oriented database built on top of the Hadoop file system.
2.HBase has a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It derives fault tolerance from HDFS.
3.It is a part of the Hadoop ecosystem that provides real-time read/write access to data in the Hadoop file system.
4.Users can store data in HDFS either directly or through HBase. Data consumers read and access the data in HDFS randomly using HBase, which sits on top of HDFS and provides read/write access.
Figure: HBase providing read/write access on top of HDFS.
Characteristics of HBase:
1.HBase helps the programmers to store large quantities of data in such a way that it can be accessed easily
and quickly as and when required.
2.HBase stores the data in a compressed format and thus occupies less memory space.
3.HBase has low latency time therefore beneficial for lookups and scanning of large amounts of data.
4.HBase saves data in cells in descending order with the help of timestamps; therefore, a read will always determine the most recent values first.
7.HBase is a platform for storing and retrieving data with random access.
8.Columns in HBase belong to a column family. The column-family name is used as a prefix for identifying the members of the family; for example, WagonR and i10 could be members of a cars column family.
9.A key is associated with each row in an HBase table. It can be a calculated value, a string, or any other data structure, and it is used for controlling the retrieval of data to the cells in the row.
10.In a column-oriented database, the data is saved and grouped by columns rather than rows. Columns can be added very easily and on a row-by-row basis, providing great flexibility, performance, and scalability.
Exploring the Bigdata stack: The first step in process of designing any data architecture is to create a model
that should give a complete view of all the required elements. Bigdata analysis also needs the creation of
model or architecture commonly known as Bigdata architecture. In order to create a Bigdata architecture
model you need to think of Bigdata strategy principles such as storing of data ,analytics, reporting or
applications .The Bigdata environment must include considerations for hardware ,infrastructure software,
operational software ,management software , application programming interface(API) and software
development tools. The Bigdata environment must fulfill all the foundational requirements and must be able
to perform the following functions:
4.analyzing data .
The following diagram shows the arrangement of various layers in the Big Data architecture (the Big Data stack):
Figure: the Big Data stack. A visualization layer (Hadoop administration, data analysts, IDE/SDK, visualization tools) sits above the analytics engines (statistical analytics, text analytics, search engine, real-time engine) and tools such as Pig, Hive, Sqoop, ZooKeeper, and MapReduce. Beneath them are the Hadoop storage layer (HDFS, NoSQL databases, data warehouses, analytics appliances) and the Hadoop infrastructure layer (racks of nodes), fed by relational databases and unstructured data, with the security and monitoring layers spanning the whole stack.
The Big Data architecture consists of the following layers and components:
1.Data sources layer
2.Ingestion layer
3.Storage layer
4.Physical infrastructure layer
5.Platform management layer
6.Security layer
7.Monitoring layer
8.Analytics engine
9.Visualization layer
Organizations generate a huge amount of data on a daily basis. The basic function of the data sources layer is to absorb and integrate the data coming from various sources, at varying velocities and in different formats. Before this data enters the Big Data stack, we have to differentiate between noise and relevant information.
The following figure shows different data sources from where the telecom industry obtains its data:
Figure: variety of data sources for the telecom industry, including switching-device data, image and video feeds from social networking sites, call data records, access-point data, messages (SMS and e-mail), transaction data, social networking site conversations, GPS data, and call-center voice and voice-to-text feeds.
The data obtained from the data sources in the above figure has to be validated and cleaned before being put to any logical use in the enterprise. The task of validating, sorting, and cleaning data is done by the ingestion layer. The removal of noise from the data also takes place in the ingestion layer.
Ingestion Layer:
The role of the ingestion layer is to absorb the huge inflow of data and sort it out in different
categories.This layer separates noise from relevant information.It can handle huge amount ,high
velocity and a variety of data.The ingestion layer validates,cleanses,transforms,reduces,and
integrates the unstructured data into the Big Data stack for further processing.
The following figure shows the functioning of the ingestion layer:
Figure: functioning of the ingestion layer. Data from the data sources passes through identification, filtration, validation, noise reduction, transformation, compression, and integration before landing in HDFS and NoSQL databases.
Functioning of the Ingestion Layer:In the ingestion layer,the data passes through the following steps:
1.Identification:At this stage, data is categorized into various known data formats, or we can say that
unstructured data is assigned with default formats.
2.Filtration: At this stage, the information relevant to the enterprise is filtered on the basis of the enterprise Master Data Management (MDM) repository.
3.Validation:The filtered data is analyzed against MDM metadata.
4.Noise reduction: At this stage, data is cleaned by removing the noise and minimizing related disturbances.
5.Transformation:At this stage,data is split or combined on the basis of its type,contents,and the
requirements of the organization.
6.Compression: At this stage, the size of the data is reduced without affecting its relevance for the required processes. It should be noted that compression does not affect the analysis results.
7.Integration:At this stage,the refined dataset is integrated with the Hadoop storage layer,which
consists of Hadoop Distributed File System(HDFS) and NoSQL databases.
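To make the flow concrete, here is a purely illustrative Java sketch of these stages operating on a few fake records. The record fields and rules are hypothetical and do not correspond to any particular ingestion framework.

```java
import java.util.List;
import java.util.stream.Collectors;

public class IngestionPipelineSketch {
    // A raw record arriving from some data source (hypothetical shape)
    record RawRecord(String source, String payload) {}

    public static void main(String[] args) {
        List<RawRecord> incoming = List.of(
            new RawRecord("sms", "TXN 450.00 INR"),
            new RawRecord("sensor", ""),                      // noise: empty payload
            new RawRecord("weblog", "GET /cart 200"));

        List<String> readyForStorage = incoming.stream()
            .filter(r -> r.source() != null)                  // identification/validation of the source
            .filter(r -> !r.payload().isBlank())              // filtration and noise reduction
            .map(r -> r.source() + "|" + r.payload().trim())  // transformation into a common format
            .collect(Collectors.toList());                    // integration would hand this list to HDFS/NoSQL

        readyForStorage.forEach(System.out::println);
    }
}
```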
Data ingestion in the Hadoop world typically means ELT (Extract, Load, and Transform), as opposed to the ETL (Extract, Transform, and Load) used in traditional warehouses.
Storage Layer
Hadoop is an open-source framework used to store large volumes of data in a distributed manner across multiple machines. The Hadoop storage layer supports fault tolerance and parallelization, which enable high-speed distributed processing algorithms to execute over large-scale data. There are two major components of Hadoop:
i.a scalable Hadoop Distributed File System (HDFS) that can support petabytes of data, and
ii.a MapReduce engine that computes results in batches.
HDFS is a file system that is used to store huge volumes of data across a large number of commodity machines in a cluster. The data can run to terabytes or petabytes. HDFS stores data in the form of blocks of files and follows the write-once, read-many model for accessing data from these blocks. The files stored in HDFS are operated upon by many complex programs as per the requirement.
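A minimal sketch of this write-once, read-many access through the Hadoop FileSystem Java API is shown below; the fs.defaultFS URI and the file path are placeholder assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS normally comes from core-site.xml; the URI below is illustrative
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/readings.txt");

        // Write once: HDFS files are written sequentially and then closed
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("patient42,heart_rate,72\n");
        }

        // Read many: any number of analysis jobs can now open the same blocks
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buffer = new byte[256];
            int n = in.read(buffer);
            System.out.println(new String(buffer, 0, n));
        }
        fs.close();
    }
}
```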
Consider an example of a hospital that used to perform a periodic review of the data obtained from
the sensors and machines attached to the patients as well as analyze the effects of various medicines
on them.The growing volume of data made it difficult for the hospital staff to store and handle it.To
find a solution the hospital consulted a data analyst who suggested the implementation of HDFS as
an answer to this problem. HDFS can be implemented in an organization at a comparatively lower cost
than other advanced technologies and can easily handle the continuous streaming of data.
Earlier we needed to have different types of databases, such as relational and non-relational, for
storing different types of data. All these types of data storage requirements can be addressed by a
single concept known as Not Only SQL(NoSQL) databases.Some examples of NoSQL databases include
HBase,MongoDB etc.
The following diagram shows different types of NoSQL databases such as
1.Graph-Based
2.Document-Based
3.Column-Oriented
4.Key-Value Pair used for storing different types of data.
Figure: types of NoSQL databases (key-value pair, column-oriented, document-based, and graph-based).
The lowest level of the stack is the physical infrastructure layer which includes hardware,network etc.Before
learning about the physical infrastructure layer you need to know about the principles on which Big Data
implementation is based.Some of these principles are:
2.Availability:The infrastructure setup must be available at all times to ensure nearly a 100 percent uptime
guarantee of service.It is obvious that businesses cannot wait incase of a service interruption or
failure;therefore an alternative of the main system must also be maintained.
3.Scalability: The Big Data infrastructure should be scalable to accommodate varying storage and computing requirements.
4.Flexibility:Flexible infrastructure facilitates adding more resources to the setup and promote failure
recovery.It should be noted that flexible infrastructure is also costly;however costs can be controlled with the
use of cloud services where you need to pay for what you actually use.
5.Cost: You must select the infrastructure that you can afford. This includes all the hardware, networking, and storage requirements. You must consider all the above parameters in the context of your overall budget and then make trade-offs where necessary.
From the above points it can be concluded that a robust and inexpensive physical infrastructure can be
implemented to handle Big Data. This requirement is addressed by the Hadoop physical infrastructure
layer.The Hadoop physical infrastructure layer also supports redundancy of data,because data is collected
from so many different sources.
In the Big Data environment networks should be redundant and capable of accommodating the anticipated
volume and velocity of the inbound and outbound data in case of heavy network traffic.Organizations that
plan to make Big Data an integral part of their computing strategy must be prepared for improving their
network performance to handle the increase in the volume,velocity and variety of data.
Hardware resources for storage and servers must also have sufficient speed and capacity to handle all types of Big Data. If slow servers are connected to high-speed networks, the slow performance of the servers will be of little use and can, at times, also become a bottleneck.
The role of the platform management layer is to provide tools and query languages for accessing NoSQL databases. This layer uses the HDFS storage file system, which lies on top of the Hadoop physical infrastructure layer. The following figure shows the interaction of the Hadoop platform management layer with its lower layers.
Figure: the Hadoop platform layer (MapReduce, Hive, Pig) interacting with the storage and infrastructure layers over a high-speed network.
Hadoop consists of two core components, HDFS and MapReduce, and different types of tools that work on the Hadoop framework to store, access, and analyze large amounts of data, including through real-time analysis.
The following are the key building blocks of the Hadoop platform management layer:
1.MapReduce:Refers to a technology that simplifies the creation of processes for analyzing huge amount of
unstructured and structured data.It is a combination of map and reduce features.Map is a component that
distributes multiple tasks across a large number of systems and also handles the task of distributing the load
for recovery management against failure. When the task of distributed computation is completed,the reduce
function combines all the elements back together to provide an aggregate result.
2.Hive:Refers to a data warehousing packages built over Hadoop architecture.Hive provides SQL type query
language called Hive Query Language for querying data stored in a Hadoop cluster.
3.Pig:Refers to a scripting language that is used for batch processing huge amounts of data and allows us to
process the data in HDFS in parallel.
4.HBase: Refers to a column-oriented database that provides fast access for handling Big Data. It is Hadoop-compliant and suitable for batch processing.
5.Sqoop:Refers to a command-line tool that can import individual tables,specific columns or entire database
files directly in the distributed file system or data warehouse.
6.ZooKeeper:Refers to a coordinator that keeps multiple Hadoop instances and nodes in synchronization and
provides protection to every node from failing because of the overload of data.
Security Layer
The security layer handles the basic security principles that Big Data architecture should follow.Big Data
projects are full of security issues because of using the distributed architecture, a simple programming model
and the open framework of services.Therefore the following security checks must be considered while
designing a Big Data stack:
1.Data access:User access to raw or computed big data has about the same level of technical requirements
as non-big data implementations.The data should be available only to those who have a valid business need
for examining or interacting with it.Most core data storage platforms have accurate security schemes and are
often augmented with a federated identity capability,providing appropriate access across the many layers of
the architecture.
2.Application access:Application access to data is also relatively straight forward from a technical
perspective. Most application programming interfaces(APIs) offer protection from unauthorized usage or
access. This level of protection is probably adequate for most big data implementations.
3.Data encryption: Data encryption is the most challenging aspect of security in a Big Data environment. In traditional environments, encrypting and decrypting data stresses the system's resources. With the volume, velocity, and variety associated with Big Data, this problem is worsened. The simplest approach is to provide more and faster computational capability, which can be done by accommodating resiliency requirements; a complementary approach is to identify the data elements requiring this level of security and to encrypt only the necessary items.
4.Threat detection: The inclusion of mobile devices and social networks exponentially increases both the amount of data and the opportunities for security threats. It is therefore important that organizations take multiple approaches to providing security.
Monitoring Layer:
The monitoring layer consists of a number of monitoring systems.These systems remain automatically aware
of all the configurations and functions of different operating systems and hardware.They also provide the
facility of machine communication with the help of a monitoring tool through high-level protocols such as
Extensible Markup Language(XML).Monitoring systems also provide tools for data storage and
visualization.Some examples of open source tools for monitoring Big Data stacks are Ganglia and Nagios.
Analytics Engine:
The role of an analytics engine is to analyze huge amounts of Unstructured data.This type of analysis is
related to text analytics and statistical analytics.Some examples of different types of unstructured data that
are available as large datasets include the following:
Visualization Layer:
The visualization layer handles the task of interpreting and visualizing Big Data.
Visualization of data is done by data analysts and scientists to look at different aspects of the data in various visual modes.
It can be described as viewing a piece of information from different perspectives, interpreting it in different manners, trying to fit it into different types of situations, and deriving different types of conclusions from it.
Role of Visualization:
The visualization layer works on top of the aggregated data stored in traditional operational data stores (ODS), data warehouses, and data marts. These ODS get the aggregated data through the data scoop, as shown in the following figure.
Some examples of visualization tools are Tableau, Spotfire, MapR, and Revolution R. These tools work on top of traditional components such as reports, dashboards, and queries.
Figure: visualization tools drawing on operational data stores, which receive aggregated data through the data scoop from the data warehouse and data lakes.
A simple DBMS stores data in the form of schemas and tables comprising rows and columns. The main goal of a DBMS is to provide a solution for storing and retrieving information in a convenient and efficient manner. The most common way of fetching data from these tables is by using SQL. An RDBMS stores the relationships between these tables in columns that serve as references to other tables. These columns are known as primary keys and foreign keys, which are used to refer to other tables so that data can be related between the tables and retrieved as and when it is required.
Such a database system usually consist of several tables and relationships between those tables which help in
classifying the information contained in them. Data in tables is stored in the form of rows and columns .The
size of a file increases as new data and records are added resulting in an increase in the size of the database.
All these files are commonly shared by several users through a database server.
A data warehouse is also used for handling large amount of data or big data. A data warehouse can be
defined as the association of data from various sources that are created for supporting planned decision
making.” A data warehouse is a subject oriented, integrated, time variant and non-volatile collection of data
In support of management’s decision making process”. The primary goal of a data warehouse is to provide a
consistent picture of the business at a given point of time. Using different data warehousing tools employees
in the organization can execute online queries and mine data according to their requirements.
RDBMS uses a relational model where all data is stored using schemas. These schemas are linked using the values in specific columns of each table. The data is highly structured, which means that for data to be stored or transacted it must satisfy the ACID standards, namely:
1.Atomicity: Ensures that a transaction is either completed entirely or not at all.
2.Consistency: Ensures that data conforms to the schema standards (tables), such as correct data-type entries, constraints, and keys.
3.Isolation: Refers to the encapsulation of information; makes only the necessary information visible.
4.Durability: Ensures that transactions stay valid even after a power failure or errors.
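The ACID guarantees are easiest to see in a transaction. Below is a hedged JDBC sketch of a funds transfer in which both updates commit together or neither does; the connection URL, credentials, and the accounts table are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class AcidTransferExample {
    public static void main(String[] args) throws Exception {
        // Connection details are illustrative; any RDBMS with a JDBC driver works
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/bank", "appuser", "secret")) {
            con.setAutoCommit(false);   // group the two updates into one transaction
            try (PreparedStatement debit = con.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = con.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {

                debit.setDouble(1, 100.0);  debit.setInt(2, 1);  debit.executeUpdate();
                credit.setDouble(1, 100.0); credit.setInt(2, 2); credit.executeUpdate();

                con.commit();      // durability: both updates persist together
            } catch (Exception e) {
                con.rollback();    // atomicity: neither update is applied on failure
                throw e;
            }
        }
    }
}
```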
In traditional database systems, every time the data is accessed or modified it must be moved to a central location for processing. An RDBMS assumes your data to be essentially correct beforehand. Mapping real-world problems to a relational database model involves several modeling stages, such as conceptual, logical, and physical modeling.
The internet has grown enormously in the past two decades. Numerous domains have been registered, and more than a billion gigabytes of web space has been reserved. With the digital revolution of the early 1990s driving the personal computer segment, web transactions have grown rapidly. With the invention of search engines, information has become easily and freely available, and its distribution across social media platforms has increased the volume of transactions.
Big Data mainly takes three V's into account: volume, variety, and velocity.
1.Volume of data: Bigdata is designed to store and process a few hundred Terabytes or even petabytes or
Zettabytes of data.
2.Variety of data: Collection of data different from a format suiting relational database systems is stored in a
semi structured or an unstructured format .
3.Velocity of data: The rate of data arrival might make an enterprise data warehouse problematic,
particularly where formal data preparation processes like confirming, examining, transforming and cleansing
of data needs to be accomplished before it is stored in data warehouse table.
Bigdata solutions:
Bigdata solutions are designed for storing and managing huge amounts of data using simple file structures, formats, and highly distributed storage mechanisms, with the initial handling of data occurring at each storage node, unlike in an RDBMS.
One of the biggest difficulties with RDBMS is that it is not yet near the demand levels of Big Data.The volume
of data handling today is rising at a fast rate.
Big Data primarily comprises semi-structured data such as social media analysis and text mining data, while
RDBMSs are more suitable for structured data ,such as weblog,sensor,and financial data.
Figure: execution of a query using a Big Data batch-processing solution. The query is executed against a Big Data cluster, where a cluster server controller uses cluster information to distribute work to the cluster servers and assemble the final output from the data.
The CAP theorem is also known as Brewer's theorem. It states that it is not possible for a distributed system to provide all of the following three guarantees at the same point of time:
1.Consistency
2.Availability
3.Partition tolerance
The CAP theorem is very helpful for decision making when choosing future databases.
1.Maintenance problem: The maintenance of the relational database becomes difficult over time due to the
increase in data . Developers and programmers have to spend lot of time maintaining the database.
2.Physical storage: A relational database comprises rows and columns, which require a lot of physical memory, because each operation performed depends on separate storage. The requirements for physical memory may increase along with the data.
3.Lack of scalability: When a relational database is used over multiple servers, its structure changes and becomes difficult to handle, especially when the quantity of data is very large. Because of this, the data is not easily scalable across different physical storage servers, and ultimately performance is affected, i.e., there is a lack of availability of data, longer load times, etc. As the database becomes larger or more distributed, with a greater number of servers, this has negative effects such as latency and availability issues affecting overall performance.
4.Complexity in structure: Relational databases can only store data in tabular format, which makes it difficult to represent complex relationships between objects. This is an issue because many applications require more than one table to store all the data needed by their application logic.
5.Decrease in performance over time: Relational databases can become slower, not just because of their reliance on multiple tables. When there is a large number of tables and a lot of data in the system, complexity increases. This can lead to slow response times for queries, or even complete failure, depending on how many people are logged into the server at a given time.
Non Relational Database:
A non-relational database is a database that does not use the tabular schema of rows and columns found in most traditional database systems, i.e., a database that does not use the table model of an RDBMS. Such a database requires effective data-operation techniques and processes, often custom-designed, to provide solutions to Big Data problems. NoSQL (Not only SQL) is one such example of a non-relational database. Most non-relational databases are associated with websites such as Google, Amazon, Yahoo, and Facebook. These websites introduce new applications almost every day with millions of users, so they require non-relational databases to handle unexpected traffic.
An important class of non relational databases is the one that does not support the relational model ,but
uses and relies on SQL as the primary means of manipulating existing data .
1.Scalability: It refers to the capability to write the data across multiple data clusters simultaneously,
irrespective of physical hardware or infrastructure limitation.
2.Seamlessness: Non relational databases have the capability to expand/contract and accommodate varying
degrees of increasing or decreasing data flow without affecting the enduser experience.
3.Data and query model: Instead of the traditional row/column, key-value structure, non-relational databases use frameworks to store data, with a required set of queries or APIs to access the data.
The non relational data model looks more like a concept of one entity and all the data that relates to that
entity is called a Document.
Non relational databases are less reliable than the relational databases because they compromise
reliability for performance.
Non relational databases also compromise consistency for performance unless manual support is
provided.
Non relational databases use different query languages than relational databases ,so users need to
learn new query languages.
In many non-relational databases, security is weaker than in relational databases, which is a major concern. For example, MongoDB and Cassandra have historically lacked encryption for data files and offered weak authentication and very simple authorization.
Polyglot Persistence:
The term polyglot in Big Data is applied to a set of applications that use several core database technologies. A polyglot database is often used to solve a complex problem by breaking it into fragments and applying a different database model to each. The results of the different sets are then aggregated into a data storage and analysis solution.
Polyglot persistence is an enterprise storage term used to describe choosing different data storage technologies to support the various data types and their storage needs.
For example, Disney, in addition to an RDBMS, also uses Cassandra and MongoDB, and Netflix uses Cassandra, HBase, and SimpleDB.
A lot of corporations still use relational databases for some data ,but the increasing persistence requirements
of dynamic applications are growing from predominantly relational to a mixture of data sources.
However, as with distributed or parallel computing, if the data is scattered over different database nodes, you can at least be sure that not all functionality would be hampered in the unlikely event of a failure.
Polyglot persistence is quite a common and popular paradigm among database professionals. An important reason is that users like applications that come studded with rich experiences, where they can easily search for what they are looking for without caring about the rest of the frills, such as finding friends, nearby restaurants, accurate information, and so on.
The following figure shows an example of a web application using different types of databases for handling complex user queries.
Figure: a web application backed by several different types of databases.
1. Data Availability: Data Availability is a well known challenge for any system related to transforming and
processing data for use by end users and bigdata is no different. The challenge is to sort and load the data
which is unstructured and in varied formats.Also ,context-sensitive data involving several different domains
may require another level of availability check.The data present in the Big Data hierarchy is not
updated,reprocessing new data containing updates will create duplicate data,and this needs to be handled to
minimize the impact on availability.
2.Pattern study: Pattern study is the centralization and localization of data according to demand. A global e-commerce website can centralize requests and fetch directives and results on the basis of end-user locations so as to return only meaningful contextual knowledge rather than imparting the entire data to the user. Trending topics are one of the pattern-based data study models and a popular mode of knowledge gathering for all platforms. The trending pattern to be found for a given regional location is matched for occurrence in the massive data stream.
3.Data incorporation and integration: Since no guidebook format or schema metadata exists, the data incorporation process for Big Data is about just acquiring the data and storing it as files. However, continuous data processing on a platform can create a conflict for resources over a given period of time, often leading to deadlocks. Especially in the case of big documents, images, or videos, if such requirements happen to be the sole architecture driver, a dedicated machine can be allocated for this task, bypassing the guesswork involved in the configuration and setup process.
4.Data Volumes and Exploration:Traffic spikes and volatile surge in data volumes can easily dislocate the
functional architecture of corporate infrastructure due to the fundamental nature of the data streams.On
each cycle of data acquisition completion,retention requirements for data can vary depending on the nature
and the freshness of the data and its core relevance to the business.Data exploration and mining is an activity
responsible for Big data procurements across organizations and also yields large data sets as processing
output.These data sets are required to be preserved in the system by occasional optimization of intermediary
data sets.
5.Compliance and localized legal requirements: Various compliance standards, such as Safe Harbor and PCI regulations, can have some impact on data security and storage; therefore, these standards should be judiciously planned for and executed. Moreover, there are several cases of transactional data sets that are not stored online being required by courts of law. Big Data infrastructure can be used as a storage engine for such data types, but the data needs to comply with certain standards and additional security.
6.Storage performance: Over the years, storage-based solutions did not advance as rapidly as their counterparts, processors and memory, did. Disk performance is a vital point to be taken care of while developing Big Data systems, and appliance architectures can shed better light on the storage class and layered architecture.
Bigdata Analysis and data warehouse: Big data is analyzed to know the present behavior or trends and make
future predictions.Various Bigdata solutions are used for extracting useful information from Bigdata. A simple
working definition of bigdata solution is that it is a technology that:
Bigdata analytics performed by a bigdata solution helps organizations in the following ways:
1.Brings improvement in the tourism industry by analyzing the behavior of tourists and their trends
Argon Technology wants a Big Data solution for analyzing the data of 100,000 employees across the world. Manually assessing the performance of each employee is a huge task for the administrative department before rewarding bonuses or increasing salaries. Argon Technology experts set up a data warehouse in which information related to each employee is stored. The administrative department extracts that information with the help of Argon's Big Data solution and easily analyzes it before providing benefits to an employee.
The complexity of the data warehouse environment has risen dramatically in recent years with the influx of data warehouse appliance architectures, NoSQL/Hadoop databases, and several API-based tools for many forms of cutting-edge analytics and real-time tasks. A Big Data solution is preferred because there is a lot of data that would otherwise have to be handled manually. In organizations handling Big Data, if the data is used to its potential it can provide much valuable information, leading to superior decision making, which in turn can lead to more profitability, revenue, and happier customers.
On comparing a data warehouse to a Big data solution, we find that a Big Data solution is a technology and
data warehousing is architecture.
2.In a typical data warehouse you will find a combination of flat files,relational database tables and non-
relational sources.
Big data is not a substitute for a data warehouse.Data warehouses work with abstracted data that has been
operated,filtered , and transformed into a separate database,using analytics such as sales trend analysis or
compliance reporting.That database is updated gradually with the same filtered data , either at a weekly or
monthly basis.
Organizations that use data warehousing technology will continue to do so and those that use both bigdata
and data warehousing are future-proof from any further technological advancements only up till the point
where the thin line of separation starts depleting.Conventional data warehouse systems are proven systems
with investments to the tune of millions of dollars for their development,those systems are not going
anywhere,soon.Regardless of how good and profitable bigdata analytics is or turns out,data warehousing will
still continue to provide crucial database support to many enterprises,and in all circumstances ,will complete
the lifecycle of current systems.
1.Scalability and speed: Deploying a hybrid Big Data platform supports parallel processing, optimized appliances and storage, workload management, and dynamic query optimization.
2.Agility and Elasticity:The hybrid model is agile,which means it is flexible and responds rapidly in case of
changing trends.It also provides elasticity which means this model can be increased or decreased as per the
demand of the user.
3.Affordability and Manageability:The hybrid environment will integrate flexible pricing,including licensed
software,custom designed appliances and cloud-based approaches for future proofing.
However, when organizations want to utilize the huge amount of information generated by Big Data sources, the limitations of the traditional models become more evident and apparent in real-time functioning. Hence, the appliance data warehouse has become a real-world method of producing an enhanced environment to support the transition to new information management.
5.Cloud Deployment:Big Data business applications with cloud-based approaches have an advantage that
other methods (such as software on commodity hardware) lack.The following are the advantages of cloud
deployment:
1.On-demand self-service: Enables the customer to provision cloud services as needed, with minimal interaction with the cloud service provider.
2.Broad network access: Allows Big Data cloud resources to be available over the network and accessible across different client platforms.
3.Multi-user: Allows cloud resources to be shared among multiple users while keeping their computations and data isolated from one another.
5.Measured service:Enables remote monitoring and billing of Big Data cloud resources.
NoSQL DATA MANAGEMENT
Introduction to NoSQL:
NoSQL is a non-relational database that differs from traditional relational database systems. NoSQL databases are designed for distributed data stores where there is a need for large-scale data storage. For example, Google and Facebook collect terabytes of data daily about their users. Such databases do not require a fixed schema.
NoSQL databases are still in the development stage and are going through a lot of changes. In Carlo Strozzi's original NoSQL database, tables were stored as ASCII files, with each tuple represented by a row and fields separated by tabs.
We are in the era of Big Data and are searching for ways of handling it. This gave rise to the need for schema-free databases that can handle large amounts of data. These databases are scalable, highly available, support replication, and are distributed and open source.
Before developing applications that can interact with NoSQL databases, we must understand the need for maintaining a separation between data management and data storage. NoSQL mainly focuses on high-performance, scalable data storage and provides low-level access to the data management layer. This allows data management tasks to be created easily. The following figure shows the interaction of layers in a NoSQL database.
Why NoSQL?
The concept of NoSQL databases became popular with internet giants like Google, Facebook and Amazon, who deal with huge volumes of data. The system response time becomes slow when an RDBMS is used for massive volumes of data.
To resolve this problem we could scale up our systems by upgrading the existing hardware, but this process is expensive. The alternative is to scale out, i.e., distribute the database load across multiple hosts, which is the approach NoSQL databases take.
Characteristics of NoSQL:
When the creators of NoSQL chose the name, they meant to communicate the fact that this is a class of databases that does not follow the principles of RDBMS.
Another important characteristic of NoSQL databases is that they are generally open-source projects, and the term NoSQL is frequently applied to this open-source phenomenon. Most NoSQL databases are also driven by the need to run on clusters.
This is particularly useful while dealing with non-uniform data and custom fields.
The following are the most common features that define a basic NoSQL database.
1.Non-Relational: Data is not stored in relational tables joined by keys.
2.Schema-free: Data can be stored without first defining a fixed schema.
3.Simple API:
Offers easy-to-use interfaces for storing and querying the data.
APIs allow low-level data manipulation and selection methods.
Web-enabled databases run as internet-facing services.
Text-based protocols are mostly used, typically over HTTP.
4.Distributed: Data is spread and replicated across multiple nodes of a cluster.
History of NoSQL:
In the early 1970s, flat file systems were used. Data was stored in flat files, and the biggest problem was the lack of standards, which made it very difficult to store and retrieve data.
Relational databases were then introduced by E.F. Codd. However, relational databases could not handle Big Data, and this limitation created the need for a new kind of database.
Later, a NoSQL database was developed by an engineer named Carlo Strozzi in 1998. NoSQL, which stands for "Not Only SQL", provides a mechanism for the storage and retrieval of data. At that time NoSQL did not provide an SQL interface but was still relational.
Later, when Johan Oskarsson organized an event on open-source distributed databases, Eric Evans reintroduced the term NoSQL in 2009.
Most NoSQL databases are non-relational, distributed and do not follow the ACID properties.
1998 - Carlo Strozzi uses the term NoSQL for his open-source relational database.
2000 - The graph database Neo4j is launched.
2004 - Google BigTable is launched.
2005 - CouchDB is launched.
2007 - The research paper on Amazon Dynamo is released.
2008 - Facebook open-sources the Cassandra project.
2009 - The term NoSQL is reintroduced.
Every data model has its own unique attributes and limitations, so users should select a database based on their product's needs. The main NoSQL data models are key-value, column-oriented, document and graph databases.
1.Key-Value Databases:
A key-value database stores data as a collection of key-value pairs, in which the key acts as a unique identifier used to store and retrieve the associated value.
Example:
Key Value
City Nizamabad
State Telangana
Country India
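To make the key-value model concrete, the short Python sketch below imitates the table above with an in-memory dictionary acting as a toy key-value store; the class name and methods are invented for illustration and do not belong to any particular NoSQL product.

# A toy in-memory key-value store illustrating the model above.
# Real key-value databases expose a similar put/get/delete interface
# over a distributed, persistent store.
class ToyKeyValueStore:
    def __init__(self):
        self._data = {}          # keys map directly to opaque values

    def put(self, key, value):
        self._data[key] = value  # insert or overwrite

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = ToyKeyValueStore()
store.put("City", "Nizamabad")
store.put("State", "Telangana")
store.put("Country", "India")

print(store.get("State"))        # -> Telangana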
Features:
Key-value stores offer very fast lookups by key, a simple put/get/delete interface and easy horizontal scaling across distributed clusters.
Redis, Riak and Amazon DynamoDB are commonly cited examples of key-value databases.
2.Column-Oriented Databases:
A column-oriented database is a non-relational database that stores data in columns instead of rows. This means that when we want to run analytics on a small number of columns, we can read those columns directly without consuming memory on unwanted data.
Columnar databases are designed to read data more efficiently and retrieve it with greater speed, and they are used to store large amounts of data.
The data is maintained in the form of column-specific files. In the column-oriented data model, the performance of aggregation queries such as COUNT, SUM, AVG, MIN and MAX is high.
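As a rough illustration of why aggregation is fast in this model, the Python sketch below (with made-up data and field names) lays out the same records row-wise and column-wise; computing SUM or AVG from the columnar layout touches only the single column that is needed.

# Same data laid out row-wise and column-wise.
rows = [
    {"id": 1, "name": "Anil",  "amount": 2500},
    {"id": 2, "name": "Bina",  "amount": 4000},
    {"id": 3, "name": "Chand", "amount": 1500},
]

# Column-oriented layout: one list ("column file") per attribute.
columns = {
    "id":     [1, 2, 3],
    "name":   ["Anil", "Bina", "Chand"],
    "amount": [2500, 4000, 1500],
}

# Row store: every record must be touched even though only "amount" is needed.
total_row_store = sum(r["amount"] for r in rows)

# Column store: only the "amount" column is read.
amount_col = columns["amount"]
total_col_store = sum(amount_col)
average = total_col_store / len(amount_col)

print(total_row_store, total_col_store, average)   # 8000 8000 2666.66...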
The pictorial representation of the column-oriented data model is shown in the figure:
Figure: Column-family structure of a column-oriented database.
Column-based NoSQL databases are widely used to manage data warehouses, business intelligence and CRM systems.
HBase and Cassandra are examples of column-based NoSQL databases.
1.Elastic scalability: It is highly scalable; more hardware can be added to accommodate more customers and more data as required.
2.Flexible data storage: It supports all possible data formats, such as structured, semi-structured and unstructured data.
3.Easy data distribution: It provides the flexibility to distribute data where it is needed by replicating data across multiple data centers.
3.Document Databases:
A document-based database is a non-relational database. Instead of storing data in rows and columns (tables), it uses documents to store the data. A document database stores data in JSON (JavaScript Object Notation), BSON or XML documents.
Documents can be stored and retrieved in a form that is much closer to the data objects used in applications, which means less translation is required to use the data in an application. Particular elements of a document can be accessed using the index value assigned to them, for faster querying. Documents can be grouped together into collections to form database systems.
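A minimal sketch of the document model in plain Python, where each document is a dictionary serialized as JSON; the collection, field names and the small index are purely illustrative and are not tied to any specific document database API.

import json

# A "collection" of JSON documents; note that the two documents
# do not share exactly the same schema.
orders = [
    {"_id": 1, "customer": "Anil", "items": ["pen", "book"], "total": 350},
    {"_id": 2, "customer": "Bina", "items": ["laptop"], "total": 45000,
     "coupon": "NEWYEAR"},                      # extra field, no schema change needed
]

# A simple index on _id for faster lookup, as described above.
index_by_id = {doc["_id"]: doc for doc in orders}

print(json.dumps(index_by_id[2], indent=2))      # fetch a document by its index value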
1.Flexible schema: Documents in the database have a flexible schema, meaning that documents in the same database need not share the same schema.
2.Faster creation and maintenance: Creating a document is easy, and minimal maintenance is required once it has been created.
3.No foreign keys: There is no dynamic relationship between two documents, so documents can be independent of one another; hence there is no requirement for a foreign key in a document database.
Examples: The document model is mostly used for blogging platforms, real-time analytics and e-commerce applications.
4.Graph Databases:
A graph is a collection of nodes and edges, where each node represents an entity and each edge describes a relationship between entities.
A graph-oriented database, or graph database, is a type of NoSQL database that uses graph theory to store, map and query relationships.
Nodes (or points) are instances or entities of data and represent any object to be tracked, such as people, accounts or locations.
Edges (or lines) represent the relationships between nodes. A connection has a direction and can be either unidirectional or bidirectional.
Properties represent descriptive information associated with nodes and edges.
Figure: An example graph in which two Person nodes are connected by a Friends edge, a Person is connected to a Restaurant by a Likes edge carrying rating and review properties, and the Restaurant is connected to a City by a Located In edge carrying an address property.
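The example graph in the figure can be sketched in plain Python as a dictionary of nodes and a list of edges with properties; this toy adjacency structure is only an illustration and is not the storage format used by real graph databases such as Neo4j.

# Nodes: each entity gets an id, a label and a set of properties.
nodes = {
    "p1": {"label": "Person",     "props": {"name": "Anil"}},
    "p2": {"label": "Person",     "props": {"name": "Bina"}},
    "r1": {"label": "Restaurant", "props": {"name": "Spice Garden"}},
    "c1": {"label": "City",       "props": {"name": "Nizamabad"}},
}

# Edges: (source, relationship, target, edge properties).
edges = [
    ("p1", "FRIENDS",    "p2", {}),
    ("p1", "LIKES",      "r1", {"rating": 4, "review": "Great biryani"}),
    ("r1", "LOCATED_IN", "c1", {"address": "Main Road"}),
]

# Traverse: which restaurants does p1 like, and in which city are they located?
for src, rel, dst, props in edges:
    if src == "p1" and rel == "LIKES":
        restaurant = nodes[dst]["props"]["name"]
        city = next(nodes[d]["props"]["name"]
                    for s, r, d, _ in edges
                    if s == dst and r == "LOCATED_IN")
        print(f"{nodes[src]['props']['name']} likes {restaurant} in {city}")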
Schema-Less Databases:
A database without any schema is known as a schema-less database. In a relational database, the user needs to define a schema or structure before storing data.
A well-defined database schema describes the tables, the columns and the data types of the values in each column of the database.
Storing data in NoSQL is easier than storing data in SQL. In a key-value database, you can store any type of data under a particular key. In a document database, there is no restriction on the type of document you want to store.
In a column database, you can store any type of data under any column according to the requirement.
In a graph database, there are no restrictions on adding edges or properties to nodes; you can add edges and define node properties simultaneously, as required.
In the case of a schema-defined database, you need to plan in advance for the collection of data to be stored in the database. No such planning is required for a schema-less database, and the data type can be changed at run time if required. If you find a new data item while storing data, you can add it easily. Similarly, you can remove items that are no longer required at run time, without worrying about existing related data as you would in a relational database.
A schema-less database also makes it easier to manage non-uniform information, where each record has a different set of fields.
A schema-less database permits each record to contain exactly what it needs, and it eliminates many of the constraints found in fixed-schema databases.
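The run-time flexibility described above can be shown with a small Python sketch in which each record carries only the fields it needs; all field names here are made up for the example.

# A schema-less "table": each record contains exactly the fields it needs.
customers = [
    {"name": "Anil", "city": "Nizamabad"},
    {"name": "Bina", "email": "bina@example.com", "loyalty_points": 120},
]

# A new data item discovered at run time can simply be added...
customers[0]["phone"] = "9876543210"

# ...and an item that is no longer needed can be removed without
# touching any other record or any global schema definition.
del customers[1]["loyalty_points"]

for record in customers:
    print(record)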
Materialized Views:
The priority for developers and data administrators in data storage is how the data is stored and how it is read. The chosen storage format is usually closely related to the format of the data, the requirements for managing data size and data integrity, and the kind of store in use.
For example, when using the document model, data is often represented as a series of aggregates, each of which contains all of the information about an entity.
However, this can have a negative effect on queries. When a query requires only a subset of the data from some entities, such as a summary of the orders of several customers without all the order details, it still has to extract all of the data for the relevant entities in order to obtain the required information.
To support efficient querying, a common solution is to generate a view that contains the data in the format most suited to producing the required result set.
This view is known as a "materialized view". The materialized view pattern generates pre-populated views of data in environments where the source data is not in a format suitable for querying.
These materialized views, which contain only the data required by a query, allow applications to quickly obtain the information they need. In addition to joining tables or combining data entities, materialized views may include the current values of calculated columns or data items, the results of combining values or executing transformations on data items, and values specified as part of the query.
A materialized view and the data it contains are completely disposable because they can be entirely rebuilt from the source data stores. A materialized view is never updated directly by an application, so it is effectively a specialized cache.
When the source data changes, the view must be updated to include the new information. This may occur automatically on an appropriate schedule, or when the system detects a change in the original data. In other cases the view may have to be regenerated manually.
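As a minimal sketch of the materialized view pattern (with invented order data and an assumed per-customer summary), the Python code below keeps full order documents as the source store and rebuilds a pre-computed summary view from them; because the view is derived entirely from the source, it is disposable and acts as a specialized cache.

from collections import defaultdict

# Source data store: full order details (the format that is awkward to query).
orders = [
    {"order_id": 1, "customer": "Anil", "total": 350,   "lines": ["pen", "book"]},
    {"order_id": 2, "customer": "Bina", "total": 45000, "lines": ["laptop"]},
    {"order_id": 3, "customer": "Anil", "total": 1200,  "lines": ["bag"]},
]

def rebuild_order_summary(source):
    """Rebuild the materialized view entirely from the source store."""
    view = defaultdict(lambda: {"order_count": 0, "total_spent": 0})
    for order in source:
        summary = view[order["customer"]]
        summary["order_count"] += 1
        summary["total_spent"] += order["total"]
    return dict(view)

# The view acts as a specialized cache; queries read it directly.
order_summary = rebuild_order_summary(orders)
print(order_summary["Anil"])      # {'order_count': 2, 'total_spent': 1550}

# When the source changes, the view is refreshed (here, manually).
orders.append({"order_id": 4, "customer": "Bina", "total": 800, "lines": ["mouse"]})
order_summary = rebuild_order_summary(orders)
print(order_summary["Bina"])      # {'order_count': 2, 'total_spent': 45800}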
CAP Theorem:
In the case of distributed databases, the three important aspects of the CAP theorem are:
1. Consistency (C)
2. Availability (A)
3. Partition-tolerance (P)
The CAP theorem, formulated by Eric Brewer, a professor of computer science at the University of California, Berkeley, states that in any distributed system we can guarantee only two of these three aspects at a time:
"Though it's desirable to have Consistency, Availability and Partition-tolerance in every system, unfortunately no system can achieve all three at the same time."
In the context of the CAP theorem, three numbers are commonly considered:
The first refers to the number of nodes that should respond to a read request before it is considered a successful operation.
The second is the number of nodes that should respond to a write request before it is considered a successful operation.
The third is the number of nodes to which the data is replicated or copied.
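These three values are often written as R (read quorum), W (write quorum) and N (replication factor). A widely used rule of thumb, not stated explicitly above and sketched here only as an assumption, is that choosing R + W > N makes every read overlap with the most recent successful write, giving stronger consistency at some cost to availability. The Python sketch below (with an invented helper name) checks these trade-offs.

def quorum_properties(n, r, w):
    """Report consistency/availability trade-offs for replication factor n,
    read quorum r and write quorum w (a common rule of thumb, assumed here)."""
    return {
        "strongly_consistent": r + w > n,   # every read set intersects every write set
        "write_tolerates_failures": n - w,  # replicas that may be down during a write
        "read_tolerates_failures": n - r,   # replicas that may be down during a read
    }

# Typical settings for a 3-replica cluster.
print(quorum_properties(n=3, r=2, w=2))  # strongly consistent, tolerates 1 failure
print(quorum_properties(n=3, r=1, w=1))  # favours availability; reads may be stale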
1. Consistency: Implies that the data in the database remains consistent even after the execution of an operation.
For example, after an insert operation all clients will see the same data.
2. Availability: Implies that the system is always available, i.e., there is no downtime.
3. Partition-tolerance: Implies that the system continues to work even though the communication among the servers is unreliable.
Generally it is impossible to satisfy all three requirements; therefore, according to the CAP theorem, only two of the three requirements can be met at a time.
The three possible combinations are CA, CP and AP, as shown in the figure:
1. CA: Implies a single-site cluster in which all nodes always communicate with each other.
2. CP: Implies that all available data will be consistent and accurate, although some data may not be available.
3. AP: Implies that the system remains available, but some of the data returned may be inconsistent.
ACID Property:
ACID transactions guarantee four properties: Atomicity, Consistency, Isolation and Durability.
Atomicity: Either all the operations in a transaction complete or none of them do. If any part of the transaction fails, the entire transaction fails.
Consistency: A transaction must leave the database in a consistent state. This ensures that any completed transaction moves the database from one valid state to another valid state.
Isolation: Concurrently executing transactions do not interfere with one another; each transaction behaves as if it were running alone.
Durability: Once a transaction completes successfully, its changes are permanent and will not be reversed.
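To illustrate atomicity in practice, the sketch below uses SQLite from Python's standard library (chosen only because it is self-contained, not because the text above refers to it); the accounts table and values are invented. The two UPDATE statements run inside one transaction, so either both take effect or, if an error occurs, neither does.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("Anil", 1000), ("Bina", 500)])
conn.commit()

try:
    with conn:  # one transaction: committed on success, rolled back on any error
        conn.execute("UPDATE accounts SET balance = balance - 300 WHERE name = 'Anil'")
        conn.execute("UPDATE accounts SET balance = balance + 300 WHERE name = 'Bina'")
        # If an exception were raised here, neither UPDATE would persist (atomicity).
except sqlite3.Error:
    pass  # the database is left in its previous consistent state

print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'Anil': 700, 'Bina': 800}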
Figure: The CAP triangle. RDBMS systems fall on the CA side (Consistency and Availability), MongoDB on the CP side (Consistency and Partition-tolerance) and Cassandra on the AP side (Availability and Partition-tolerance).