CCS334 BIG DATA ANALYTICS - Notes - Fullsyllabus
Regulation 2021
Data Variety:
It is the assortment of data. Traditionally data, especially operational data, is
"structured" as it is put into a database based on the type of data (i.e., character, numeric,
floating point, etc.).
Wide variety of data:
Internet data (social media, social networks such as Twitter and Facebook), primary research
(surveys, experiments, observations), secondary research (competitive and marketplace data,
industry reports, consumer data, business data), location data (mobile device data,
geospatial data), image data (video, satellite images, surveillance), supply chain data
(vendor catalogs, pricing, etc.), and device data (sensor data, RF devices, telemetry).
Structured Data
They have a predefined data model and fit into a relational database. In particular,
operational data is "structured" as it is put into a database based on the type of data (i.e.,
character, numeric, floating point, etc.).
Semi-structured data
These are data that do not fit into a formal structure of data models. Semi-structured
data is often a combination of different types of data that has some pattern or structure that is
not as strictly defined as structured data. Semi-structured data contains tags that separate
semantic elements and provide the capability to enforce hierarchies within the data.
Unstructured data
Do not have a predefined data model and/or do not fit into a relational database.
Oftentimes, text, audio, video, image, geospatial, and Internet data (including click streams
and log files) are considered unstructured data.
Data Velocity
Data velocity is about the speed at which data is created, accumulated, ingested, and
processed. The increasing pace of the world has put demands on businesses to process
information in real time or with near real-time responses. This may mean that data is
processed on the fly, or while "streaming" by, to make quick real-time decisions, or it may be
that monthly batch processes are run intra-day to produce more timely decisions.
Why bother about Unstructured data?
- The amount of data (all data, everywhere) is doubling every two years.
- Our world is becoming more transparent. Everyone is accepting this and people
don’t mind parting with data that is considered sacred and private.
- Most new data is unstructured. Specifically, unstructured data represents almost
95 percent of new data, while structured data represents only 5 percent.
- Unstructured data tends to grow exponentially, unlike structured data, which tends
to grow in a more linear fashion.
- Unstructured data is vastly underutilized.
Need to learn how to:
- Use Big data
- Capitalize on new technology capabilities and leverage existing technology assets.
- Enable appropriate organizational change.
- Deliver fast and superior results.
Advantage of Big data Business Models:
Improve Operational Efficiencies: reduce risks and costs; save time; lower complexity; enable self-service.
Increase Revenues: sell to microtrends; enable self-service; improve customer experience; detect fraud.
Achieve Competitive Differentiation: offer new services; seize market share; incubate new ventures.
Web Analytics
Web analytics is the measurement, collection, analysis and reporting of web data for
purposes of understanding and optimizing web usage. Web analytics is not just a tool
for measuring web traffic but can be used as a tool for business and market research,
and to assess and improve the effectiveness of a web site. The following are some of
the web analytics metrics: Hit, Page View, Visit/Session, First Visit/First Session,
Repeat Visitor, New Visitor, Bounce Rate, Exit Rate, Page Time Viewed/Page Visibility
Time/Page View Duration, Session Duration/Visit Duration, Average Page View Duration,
and Click Path, etc.
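As a small illustration of how some of these metrics are derived, the following Python sketch
computes page views per session, bounce rate, and average session duration from raw page-view
events. The event data and field layout below are invented for illustration and are not tied to
any particular analytics tool.

from collections import defaultdict

# Hypothetical raw page-view events: (session_id, timestamp_in_seconds, page_url)
events = [
    ("s1", 0,   "/home"),
    ("s1", 40,  "/products"),
    ("s1", 95,  "/checkout"),
    ("s2", 10,  "/home"),            # single-page session -> counts as a bounce
    ("s3", 5,   "/blog"),
    ("s3", 300, "/home"),
]

sessions = defaultdict(list)
for session_id, ts, page in events:
    sessions[session_id].append(ts)

total_sessions = len(sessions)
total_page_views = len(events)
bounces = sum(1 for ts_list in sessions.values() if len(ts_list) == 1)
durations = [max(ts) - min(ts) for ts in sessions.values() if len(ts) > 1]

print("Page views per session:", total_page_views / total_sessions)
print("Bounce rate:", bounces / total_sessions)
print("Average session duration (s):", sum(durations) / len(durations))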
Why use big data tools to analyse web analytics data?
Web event data is incredibly valuable
• It tells you how your customers actually behave (in lots of detail), and how
that varies
• Between different customers
• For the same customers over time. (Seasonality, progress in customer
journey)
• How behaviour drives value
• It tells you how customers engage with you via your website / webapp
• How that varies by different versions of your product
• How improvements to your product drive increased customer
satisfaction and lifetime value
• It tells you how customers and prospective customers engage with your
different marketing campaigns and how that drives subsequent behaviour
Deriving value from web analytics data often involves very personalized
analytics
• The web is a rich and varied space!
E.g.
• Bank
• Newspaper
• Social network
• Analytics application
• Government organisation (e.g. tax office)
• Retailer
• Marketplace
• For each type of business you'd expect different:
• Types of events, with different types of associated data
• Ecosystem of customers / partners with different types of relationships
• Product development cycle (and approach to product development)
• Types of business questions / priorities to inform how the data is
analysed
Web analytics tools are good at delivering the standard reports that are
common across different business types.
• Where does your traffic come from e.g.
• Sessions by marketing campaign / referrer
• Sessions by landing page
4
• Understanding events common across business types (page views,
transactions, 'goals'), e.g.
• Page views per session
• Page views per web page
• Conversion rate by traffic source
• Transaction value by traffic source
• Capturing contextual data common to people browsing the web
• Timestamps
• Referer data
• Web page data (e.g. page title, URL)
• Browser data (e.g. type, plugins, language)
• Operating system (e.g. type, timezone)
• Hardware (e.g. mobile / tablet / desktop, screen resolution, colour
depth)
• What is the impact of different ad campaigns and creative on the way users
behave, subsequently? What is the return on ad spend?
• How do visitors use social channels (Facebook / Twitter) to interact around
video content? How can we predict which content will “go viral”?
• How do updates to our product change the “stickiness” of our service?
I) Digital Marketing
III) Big data and Advances in health care
Database Marketers, Pioneers of Big Data
Database marketing is concerned with building databases containing info
about individuals, using that information to better understand those individuals and
communicating effectively with some of those individuals to drive business value.
Marketing databases are typically used for
i) Customer acquisition
ii) Retaining and cross-selling to existing customers which reactivates the cycle
As companies grew and systems proliferated, they ended up with silos: one system
for one product, another system for another product, and so on. Companies then began
developing technologies to manage and consolidate data from multiple sources, and
started developing software that could eliminate duplicate customer information
(de-duping). This enabled them to extract customer information from siloed product
systems, merge the information into a single database, remove all the duplicates, and
then send direct mail to subsets of the customers in the database. Companies such as
Reader's Digest and several other firms were early champions of this new kind of
marketing, and they used it very effectively. By the 1980s, marketers had developed the
ability to run reports on the information in their databases, which gave them better and
deeper insights into the buying habits and preferences of customers. Telemarketing became
popular when marketers figured out how to feed information extracted from customer
databases to call centers. In the 1990s, email entered the picture and marketers saw
opportunities to reach customers via the Internet and the WWW. In the past five years there
has been exponential growth in database marketing, and the new scale is pushing up
against the limits of technology.
Big Data & New School of Marketing
New-school marketers deliver what today's consumers want, i.e., relevant,
interactive communication across the digital power channels.
Digital power channels: email, mobile, social, display, and web.
Consumers have changed so must marketers
Social & Affiliate Marketing or Pay for Performance Marketing on the Internet
The concept of affiliate marketing, or pay for performance marketing on the Internet
is often credited to William J. Tobin, the founder of PC Flowers & Gifts.
Amazon.com launched its own affiliate program in 1996, and middleman affiliate
networks like LinkShare and Commission Junction emerged preceding the 1990s
Internet boom, providing the tools and technology to allow any brand to put affiliate
marketing practices to use. Today, most of the major brands have a thriving affiliate
program. Today, industry analysts estimate affiliate marketing to be a $3 billion
industry. It’s an industry that largely goes anonymous. Unlike email and banner
advertising, affiliate marketing is a behind the scenes channel most consumers are
unaware of.
In 2012, the emergence of the social web brings these concepts together. What only
professional affiliate marketers could do prior to Facebook, Twitter, and Tumblr, now
any consumer with a mouse can do. Couponmountain.com and other well known
affiliate sites generate multimillion dollar yearly revenues for driving transactions for
the merchants they promote. The expertise required to build, host, and run a business
like Couponmountain.com is no longer needed when a consumer with zero technical
or business background can now publish the same content simply by clicking "Update
Status" or "Tweet." The barriers to entering the affiliate marketing industry as an
affiliate no longer exist.
Empowering Marketing with Social intelligence
As a result of the growing popularity and use of social media around the world
and across nearly every demographic, the amount of user-generated content—or "big
data"—created is immense and continues growing exponentially. Millions of status
updates, blog posts, photographs, and videos are shared every second. Successful
organizations will not only need to identify the information relevant to their company
and products—but also be able to dissect it, make sense of it, and respond to it—in
real time and on a continuous basis, drawing business intelligence—or insights—that
help predict likely future customer behavior. Very intelligent software is required to
parse all that social data to define things like the sentiment of a post.
Marketers now have the opportunity to mine social conversations for purchase intent
and brand lift through Big Data. So, marketers can communicate with consumers
regardless of the channel. Since this data is captured in real time, Big Data is forcing
marketing organizations to quickly optimize media and messages. And since this data
provides details on all aspects of consumer behavior, companies are eliminating silos
within the organization so that they can act consistently across channels, across media,
and across the path to purchase.
- This fraud detection system uses an open source search server based on Apache
Lucene. It can be used to search all kinds of documents in near real time. The tool is
used to index new transactions which are sourced in real-time, which allows analytics
to run in a distributed fashion utilizing the data specific to the index. Using this tool,
large historical data sets can be used in conjunction with real-time data to identify
deviation from typical payment patterns. The big data component allows overall
historical patterns to be compared and contrasted and allows the number of attributes
and characteristics about consumer behavior to be very wide with little impact on
overall performance.
- Percolator performs the function of identifying new transactions that have raised
profiles. Percolator can handle both structured and unstructured data. This provides
scalability to the event processing framework and allows specific suspicious
transactions to be enriched with additional unstructured information (E.g. Phone
location/geospatial records, customer travel schedules and so on). This ability to
enrich the transaction further can reduce false positives and improve the customer
experience while redirecting fraud efforts to actual instances of suspicious activity.
- Capgemini’s fraud Big Data initiative focuses on flagging the suspicious credit card
transactions to prevent fraud in near real-time via multi-attribute monitoring. Real-
time inputs involving transaction data and customers records are monitored via
validity checks and detection rules. Pattern recognition is performed against the data
to score and weight individual transactions across each of the rules and scoring
dimensions. A cumulative score is then calculated for each transaction record and
compared against thresholds to decide if the transaction is suspicious or not.
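The following Python sketch illustrates the multi-attribute scoring and thresholding idea
described above. The rules, weights, threshold, and transaction fields are invented for
illustration; a real system would evaluate many more attributes against indexed historical
patterns in near real-time.

# Hypothetical transaction record; the field names are assumptions for illustration.
txn = {"amount": 4200.0, "country": "RO", "home_country": "IN",
       "hour": 3, "merchant_category": "electronics"}

# Each detection rule contributes a weighted score when it fires.
rules = [
    (lambda t: t["amount"] > 3000,                       40),  # unusually large amount
    (lambda t: t["country"] != t["home_country"],        30),  # transaction abroad
    (lambda t: t["hour"] < 5,                            20),  # odd hour of day
    (lambda t: t["merchant_category"] == "electronics",  10),
]

# Cumulative score across all rules, compared against a threshold.
score = sum(weight for rule, weight in rules if rule(txn))
SUSPICIOUS_THRESHOLD = 70
print("score =", score, "-> suspicious" if score >= SUSPICIOUS_THRESHOLD else "-> ok")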
Social Network Analysis (SNA)
- This is another approach to solving fraud with Big data.
- SNA views social relationships as networks of nodes and ties and draws inferences from the structure of those relationships.
- SNA could reveal all individuals involved in fraudulent activity from perpetrators to
their associates and understand their relationships and behavior to identify a bust out
fraud case. Bust out is a hybrid credit and fraud problem, and the scheme is typically
defined by a pattern of behavior such as building up a good credit record and then rapidly
running up balances with no intention of repaying.
- There are some Big Data solutions in the market like SAS’s SNA solution, which
helps institutions and goes beyond individual and account views to analyze all related
activities and relationships at a network dimension. The network dimension allows
visualization of social networks and helps to see hidden connections and relationships,
which could be a group of fraudsters. There are huge amounts of data involved behind
the scene, but the key to SNA solutions like SAS’s is the visualization techniques for
users to easily engage and take action.
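As a minimal illustration of the network view (using the open-source networkx library rather
than the SAS product mentioned above; the accounts and links are invented), once relationships
are modeled as a graph, every account connected to a known fraudster can be pulled out in a
couple of lines:

import networkx as nx

# Hypothetical relationship data: shared phone numbers, addresses, devices, etc.
G = nx.Graph()
G.add_edges_from([
    ("acct_A", "acct_B"),   # shared address
    ("acct_B", "acct_C"),   # shared phone number
    ("acct_C", "acct_D"),   # shared device
    ("acct_X", "acct_Y"),   # unrelated pair
])

known_fraudster = "acct_B"
# Everyone connected (directly or indirectly) to the known fraudster.
ring = nx.node_connected_component(G, known_fraudster)
print("Accounts to review:", sorted(ring))   # acct_A..acct_D, but not acct_X / acct_Y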
- Social media and cell phone usage data are opening up new opportunities to
analyze customer behavior that can be used for credit decisioning.
- As Figure illustrates, there are four critical parts of the typical credit risk
framework: planning, customer acquisition, account management, and collections.
All four parts are handled in unique ways through the use of Big Data.
Disruptive analytics
- Data science and disruptive analytics can have an immediate beneficial impact on
healthcare systems.
- Data analytics makes it possible to create a transparent approach to pharmaceutical
decision making based on the aggregation and analysis of healthcare data such as
electronic medical records and insurance claims data.
- Creating a healthcare analytics framework has significant value for individual
stakeholders.
- For providers (physicians), there is an opportunity to build analytics systems for
evidence-based medicine (EBM), sifting through clinical and health outcomes data
to determine the clinical protocols that provide the best health outcomes for
patients and create defined standards of care.
- For producers (pharmaceutical and medical device companies), there is an
opportunity to build analytics systems to enable translational medicine,
integrating externally generated post-marketing safety, epidemiology and health
outcomes data with internally generated clinical and discovery data (sequencing,
expression, biomarkers) to enable improved strategic R&D decision making
across the pharmaceutical value chain.
- For payers (i.e., insurance companies), there is an opportunity to create analytics
systems to enable comparative effectiveness research (CER) that will be used to
drive reimbursement by mining large collections of claims, health care
records (EMR/EHR), and economic, geographic and demographic data sets to
determine what treatments and therapies work best for which patients, in which
context, and with what overall economic and outcomes benefit.
A Holistic Value Proposition
− The ability to collect, integrate, analyze and manage data can make health care
data, such as EHR/EMR, valuable.
− A big data approach to analyzing health care data creates methods and a platform for
analysis of large volumes of disparate kinds of data (clinical, EMR, claims, labs,
etc.) to better answer questions of outcomes, epidemiology, safety, effectiveness
and pharmacoeconomic benefit.
− Big data technologies and platforms such as Hadoop, R, open health data, etc. help
clients create real-world evidence-based approaches to realize solutions for
comparative effectiveness research, improve outcomes in complex populations and
improve decision making.
BI is not Data Science
− Traditional Business Intelligence and data warehousing skills do not help in
predictive analytics. Traditional BI is like a lawyer who draws a conclusion and then
looks for supporting evidence: it is declarative and does not necessarily require
any real domain understanding. Generating automated reports from aging data
warehouses that are briefly scanned by senior management does not meet the
definition of data science.
− Making data science useful to a business is about identifying the question that
management is really trying to answer.
IV) Pioneering New Frontiers in Medicine
− In the medical field, Big Data analytics is being used by researchers to understand
autoimmune diseases such as rheumatoid arthritis, diabetes and lupus, and
neurodegenerative diseases such as multiple sclerosis, Parkinson's and Alzheimer's. In
most of these cases, the goal is to identify the genetic variations that cause the
diseases. The data sets used for such identification contain thousands of genes. For
example, research work on the role of environmental factors, and interactions
between environmental factors, in multiple sclerosis typically uses data sets that
contain 100,000 to 500,000 genetic variations. The algorithms used identify the
interactions between environmental factors and diseases. They also have rapid
search techniques built into them and should be able to do statistical analysis and
permutation analysis, which can be very, very time consuming if not properly
done.
Challenges faced by pioneers of quantitative pharmacology
− The data set is very large (e.g., a 1000 by 2000 matrix).
− When an interaction analysis for first-order and second-order interactions is done,
each of the 500,000 genetic locations has to be compared to each of the remaining
500,000 genetic locations for the first order, and the search space grows
combinatorially for higher orders. Basically, the number of second-order interactions
is on the order of 500,000 squared, third-order interactions on the order of 500,000
cubed, and so on. Such huge computations are made possible in
little time with the aid of big data technologies.
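To see why these searches are so expensive, the short Python calculation below counts the
pairwise (second-order) and three-way (third-order) comparisons implied by 500,000 genetic
locations:

from math import comb

n = 500_000  # genetic locations

pairs = comb(n, 2)     # second-order interactions (unordered pairs)
triples = comb(n, 3)   # third-order interactions (unordered triples)

print(f"{pairs:.3e} pairwise tests")     # ~1.25e11, on the order of n squared
print(f"{triples:.3e} three-way tests")  # ~2.08e16, on the order of n cubed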
If out of three ads aired, two have high breakthrough but one is weak, the weak-performing
ad could be quickly taken off air and the media spend rotated to the higher-performing ads.
This will make breakthrough scores go up.
Instead of only 30-second ads, a mix of 15-second and 30-second ads can be planned; if
real-time data shows that 15-second ads work as well as 30-second ads, then instead of
spending money on 30-second ads, all the money can be spent on 15-second ads and scores
will continue to grow.
The measurement tools and capabilities are enabling real-time optimization on this
and so there’s a catch-up happening both in terms of advertising systems and
processes, but the industry infrastructure must be able to actually enable all of this
real-time optimization.
Now, the impact of social media on sales can be measured through marketing-mix
modeling (MMM). Marketing-mix modeling takes all the different variables in the
marketing mix—including paid, owned, and earned media—and uses them as
independent variables in a regression against sales data, to understand the individual
impact of each of these different variables (a small regression sketch appears below).
Since these methods are quite advanced, organizations use high-end internal analytic
talent and advanced analytics platforms such as SAS or point solutions such as Unica
and Omniture. Alternatively, there are several large analytics providers like Mu
Sigma that supply it as a software-as-a-service (SaaS).
As the world becomes more digital, the quantity and quality of marketing data is
improving, which is leading to more granular and insightful MMM analyses.
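As promised above, here is a very small sketch of the regression idea behind marketing-mix
modeling, using numpy's least-squares solver on made-up weekly data. The figures and variable
names are illustrative only; real MMM uses many more variables, transformations such as
ad-stock, and far more data.

import numpy as np

# Hypothetical weekly spend (in $000s) on paid, owned and earned media, plus weekly sales.
paid   = np.array([10, 12,  9, 15, 14, 11, 16, 13])
owned  = np.array([ 3,  3,  4,  5,  4,  3,  6,  5])
earned = np.array([ 1,  2,  1,  3,  2,  2,  4,  3])
sales  = np.array([120, 135, 118, 160, 150, 128, 172, 148])

# Design matrix with an intercept column; solve sales ~ b0 + b1*paid + b2*owned + b3*earned
X = np.column_stack([np.ones_like(paid), paid, owned, earned])
coefs, *_ = np.linalg.lstsq(X, sales, rcond=None)
print("intercept and per-channel contributions:", np.round(coefs, 2))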
The Three Big Data Vs in Advertising
Impact of the three Vs (volume, velocity, and variety) in advertising:
Volume
The volume of information and data that is available to the advertiser has gone up
exponentially versus what it was 20 years ago. In the old days, we would copy-test our
advertising, the agency would build a demographically targeted media plan, and we'd
execute it. Maybe 6 to 12 months later, we'd try to use whatever sales data we had to
understand if there was any impact. In today's world, there is hugely more
advertising effectiveness data. On TV advertising, we can measure every ad in every
TV show every day, across about 70 percent of the viewing audience. We measure
clients' digital ad performance hourly—by ad, by site, by exposure, and by audience.
On a daily or weekly basis, an advertiser can look at their advertising performance.
Velocity
There are already companies that will automate and optimize advertising on the web
without any human intervention at all based on click-thru. It’s now beginning to
happen on metrics like breakthrough, branding, purchase intent etc. This is sometimes
called programmatic buying. Literally, we’ll have systems in place that will be
measuring the impact of the advertising across websites or different placements
within websites, figuring out where the advertising is performing best. It will be
automated optimization and reallocation happening in real-time. The volume and the
velocity of data, the pace at which we can get the data, make decisions and do things
about it is dramatically increased.
Variety
Before, we really didn’t have a lot of data about how our advertising was performing
in market. We have a lot more data and it’s a lot more granular. We can look at our
brand’s overall advertising performance in the market. But we can also decompose it
into how much of the performance is because of the creative quality, the media weight,
how much is because of the program that the ads sit in, how much is because of the
placement (time of day, time of year, pod position), how much is because of cross-
platform exposure, and how much is because of competitive activity. Then we have the
ability to optimize on most of those things—in real time. And now we can also
measure earned (social) and owned media. Those are all things that weren’t even
measured before.
Apple entered into the mobile and tablet market because of the iPod, which crushed
giants like Sony in the MP3 market. For Apple, the market was not just about selling
hardware or music on iTunes. It gave them a chance to get as close to a consumer as
anyone can possibly get. This close interaction also generated a lot of data that help
them expand and capture new customers. Again it’s all about the data, analytics, and
putting it into action.
Google gives away products that other companies, such as Microsoft, license for a fee, for
the same reason. It also began playing in the mobile hardware space through the
development of the Android platform and the acquisition of Motorola. It’s all about
gathering consumer data and monetizing the data. With Google Dashboard we can see
every search we did, e-mails we sent, IM messages, web-based phone calls,
documents we viewed, and so on. This is powerful for marketers.
The online retailer Amazon has created new hardware with the Kindle, and Barnes &
Noble released the Nook. If both companies know every move we make, what we
download, what we search for, they can study our behaviors to present new products
that they believe will appeal to us. The connection with consumers and more
importantly taking action on the derived data is important to win.
BIG DATA TECHNOLOGY
Components of Hadoop:
1) HDFS (Hadoop Distributed File System):
- HDFS stores the data set in small pieces (blocks) across a collection of servers and
keeps multiple replicas of each piece, so that the data survives disk, server or
software failures.
2) Map Reduce:
- Because Hadoop stores the entire dataset in small pieces across a collection of
servers, analytical jobs can be distributed in parallel to each of the servers storing
part of the data. Each server evaluates the question against its local fragment
simultaneously and reports its result back for collation into a comprehensive
answer.
- Map Reduce is the agent that distributes the work and collects the results.
- Both HDFS and Map Reduce are designed to continue to work even if there are
failures.
- HDFS continuously monitors the data stored on the cluster. If a server becomes
unavailable, a disk drive fails or data is damaged due to hardware or software
problems, HDFS automatically restores the data from one of the known good
replicas stored elsewhere on the cluster.
- Map Reduce monitors the progress of each of the servers participating in the job,
when an analysis job is running. If one of them is slow in returning an answer or
fails before completing its work, Map Reduce automatically starts another
instance of the task on another server that has a copy of the data.
- Because of the way that HDFS and Map Reduce work, Hadoop provides scalable,
reliable and fault-tolerant services for data storage and analysis at very low cost.
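The classic illustration of the Map Reduce idea is word counting. The pure-Python sketch below
only simulates the map, shuffle and reduce phases on a single machine; in Hadoop, the framework
would distribute the same two functions across the servers that hold the data blocks.

from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for every word in the local fragment of data.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Collate all partial results for one key into the final answer.
    return (word, sum(counts))

documents = ["Hadoop stores data in blocks", "Map Reduce processes data in parallel"]

# Shuffle: group the mapped pairs by key, as the framework would between the two phases.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(dict(results))   # e.g. {'hadoop': 1, 'stores': 1, 'data': 2, ...}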
Old Vs New approaches to Data Analytics
Old Approach (Database approach) vs. New Approach (Big Data Analytics):
- Old: Follows a data and analytics technology stack with different layers of
cross-communicating data and works on "scale-up" expensive hardware.
New: Follows a data and analytics platform that does all the data processing and
analytics in one layer, without moving data back and forth, on cheap but scalable
("scale-out") commodity hardware.
- Old: Data is moved to the places where it has to be processed.
New: Data is processed and converted into usable business intelligence where it sits.
- Old: Massive parallel processing was not employed due to hardware and storage
limitations.
New: Hardware and storage are affordable and continue to get cheaper, enabling
massive parallel processing.
- Old: Due to technological limitations, storing, managing and analyzing massive data
sets was difficult.
New: New proprietary technologies and open source inventions enable different
approaches that make it easier and more affordable to store, manage and analyze data.
- Old: Not able to handle unstructured data.
New: The variety of data and the ability to handle unstructured data are on the rise;
the big data approach provides a solution to this.
Data Discovery
- Data discovery is the term used to describe the new wave of business intelligence
that enables users to explore data, make discoveries and uncover insights in a
dynamic and intuitive way versus predefined queries and preconfigured drill-
down dashboards. This approach is being followed by many business users due to
its freedom and flexibility to view Big Data. There are two software companies
that stand out in the crowd by growing their businesses at unprecedented rates in
this space: Tableau Software and QlikTech International.
- Both companies’ approach to the market is much different than the traditional BI
software vendor. They used a sales model referred to as "land and expand". This
model was based on the fact that analytics and reporting are produced by the
people using the results. The model enabled business people to create their own
reports and dashboards.
- The most important characteristic of rapid-fire BI is that business users, not
specialized developers, drive the applications. The result is that everyone wins.
The IT team can stop the backlog of change requests and instead spend time on
strategic IT issues. Users can serve themselves data and reports when needed.
- Here is a simple example of powerful visualization. A company uses an
interactive dashboard to track the critical metrics driving their business. Every
day, the CEO and other executives are plugged in real-time to see how their
markets are performing in terms of sales and profit, what the service quality scores
look like against advertising investments, and how products are performing in
terms of revenue and profit. Interactivity is key: a click on any filter lets the
executive look into specific markets or products. She can click on any data point
in any one view to show the related data in the other views. She can look into any
unusual pattern or outlier by showing details on demand. Or she can click through
the underlying information in a split-second.
- Business intelligence needs to work the way people’s minds work. Users need to
navigate and interact with data any way they want to—asking and answering
questions on their own and in big groups or teams.
- QlikTech has designed QlikView so that users can leverage direct—and indirect—search.
With QlikView search, users type relevant words or phrases in any order and get
instant, associative results. With a global search bar, users can search across the
entire data set. With search boxes on individual list boxes, users can confine the
search to just that field. Users can conduct both direct and indirect searches. For
example, if a user wanted to identify a sales rep but couldn’t remember the sales
rep’s name—just details about the person, such as that he sells fish to customers in
the Nordic region—the user could search on the sales rep list box for "Nordic"
and "fish" to narrow the search results to just the people who meet those criteria.
product/service is zero. Whether a private hosted model or a publicly shared one,
the true value lies in delivering software, data and/or analytics in an "as a service"
model.
Predictive Analytics
- Enterprises will move from reactive positions (business intelligence) to
forward-leaning positions (predictive analytics). Using all the data available, i.e.,
traditional internal data sources combined with new, rich external data sources, will
make predictions more accurate and meaningful. Algorithmic trading and supply
chain optimization are two examples where predictive analytics has greatly
reduced the friction in business. Predictive analytics proliferates in every facet of
our lives, both personal and business. Some of the leading trends in business today
are:
- Recommendation engines, similar to those used by Netflix and Amazon, that use past
purchases and buying behavior to recommend new purchases (a minimal co-occurrence
sketch appears after this list).
- Risk engines for a wide variety of business areas, including market and credit risk,
catastrophic risk and portfolio risk.
- Innovation engines for new product innovation, drug discovery and consumer and
fashion trends to predict new product formulations and new purchases.
- Consumer insight engines that integrate a wide variety of consumer-related
information including sentiment, behavior and emotions.
- Optimization engines that optimize complex interrelated operations and decisions
that are too complex to handle.
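As referenced in the first bullet above, here is a toy sketch of a recommendation engine based
on simple item co-occurrence in past purchases. The purchase histories are invented; production
systems such as those at Netflix or Amazon use far richer models.

from collections import Counter

# Hypothetical purchase histories, one set of items per customer.
histories = [
    {"laptop", "mouse", "keyboard"},
    {"laptop", "mouse"},
    {"phone", "charger"},
    {"laptop", "keyboard"},
]

def recommend(item, histories, top_n=2):
    # Count items that were bought together with the given item.
    co_occurrence = Counter()
    for basket in histories:
        if item in basket:
            co_occurrence.update(basket - {item})
    return [other for other, _ in co_occurrence.most_common(top_n)]

print(recommend("laptop", histories))   # e.g. ['mouse', 'keyboard']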
- A focus on customer success
Unlike traditional enterprise software, with a SaaS business it is easy for customers
to leave if they are not satisfied. Today's BI is not designed for the end user. It
must be designed to be more intuitive, easily accessible and real time, and to meet the
expectations of today's technology customers, who expect a much more connected
experience.
Mobile Business Intelligence
- Simplicity and ease of use had been the major barriers to BI adoption, but mobile
devices have made complicated actions very easy to perform. For example, a
young child can use an iPad or iPhone easily, but not a laptop. This ease of use will
drive the wide adoption of mobile BI.
- Multi touch and software oriented devices have brought mobile analytics and
intelligence to a much wider audience.
- Ease of mobile application development and deployment has also contributed to
the wide adoption of mobile BI.
Three elements that have impacted the viability of mobile BI are
i) Location – the GPS component makes it easy to determine the user's location.
ii) Transactions can be done through smart phones.
iii) Multimedia functionality allows rich visualization.
Three challenges with mobile BI include
i) Managing standards for these devices.
ii) Managing security (always a big challenge).
iii) Managing "bring your own device," where you have devices both owned by the
company and devices owned by the individual, both contributing to
productivity.
Crowdsourcing Analytics
- Crowdsourcing is the recognition that organizations can’t always have the best
and brightest internal people to solve all their big problems. By creating an open,
competitive environment with clear rules and goals, problems can be solved.
- In October 2006, Netflix, an online DVD rental business, announced a contest to
create a new predictive model for recommending movies based on past user
ratings. The grand prize was $1,000,000. Netflix already had an algorithm to solve
the problem but thought there was an opportunity to improve the model, which
would bring in huge revenues.
- Kaggle is an Australian firm that provides an innovative solution for outsourcing
statistical analytics. Kaggle manages competitions among the world's best data
scientists for corporations, governments and research laboratories. Organizations that
confront complex statistical challenges describe the problems to Kaggle and
provide data sets. Kaggle converts the problems and the data into contests that are
posted on its website. The contests feature cash prizes ranging in value from
$100 to $3 million. Kaggle's clients range in size from tiny start-ups to
multinational corporations such as Ford Motor Company and government
agencies such as NASA.
- The idea is that someone comes to Kaggle with a problem, they put it up on their
website and then people from all over the world can compete to see who can
produce the best solution. In essence Kaggle has developed an effective global
platform for crowdsourcing complex analytic problems.
- There are various types of crowdsourcing such as crowd voting, crowd
purchasing, wisdom of crowds, crowd funding and contests.
- Example:
99designs.com/, does crowdsourcing of graphic design.
Agentanything.com/, posts missions where agents are invited to do various
jobs.
33needs.com/, allows people to contribute to charitable programs to make
social impact.
Inter and Trans-Firewall Analytics
- Yesterday, companies did functional, silo-based analytics. Today they are
doing intra-firewall analytics with data within the firewall. Tomorrow they will be
collaborating on insights with other companies to do inter-firewall analytics, as well as
leveraging public domain spaces to do trans-firewall analytics (Fig. 1).
- As Fig. 2 depicts, setting up inter-firewall and trans-firewall analytics can add
significant value, but it presents some challenges. When information is collected
outside the firewall, the noise increases relative to the signal, putting additional
requirements on analytical methods and technology.
- Further, organizations are limited by a fear of collaboration and overreliance on
proprietary information. The fear of collaboration is driven by competitive fears, data
privacy concerns and proprietary orientations that limit opportunities for cross-
organizational learning and innovation. The transition to an inter-firewall and trans-
firewall paradigm may not be easy, but it continues to grow and will become a key
weapon for decision scientists to drive disruptive value and efficiencies.
Figure 1
Figure 2
- For many reasons, organizations find it hard to make changes after spending many
years implementing a data management, BI and analytics stack. So organizations
have to do a lot of research and development on new technologies before
completely adopting them, to minimize the risk. The two core programs that
have to be focused on by R & D teams are:
Program: Innovation management
Goal: Tap into the latent creativity of all Visa employees, providing them with a platform
to demonstrate mastery and engage collaboratively with their colleagues.
Core elements: employee personal growth; employee acquisition and retention.

Program: Research and open innovation
Goal: Look outside of the company and scan the environment for trends, new technology,
and approaches.
Core elements: innovation; competitive advantage.
Adding Big Data Technology
The process that enterprises must follow to get started with the big data technology
1. Practical approach – Start with the problem and then find a solution.
2. Opportunistic Approach – Start with the technology and then find a home for it.
For both approaches, the following activities have to be conducted:
(i) Play – R & D team members may request to install the technology in their lab to get more
familiar with it.
(ii) Initial business review – Talk with the business owner to validate the applicability and
rank the priorities to ensure that it is worth pursuing.
(iii) Architecture review – Assess the validity of the underlying architecture and ensure
that it maps to IT's standards.
(iv) Pilot use cases – Find a use case to test the technology out.
(v) Transfer from R & D to production – Negotiate internally regarding what it would take
to move it from research to production.
Organizations may have a lot of smart people, but there are other smart people outside.
Organizations need to be exposed to the value they are creating. A systematic program that
formalizes relationships with a powerful ecosystem is shown in the following table.
Innovation ecosystem: Leveraging brain power from outside of the organization
Source: Academic community. Example: Tap into a major university that did a major study
on social network analytics.
Source: Vendors' research arms. Example: Leverage research a vendor completed in their
labs demonstrating success leveraging unstructured data.
Source: Research houses. Example: Use research content to support a given hypothesis for a
new endeavor.
Source: Government agencies. Example: Discuss fraud strategies with the intelligence
community.
Source: Venture capital organizations. Example: Have a venture capital firm review some
new trends they are tracking and investing in.
Source: Start-ups. Example: Invite BI and analytic technology start-ups in instead of just
sticking with the usual suspects.
UNIT II INTRODUCTION TO NOSQL
NoSQL is a type of database management system (DBMS) that is designed to handle
and store large volumes of unstructured and semi-structured data. Unlike traditional
relational databases that use tables with pre-defined schemas to store data, NoSQL
databases use flexible data models that can adapt to changes in data structures and are
capable of scaling horizontally to handle growing amounts of data.
The term NoSQL originally referred to "non-SQL" or "non-relational" databases, but the term
has since evolved to mean "not only SQL," as NoSQL databases have expanded to include a
wide range of different database architectures and data models.
Types of NoSQL database
There are multiple types of NoSQL databases. Four of the most common NoSQL databases are:
1. Document databases: A collection of documents, where each document is in JSON or a
JSON-like format. Each document contains pairs of fields and values. The primary
storage is in the storage layer and it is cached out to memory. Examples – MongoDB,
CouchDB, Cloudant
2. Key-value stores: Similar to Python dictionaries; query either by using the key or by
searching through the entire database. Key-value stores tend to be used in memory with
a backing store behind them. Examples – Memcached, Redis, Coherence
3. Wide column databases: Similar to relational database tables; the difference is that the
storage on the backend is different. We can put SQL on top of a wide column database,
which makes it very similar to querying a relational database. Examples – HBase,
Bigtable, Accumulo
4. Graph databases: Store data as nodes (vertices) and relationships (edges). Vertices
typically store object information while edges represent the relationships between
nodes. We can have a SQL-like query language in our graph databases. Examples –
Amazon Neptune, Neo4j
• NoSQL systems are also sometimes called "Not only SQL" to emphasize the fact
that they may support SQL-like query languages. A NoSQL database offers
simplicity of design, simpler horizontal scaling to clusters of machines, and finer
control over availability. The data structures used by NoSQL databases are
different from those used by default in relational databases, which makes some
operations faster in NoSQL. The suitability of a given NoSQL database depends on
the problem it should solve.
• NoSQL databases, also known as "not only SQL" databases, are a new type of
database management system that has gained popularity in recent years. Unlike
traditional relational databases, NoSQL databases are designed to handle large
amounts of unstructured or semi-structured data, and they can accommodate dynamic
changes to the data model. This makes NoSQL databases a good fit for modern web
applications, real-time analytics, and big data processing.
• Data structures used by NoSQL databases are sometimes also viewed as more
flexible than relational database tables. Many NoSQL stores compromise
consistency in favor of availability, speed, and partition tolerance. Barriers to the
greater adoption of NoSQL stores include the use of low-level query languages, lack
of standardized interfaces, and huge previous
investments in existing relational databases.
• Most NoSQL stores lack true ACID (Atomicity, Consistency, Isolation, Durability)
transactions, but a few databases, such as MarkLogic,
Aerospike, FairCom c-treeACE, Google Spanner (though technically a NewSQL
database), Symas LMDB, and OrientDB, have made them central to their designs.
• Most NoSQL databases offer a concept of eventual consistency, in which database
changes are propagated to all nodes, so queries for data might not return updated data
immediately or might result in reading data that is not accurate, a problem
known as stale reads. Also, some NoSQL systems may exhibit lost writes and
other forms of data loss. Some NoSQL systems provide mechanisms such as write-ahead
logging to avoid data loss.
• One simple example of a NoSQL database is a document database. In a document
database, data is stored in documents rather than tables. Each document can contain
a different set of fields, making it easy to accommodate changing data
requirements.
• For example, take a database that holds data regarding employees.
In a relational database, this information might be stored in tables, with one table
for employee information and another table for department information. In a
document database, each employee would be stored as a separate document, with
all of their information contained within the document.
• NoSQL databases are a relatively new type of database management system
that has gained popularity in recent years due to their scalability and
flexibility. They are designed to handle large amounts of unstructured or semi-
structured data and can handle dynamic changes to the data model. This makes
NoSQL databases a good fit for modern web applications, real-time analytics,
and big data processing.
Key Features of NoSQL:
1. Dynamic schema: NoSQL databases do not have a fixed schema and can
accommodate changing data structures without the need for migrations or schema
alterations.
2. Horizontal scalability: NoSQL databases are designed to scale out by
adding more nodes to a database cluster, making them well suited for handling
large amounts of data and high levels of traffic.
3. Document-based: Some NoSQL databases, such as MongoDB, use a document-
based data model, where data is stored in a semi-structured format, such as
JSON or BSON.
4. Key-value-based: Other NoSQL databases, such as Redis, use a key-value data
model, where data is stored as a collection of key-value pairs.
5. Column-based: Some NoSQL databases, such as Cassandra, use a column-based data
model, where data is organized into columns instead of rows.
AGGREGATE DATA MODELS
Aggregate means a collection of objects that are treated as a unit. In NoSQL databases, an
aggregate is a collection of data that interact as a unit. Moreover, these units of data or
aggregates of data form the boundaries for ACID operations.
Aggregate data models in NoSQL make it easier for the databases to manage data
storage over clusters, as the aggregate data or unit can now reside on any of the
machines. Whenever data is retrieved from the database, all the data comes along with
the aggregate.
Aggregate data models in NoSQL don't support ACID transactions across aggregates and
sacrifice one of the ACID properties. With the help of aggregate data models in
NoSQL, you can easily perform OLAP operations on the database.
You can achieve high efficiency of the aggregate data models in a NoSQL
database if the data transactions and interactions take place within the same
aggregate.
The aggregate data models in NoSQL are majorly classified into the data models
listed below:
Key-Value Model
The key-value data model contains a key or an ID used to access or fetch the data of the
aggregate corresponding to that key. In this aggregate data model, the aggregate is opaque
to the database: the key is the only way to look the data up (a minimal sketch follows the
use cases below).
Use Cases:
• These Aggregate Data Models in NoSQL Database are used for storing
the user session data.
• Key-value-based data models are used for maintaining schema-less user
profiles.
• It is used for storing user preferences and shopping cart data.
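A minimal sketch of the session-data use case with a key-value store, using the redis-py
client. This assumes a Redis server running locally; the key names and payload are invented
for illustration.

import json
import redis

r = redis.Redis(host="localhost", port=6379)   # assumes a local Redis instance

# The whole session aggregate is stored as an opaque value under one key.
session = {"user_id": 42, "cart": ["sku-1001", "sku-2002"], "theme": "dark"}
r.set("session:abc123", json.dumps(session), ex=3600)   # expire after one hour

# The only way to fetch the aggregate back is by its key.
restored = json.loads(r.get("session:abc123"))
print(restored["cart"])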
Document Model
The document data model allows access to the parts of an aggregate. In this aggregate data
model, the data can be accessed and queried in a flexible way. The database stores
and retrieves documents, which can be XML, JSON, BSON, etc. There are some
restrictions on the data structure and data types of the aggregates that are to be used
in this aggregate data model.
Use Cases:
• Document Data Models are widely used in E-Commerce platforms
• It is used for storing data from content management systems.
• Document Data Models are well suited for Blogging and Analytics
platforms.
Graph-Based Model
Graph-based data models store data in nodes that are connected by edges. These aggregate
data models are widely used for storing huge volumes of complex aggregates and
multidimensional data with many interconnections between them.
Use Cases:
• Graph-based Data Models are used in social networking sites to store
interconnections.
• It is used in fraud detection systems.
• This Data Model is also widely used in Networks and IT operations.
Now that you have a brief knowledge of aggregate data models in NoSQL databases, this
section walks through an example of how to design them. For this, the data model of an
e-commerce website will be used.
This example of the e-commerce data model has two main aggregates – customer and order.
The customer contains data related to billing addresses, while the order aggregate consists of
ordered items, shipping addresses, and payments. The payment also contains the billing
address.
In the diagram there are two aggregates:
• Customer and order; the link between them represents an aggregate.
• The diamond shows how data fits into the aggregate structure.
• Customer contains a list of billing addresses.
• Payment also contains the billing address.
• The address appears three times and it is copied each time.
• The domain fits where we don't want the shipping and billing addresses on past
orders to change.
If you notice, a single logical address record appears 3 times in the data, but its value is
copied each time it is used. The whole address can be copied into an aggregate as
needed. There is no pre-defined format for drawing the aggregate boundaries; it solely
depends on how you want to manipulate the data as per your requirements.
The Data Model for customer and order would look like this.
// in customers
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [
          {
            "productId": 27,
            "price": 32.45,
            "productName": "NoSQL Distilled"
          }
        ],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [
          {
            "ccinfo": "1000-1000-1000-1000",
            "txnId": "abelif879rft",
            "billingAddress": {"city": "Chicago"}
          }
        ]
      }
    ]
  }
}
In these aggregate data models, if you want to access a customer along with all of the
customer's orders at once, then designing a single aggregate is preferable. But if you want
to access a single order at a time, then you should have separate aggregates for each order.
It is very context-specific.
A document data model is a lot different from other data models because it stores data
in JSON, BSON, or XML documents. In this data model, documents can be embedded (nested)
under one document, and any particular elements can be indexed to run queries faster.
Often documents are stored and retrieved in such a way that they stay close to the data
objects used in many applications, which means very few translations are required to use
the data in an application. JSON is a native format that is often used to store and query
data too.
So in the document data model, each document holds a set of key-value pairs; below is an
example of the same.
{
  "Name": "Yashodhra",
  "Address": "Near Patel Nagar",
  "Email": "[email protected]",
  "Contact": "12345"
}
Working of Document Data Model:
This is a data model which works as a semi-structured data model, in which the records and
the data associated with them are stored in a single document, which means this data model
is not completely unstructured. The main thing is that the data here is stored in a document.
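A small sketch of the same idea using the PyMongo driver for MongoDB. This assumes a MongoDB
server running locally; the database, collection, and field values are made up for illustration.
Note that the two documents do not need to share the same fields.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumes a local MongoDB instance
people = client["demo_db"]["people"]                # database and collection are created lazily

# Documents in one collection may have different fields (flexible schema).
people.insert_one({"Name": "Yashodhra", "Address": "Near Patel Nagar", "Contact": "12345"})
people.insert_one({"Name": "Martin", "City": "Chicago", "orders": 3})

# Query by any field; an index on that field would make this faster.
print(people.find_one({"Name": "Martin"}))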
Features:
• Document type model: As we all know, data is stored in documents rather than tables
or graphs, so it becomes easy to map things in many programming languages.
• Flexible schema: The overall schema is very flexible; to support this
statement, one must know that not all documents in a collection need to have
the same fields.
• Distributed and resilient: Document data models are very much dispersed, which is
the reason behind horizontal scaling and distribution of data.
• Manageable query language: These data models are the ones in which the query
language allows developers to perform CRUD (Create, Read, Update, Delete)
operations on the data model.
Examples of Document Data Models :
• Amazon DocumentDB
• MongoDB
• Cosmos DB
• ArangoDB
• Couchbase Server
• CouchDB
Applications of Document Data Model :
• Content management: These data models are very much used in creating video
streaming platforms, blogs, and similar services, because each piece of content is stored
as a single document and the database is much easier to maintain as the service evolves
over time.
• Book databases: These are very useful in making book databases, because this data
model lets us nest information.
• Catalogs: When it comes to storing and reading catalog files, these data models are
very much used because of their fast reading ability, even when catalogs have thousands
of attributes stored.
• Analytics platforms: These data models are very much used in analytics platforms.
6. Distributed and high availability: NoSQL databases are often designed to be highly
available and to automatically handle node failures and data replication across
multiple nodes in a database cluster.
7. Flexibility: NoSQL databases allow developers to store and retrieve data in
a flexible and dynamic manner, with support for multiple data types and
changing data structures.
8. Performance: NoSQL databases are optimized for high performance and can
handle a high volume of reads and writes, making them suitable for big data and real-
time applications.
Advantages of NoSQL: There are many advantages of working with NoSQL databases such
as MongoDB and Cassandra. The main advantages are high scalability and high availability.
Disadvantages of NoSQL:
2. Lack of ACID compliance: NoSQL databases are not fully ACID-compliant, which
means that they do not guarantee the consistency,
integrity, and durability of data. This can be a drawback for applications that
require strong data consistency guarantees.
3. Narrow focus: NoSQL databases have a very narrow focus, as they are mainly designed
for storage and provide very little functionality. Relational databases are a better
choice in the field of transaction management than NoSQL.
4. Open-source: NoSQL is an open-source database. There is no reliable standard for
NoSQL yet. In other words, two database systems are likely to be unequal.
5. Lack of support for complex queries: NoSQL databases are not designed to handle
complex queries, which means that they are not a good fit for applications that
require complex data analysis or reporting.
6. Lack of maturity: NoSQL databases are relatively new and lack the maturity of
traditional relational databases. This can make them less reliable and less secure than
traditional databases.
7. Management challenge: The purpose of big data tools is to make the management of
a large amount of data as simple as possible, but it is not so easy. Data
management in NoSQL is much more complex than in a relational database. NoSQL,
in particular, has a reputation for being challenging to install and even more hectic to
manage on a daily basis.
8. GUI is not available: GUI-mode tools to access the database are not
flexibly available in the market.
9. Backup: Backup is a great weak point for some NoSQL databases like MongoDB.
MongoDB has no approach for the backup of data in a consistent manner.
10. Large document size: Some database systems like MongoDB and
CouchDB store data in JSON format. This means that documents are quite large
(big data, network bandwidth, speed), and having descriptive key names actually
hurts since they increase the document size.
GRAPH DATABASES
A graph database is a type of NoSQL database that is designed to handle data with complex
relationships and interconnections. In a graph database, data is stored as nodes and edges,
where nodes represent entities and edges represent the relationships between those entities.
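A minimal sketch of the node-and-edge view using the networkx library in Python. A real graph
database such as Neo4j or Amazon Neptune adds a query language (e.g., Cypher or Gremlin),
persistence, and indexing on top of this idea; the people and relationships below are invented.

import networkx as nx

# Nodes are entities (people); edges carry the relationship type as a property.
G = nx.Graph()
G.add_edge("Alice", "Bob",   relation="FRIEND")
G.add_edge("Bob",   "Carol", relation="FRIEND")
G.add_edge("Carol", "Dave",  relation="COLLEAGUE")

# A typical graph query: friends-of-friends of Alice who are not already her friends.
friends = set(G.neighbors("Alice"))
fof = {f2 for f in friends for f2 in G.neighbors(f)} - friends - {"Alice"}
print("Suggested connections for Alice:", fof)   # {'Carol'}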
1. Graph databases are particularly well-suited for applications that require deep and
complex queries, such as social networks, recommendation engines, and fraud
detection systems. They can also be used for other types of applications, such as
supply chain management, network and infrastructure management, and
bioinformatics.
2. One of the main advantages of graph databases is their ability to handle and represent
relationships between entities. This is because the relationships between entities are as
important as the entities themselves, and often cannot be easily represented in a
traditional relational database.
3. Another advantage of graph databases is their flexibility. Graph databases can
handle data with changing structures and can be adapted to new use cases
without requiring significant changes to the database schema. This makes them
particularly useful for applications with rapidly changing data structures or
complex data requirements.
4. However, graph databases may not be suitable for all applications. For
example, they may not be the best choice for applications that require simple queries
or that deal primarily with data that can be easily represented in a traditional
relational database. Additionally, graph databases may require more specialized
knowledge and expertise to use effectively.
SCHEMALESS DATABASES
A schemaless database does not enforce a fixed structure on its records: as with the
"dynamic schema" and "flexible schema" characteristics described above, each document or
row can carry its own set of fields, and the structure can evolve without migrations.
CASSANDRA
Apache Cassandra is a highly scalable, high-performance, distributed NoSQL database.
Cassandra is designed to handle huge amounts of data across many commodity servers,
providing high availability without a single point of failure.
Cassandra has a distributed architecture which is capable of handling a huge amount of
data. Data is placed on different machines with a replication factor greater than one to
attain high availability without a single point of failure.
Cassandra is a NoSQL database
A NoSQL database is a non-relational database. It is also called "Not Only SQL." It is a
database that provides a mechanism to store and retrieve data other than the tabular
relations used in relational databases. These databases are schema-free, support easy
replication, have simple APIs, are eventually consistent, and can handle huge amounts of data.
Reasons behind its popularity
Cassandra is an Apache product. It is an open source, distributed and decentralized
storage system (database). It is used to manage very large amounts of structured data
spread out across the world. It provides high availability with no single point of failure.
Important Points of Cassandra
o Cassandra is a column-oriented database.
o Cassandra is scalable, consistent, and fault-tolerant.
o Cassandra's distribution design is based on Amazon's Dynamo and its
data model on Google's Bigtable.
o Cassandra is created at Facebook. It is totally different from relational
database management systems.
o Cassandra follows a Dynamo-style replication model with no single point of failure,
but adds a more powerful "column family" data model.
o Cassandra is used by some of the biggest companies, such as Facebook,
Twitter, Cisco, Rackspace, eBay, and more.
History of Cassandra
Cassandra was initially developed at Facebook by Avinash Lakshman (one of the
authors of Amazon's Dynamo) and Prashant Malik. It was developed to power the Facebook
inbox search feature.
The most important happenings in Cassandra's history: Facebook open-sourced it in July 2008,
it entered the Apache Incubator in March 2009, and it graduated to a top-level Apache project
in February 2010.
Features of Cassandra
There are a lot of outstanding technical features that make Cassandra very popular.
Following is a list of some popular features of Cassandra:
High Scalability
Cassandra is highly scalable; you can add more hardware to accommodate more
customers and more data as required.
Rigid Architecture
Cassandra has no single point of failure, and it is continuously available for business-
critical applications that cannot afford downtime.
Transaction Support
Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID).
Fast writes
Cassandra was designed to run on cheap commodity hardware. It performs
blazingly fast writes and can store hundreds of terabytes of data, without
sacrificing read efficiency.
Cassandra Architecture
Cassandra was designed to handle big data workloads across multiple nodes without a
single point of failure. It has a peer-to-peer distributed system across its nodes, and data is
distributed among all the nodes in a cluster.
o In Cassandra, each node is independent and at the same time interconnected to other
nodes. All the nodes in a cluster play the same role.
o Every node in a cluster can accept read and write requests, regardless of where
the data is actually located in the cluster.
o In the case of failure of one node, Read/Write requests can be served from other
nodes in the network.
Data Replication in Cassandra
In Cassandra, nodes in a cluster act as replicas for a given piece of data. If some of the nodes
respond with an out-of-date value, Cassandra returns the most recent value to the
client. After returning the most recent value, Cassandra performs a read repair in the
background to update the stale values.
In this way, Cassandra uses data replication among the nodes in a cluster to ensure that there
is no single point of failure.
Components of Cassandra
The main components of Cassandra are the node (where data is stored), the data center (a
collection of related nodes), the cluster (one or more data centers), the commit log (a
crash-recovery write log), the mem-table (an in-memory data structure), the SSTable (an
on-disk file to which mem-tables are flushed), and the Bloom filter (a quick, probabilistic
test for whether an SSTable may contain a given row).
Write Operations
Every write is first recorded in the commit log of the node that receives it. The data is then
written to the mem-table, an in-memory structure. Whenever the mem-table is full, its contents
are flushed to an SSTable on disk. All writes are automatically partitioned and replicated
throughout the cluster. Cassandra periodically consolidates (compacts) the SSTables, discarding
unnecessary data.
Read Operations
In read operations, Cassandra first checks the mem-table for the requested value and consults
the Bloom filters to find the appropriate SSTable(s) that contain the required data.
There are three types of read requests that a coordinator sends to replicas:
o Direct request
o Digest request
o Read repair request
The coordinator sends a direct request to one of the replicas and digest requests to the
number of replicas specified by the consistency level, then checks whether the returned data
is up to date. If any node returns an out-of-date value, a background read repair request
updates that data. This process is called the read repair mechanism.
Cassandra is a great database for many online companies and social media providers for
analysis and recommendation to their customers.
• Job Tracker– Just like the storage (HDFS), the computation (MapReduce) also
works in a master-slave / master-worker fashion. A Job Tracker node acts as the
Master and is responsible for scheduling / executing Tasks on appropriate nodes,
coordinating the execution of tasks, sending the information for the execution of
tasks, getting the results back after the execution of each task, re-executing the failed
Tasks, and monitoring / maintaining the overall progress of the Job. Since a Job consists
of multiple Tasks, a Job’s progress depends on the status / progress of Tasks
associated with it. There is only one Job Tracker node per Hadoop Cluster.
• Map() – Map Task in MapReduce is performed using the Map() function. This part
of the MapReduce is responsible for processing one or more chunks of data and
producing the output results.
• Reduce() – The next part / component / stage of the MapReduce programming
model is the Reduce() function. This part of the MapReduce is responsible for
consolidating the results produced by each of the Map() functions/tasks.
• Data Locality – MapReduce tries to place the data and the compute as close as
possible. First, it tries to put the compute on the same node where data resides, if that
cannot be done (due to reasons like compute on that node is down, compute on that
node is performing some other computation, etc.), then it tries to put the compute on
the node nearest to the respective data node(s) which contains the data to be
processed. This feature of MapReduce is called “Data Locality”.
The following diagram shows the logical flow of a MapReduce programming model.
Game Example
Say you are processing a large amount of data and trying to find out what percentage of
your user base was talking about games. First, we will identify the keywords which we
are going to map from the data to conclude that it’s something related to games. Next, we
will write a mapping function to identify such patterns in our data. For example, the
keywords can be Gold medals, Bronze medals, Silver medals, Olympic football, basketball,
cricket, etc.
Let us take a chunk from the big data set and see how it would be processed. For the chunk
“Merry Christmas”
the games mapper finds none of its keywords, so it emits nothing for this chunk. In the same
way, we can define any number of mapping functions for mapping various words:
“Olympics”, “Gold Medals”, “cricket”, etc.
Reducing Phase – The reducing function accepts the input from all these mappers in the
form of key-value pairs and then processes it. So, the input to the reduce function will look
like the following:
reduce (“football”=>2)
reduce (“Olympics”=>3)
Now, looking at the bigger picture, we can write any number of mapper functions. Let us say
that you want to know who was wishing each other. In this case you will write a
mapping function to map words like “Wishing”, “Wish”, “Happy”, and “Merry”, and then
write a corresponding reducer function.
Here you will need one function for shuffling which will distinguish between the “games”
and “wishing” keys returned by the mappers and will send them to the respective reducer
function. Similarly, you may need a splitting function initially to feed inputs to the mapper
functions in the form of chunks. The flow of the MapReduce algorithm can be summarized
as follows:
• The input data can be divided into n chunks depending upon the amount of data and the
processing capacity of each individual unit.
• Next, the chunks are passed to the mapper functions. Note that all the chunks are
processed simultaneously, which is what enables the parallel processing of data.
• After that, shuffling happens, which leads to the aggregation of similar patterns.
• Finally, the reducers combine them all to get a consolidated output as per the logic.
• This algorithm is inherently scalable: depending on the size of the input data, we can
keep increasing the number of parallel processing units. A minimal Java sketch of such a
mapper and reducer follows.
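To make the flow above concrete, here is a minimal sketch (not from the original notes) of how
the games-keyword mapper and reducer could look with the Hadoop Java MapReduce API. The
class names (GamesKeywordMapper, GamesKeywordReducer) and the keyword list are illustrative
assumptions.
java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (keyword, 1) for every games-related keyword found in a chunk of text.
public class GamesKeywordMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final Set<String> KEYWORDS = new HashSet<>(
            Arrays.asList("olympics", "gold medals", "football", "basketball", "cricket"));
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String text = line.toString().toLowerCase();
        for (String keyword : KEYWORDS) {
            if (text.contains(keyword)) {
                context.write(new Text(keyword), ONE);   // e.g. ("football", 1)
            }
        }
    }
}

// Reducer: sums the 1s per keyword, producing pairs such as ("football", 2) and ("olympics", 3).
class GamesKeywordReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text keyword, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
            total += count.get();
        }
        context.write(keyword, new IntWritable(total));
    }
}
A second mapper/reducer pair for the "wishing" keywords would follow the same pattern, with
the shuffle phase routing each key to the matching reducer.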
With MRUnit, you can craft test input, push it through your mapper and/or reducer, and
verify its output all in a JUnit test.
As do other JUnit tests, this allows you to debug your code using the JUnit test as a driver.
A map/reduce pair can be tested using MRUnit's MapReduceDriver; a combiner can be
tested using MapReduceDriver as well.
A PipelineMapReduceDriver allows you to test a workflow of map/reduce jobs. Currently,
partitioners do not have a test driver under MRUnit.
MRUnit allows you to do TDD (Test-Driven Development) and write lightweight unit tests
which accommodate Hadoop's specific architecture and constructs.
Example: We’re processing road surface data used to create maps. The input contains both
linear surfaces and intersections. The mapper takes a collection of these mixed surfaces as
input, discards anything that isn’t a linear road surface, i.e., intersections, and then processes
each road surface and writes it out to HDFS. We can keep count and eventually print out
how many non-road surfaces are inputs. For debugging purposes, we can additionally print
out how many road surfaces were processed.
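As a hedged illustration of the TDD workflow described above, here is what a minimal MRUnit
test might look like for the hypothetical GamesKeywordMapper sketched earlier (MRUnit 1.x
new-API drivers and JUnit 4 are assumed):
java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class GamesKeywordMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // Wire the mapper under test into an MRUnit driver.
        mapDriver = MapDriver.newMapDriver(new GamesKeywordMapper());
    }

    @Test
    public void emitsKeywordCountForAGamesLine() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("Olympic football fans cheer"))
                 .withOutput(new Text("football"), new IntWritable(1))
                 .runTest();   // the JUnit test fails if the actual output differs
    }
}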
• The MapReduce application master, which coordinates the tasks running the
MapReduce job. The application master and the MapReduce tasks run in containers
that are scheduled by the resource manager and managed by the node managers.
• The distributed filesystem, which is used for sharing job files between the other entities.
Classic MapReduce
A job run in classic MapReduce is illustrated in Figure 6-1. At the highest level, there are
four independent entities:
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose
main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are
Java applications whose main class is TaskTracker.
• The distributed filesystem (normally HDFS), which is used for sharing job files between
the other entities.
Job Initialization:
When the JobTracker receives a call to its submitJob() method, it puts it into an internal
queue from where the job scheduler will pick it up and initialize it. Initialization involves
creating an object to represent the job being run.
To create the list of tasks to run, the job scheduler first retrieves the input splits computed by
the client from the shared filesystem. It then creates one map task for each split.
Task Assignment:
Tasktrackers run a simple loop that periodically sends heartbeat method calls to the
jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive. As part of the heartbeat,
a tasktracker indicates whether it is ready to run a new task, and if it is, the jobtracker
will allocate it a task, which it communicates to the tasktracker using the heartbeat return
value.
Task Execution:
Now that the tasktracker has been assigned a task, the next step is for it to run the task. First,
it localizes the job JAR by copying it from the shared filesystem to the tasktracker’s
filesystem. It also copies any files needed from the distributed cache by the application to
the local disk. TaskRunner launches a new Java Virtual Machine to run each task in.
YARN
YARN (Yet Another Resource Negotiator) takes Hadoop beyond the Java MapReduce-only
model and lets other applications such as HBase and Spark run on the cluster. Different YARN
applications can co-exist on the same cluster, so MapReduce, HBase, and Spark can all run at the
same time, bringing great benefits for manageability and cluster utilization.
Components Of YARN
o Client: For submitting MapReduce jobs.
o Resource Manager: To manage the use of resources across the cluster
o Node Manager: For launching and monitoring the compute containers on machines
in the cluster.
o MapReduce Application Master: Coordinates the tasks running the MapReduce job. The
application master and the MapReduce tasks run in containers that are scheduled by
the resource manager and managed by the node managers.
Benefits of YARN
o Scalability: MapReduce 1 hits a scalability bottleneck at 4,000 nodes and 40,000
tasks, but YARN is designed for 10,000 nodes and 100,000 tasks.
o Utilization: The Node Manager manages a pool of resources, rather than a fixed
number of designated slots, thus increasing utilization.
o Multitenancy: Different versions of MapReduce can run on YARN, which makes the
process of upgrading MapReduce more manageable.
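For context, a driver for the hypothetical games-keyword job sketched earlier might look like
the following. This is an assumed sketch, not code from the notes: on a YARN cluster,
waitForCompletion() submits the job to the ResourceManager, which launches an MRAppMaster
container to coordinate the map and reduce tasks.
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GamesKeywordJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "games keyword count");
        job.setJarByClass(GamesKeywordJob.class);
        job.setMapperClass(GamesKeywordMapper.class);
        job.setCombinerClass(GamesKeywordReducer.class);   // combiner reuses the reducer's types
        job.setReducerClass(GamesKeywordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submits the job to the ResourceManager and waits for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}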
MapReduce Types
Mapping is the core technique of processing a list of data elements that come in pairs of
keys and values. The map function applies to individual elements defined as key-value pairs
of a list and produces a new list. The general idea of map and reduce function of Hadoop
can be illustrated as follows:
map: (K1, V1) -> list (K2, V2)
reduce: (K2, list(V2)) -> list (K3, V3)
The input parameters of the key and value pair, represented by K1 and V1 respectively, are
different from the output pair types K2 and V2. The reduce function accepts the same format
output by the map, but the output types of the reduce operation are again different: K3 and
V3. The Java API for this is as follows:
public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable
{
void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws
IOException;
}
public interface Reducer<K2, V2, K3, V3> extends JobConfigurable,Closeable
{
void reduce(K2 key, Iterator<V2> values,
OutputCollector<K3, V3> output, Reporter reporter)throws
IOException;
}
Note that the combine and reduce functions use the same type, except in the variable names
where K3 is K2 and V3 is V2.
The partition function operates on the intermediate key-value types. It controls the
partitioning of the keys of the intermediate map outputs. The key derives the partition using
a typical hash function. The total number of partitions is the same as the number of reduce
tasks for the job. The partition is determined only by the key ignoring the value.
public interface Partitioner<K2, V2> extends JobConfigurable
{
int getPartition(K2 key, V2 value, int numPartitions);
}
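For illustration, a hash-based partitioner in the old org.apache.hadoop.mapred API looks
roughly like this (a sketch of the standard HashPartitioner logic, not text from the notes):
java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {

    public void configure(JobConf job) { }   // no configuration needed

    public int getPartition(K2 key, V2 value, int numPartitions) {
        // Mask off the sign bit so the result is always a valid, non-negative partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}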
Hadoop is an open-source framework for processing, storing, and analyzing large volumes of
data in a distributed computing environment. It provides a reliable, scalable, and distributed
computing system for big data.
Key Components:
• Hadoop Distributed File System (HDFS): HDFS is the storage system of Hadoop,
designed to store very large files across multiple machines.
• MapReduce: MapReduce is a programming model for processing and generating large
datasets that can be parallelized across a distributed cluster of computers.
• YARN (Yet Another Resource Negotiator): YARN is the resource management layer of
Hadoop, responsible for managing and monitoring resources in a cluster.
Advantages of Hadoop:
• Scalability: Hadoop can handle and process vast amounts of data by distributing it across
a cluster of machines.
• Fault Tolerance: Hadoop is fault-tolerant, meaning it can recover from failures, ensuring
that data processing is not disrupted.
• Cost-Effective: It allows businesses to store and process large datasets cost-effectively,
as it can run on commodity hardware.
Installing Hadoop on a single-node cluster is a common way to set up Hadoop for learning and
development purposes. In this guide, I'll walk you through the step-by-step installation of
Hadoop on a single Ubuntu machine.
Prerequisites: an Ubuntu machine with a Java JDK installed (Hadoop requires Java).
Step 1: Download Hadoop
1. Visit the Apache Hadoop website (https://hadoop.apache.org) and choose the Hadoop
version you want to install. Replace X.Y.Z with the version number you choose.
2. Download the Hadoop distribution using wget or your web browser. For example:
bash
wget https://archive.apache.org/dist/hadoop/common/hadoop-X.Y.Z/hadoop-
X.Y.Z.tar.gz
Step 2: Extract Hadoop 3. Extract the downloaded Hadoop tarball to your desired directory
(e.g., /usr/local/):
bash
sudo tar -xzvf hadoop-X.Y.Z.tar.gz -C /usr/local/
Step 3: Configure Environment Variables 4. Edit your ~/.bashrc file to set up environment
variables. Replace X.Y.Z with your Hadoop version:
bash
export HADOOP_HOME=/usr/local/hadoop-X.Y.Z
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
bash
source ~/.bashrc
Step 4: Edit Hadoop Configuration Files 5. Navigate to the Hadoop configuration directory:
bash
cd $HADOOP_HOME/etc/hadoop
6. Edit the hadoop-env.sh file to specify the Java home directory. Add the following line to
the file, pointing to your Java installation:
bash
export JAVA_HOME=/usr/lib/jvm/default-java
7. Configure Hadoop's core-site.xml by editing it and adding the following property inside the
<configuration> element. This sets the default filesystem URI (the NameNode address):
xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
8. Configure Hadoop's hdfs-site.xml by editing it and adding the following properties inside the
<configuration> element. This sets the replication factor and the HDFS data and metadata directories:
xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/local/hadoop-X.Y.Z/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///usr/local/hadoop-X.Y.Z/data/datanode</value>
</property>
Step 5: Format the HDFS Filesystem 9. Before starting Hadoop services, you need to format
the HDFS filesystem. Run the following command:
bash
hdfs namenode -format
Step 6: Start Hadoop Services 10. Start the Hadoop services using the following command (on
newer Hadoop releases you can instead run start-dfs.sh followed by start-yarn.sh):
bash
start-all.sh
Step 7: Verify Hadoop Installation 11. Check the running Hadoop processes using the jps
command:
bash
jps
You should see a list of Java processes running, including NameNode, DataNode,
ResourceManager, and NodeManager.
Step 8: Access Hadoop Web UI 12. Open a web browser and access the Hadoop Web UI at
http://localhost:50070/ (for HDFS; on Hadoop 3.x the NameNode UI is on port 9870) and
http://localhost:8088/ (for the YARN ResourceManager).
You have successfully installed Hadoop on a single-node cluster. You can now use it for learning
and experimenting with Hadoop and MapReduce.
• Text Files: Simple plain text files, where each line represents a record.
Analyzing data with Hadoop involves understanding the data format and structure, as well as
using appropriate tools and techniques for processing and deriving insights from the data. Here
are some key considerations when it comes to data format and analysis with Hadoop:
1. Data Format:
• Structured Data: If your data is structured, meaning it follows a fixed schema, you can
use formats like Avro, Parquet, or ORC. These columnar storage formats are efficient for
large-scale data analysis and support schema evolution.
• Semi-Structured Data: Data in JSON or XML format falls into this category. Hadoop
can handle semi-structured data, and tools like Hive and Pig can help you query and
process it effectively.
• Unstructured Data: Text data, log files, and other unstructured data can be processed
using Hadoop as well. However, processing unstructured data often requires more
complex parsing and natural language processing (NLP) techniques.
2. Data Ingestion:
• Before you can analyze data with Hadoop, you need to ingest it into the Hadoop
Distributed File System (HDFS) or another storage system compatible with Hadoop.
Tools like Apache Flume or Apache Sqoop can help with data ingestion.
3. Data Processing:
• Hadoop primarily uses the MapReduce framework for batch data processing. You write
MapReduce jobs to specify how data should be processed. However, there are also high-
level processing frameworks like Apache Spark and Apache Flink that provide more user-
friendly abstractions and real-time processing capabilities.
4. Data Analysis:
• For SQL-like querying of structured data, you can use Apache Hive, which provides a
SQL interface to Hadoop. Hive queries get translated into MapReduce or Tez jobs.
• Apache Pig is a scripting language specifically designed for data processing in Hadoop.
It's useful for ETL (Extract, Transform, Load) tasks.
• For advanced analytics and machine learning, you can use Apache Spark, which provides
MLlib for machine learning tasks, and GraphX for graph processing.
5. Data Storage and Compression:
• Hadoop provides various storage formats optimized for analytics (e.g., Parquet, ORC)
and supports data compression to reduce storage requirements and improve processing
speed.
6. Data Partitioning and Shuffling:
• Hadoop can automatically partition data into smaller chunks and shuffle it across nodes to
optimize the processing pipeline.
7. Data Security:
• Hadoop offers mechanisms for securing data and controlling access through
authentication, authorization, and encryption.
8. Data Visualization:
• To make sense of the analyzed data, you can use data visualization tools like Apache
Zeppelin or integrate Hadoop with business intelligence tools like Tableau or Power BI.
9. Performance Tuning:
• Regularly monitor the health and performance of your Hadoop cluster using tools like
Ambari or Cloudera Manager. Perform routine maintenance tasks to ensure smooth
operation.
Analyzing data with Hadoop involves a combination of selecting the right data format,
processing tools, and techniques to derive meaningful insights from your data. Depending on
your specific use case, you may need to choose different formats and tools to suit your needs.
Scaling out is a fundamental concept in distributed computing and is one of the key benefits of
using Hadoop for big data analysis. Here are some important points related to scaling out in
Hadoop:
• Horizontal Scalability: Hadoop is designed for horizontal scalability, which means that
you can expand the cluster by adding more commodity hardware machines to it. This
allows you to accommodate larger datasets and perform more extensive data processing.
• Data Distribution: Hadoop's HDFS distributes data across multiple nodes in the cluster.
When you scale out by adding more nodes, data is automatically distributed across these
new machines. This distributed data storage ensures fault tolerance and high availability.
• Processing Power: Scaling out also means increasing the processing power of the cluster.
You can run more MapReduce tasks and analyze data in parallel across multiple nodes,
which can significantly speed up data processing.
• Elasticity: Hadoop clusters can be designed to be elastic, meaning you can dynamically
add or remove nodes based on workload requirements. This is particularly useful in
cloud-based Hadoop deployments where you pay for resources based on actual usage.
• Balancing Resources: When scaling out, it's important to consider resource management
and cluster balancing. Tools like Hadoop YARN (Yet Another Resource Negotiator) help
allocate and manage cluster resources efficiently.
Scaling Hadoop:
• Horizontal Scaling: Hadoop clusters can scale horizontally by adding more machines to
the existing cluster. This approach improves processing power and storage capacity.
• Vertical Scaling: Vertical scaling involves adding more resources (CPU, RAM) to
existing nodes in the cluster. However, there are limits to vertical scaling, and horizontal
scaling is preferred for handling larger workloads.
• Cluster Management Tools: Tools like Apache Ambari and Cloudera Manager help in
managing and scaling Hadoop clusters efficiently.
• Data Partitioning: Proper data partitioning strategies ensure that data is distributed
evenly across the cluster, enabling efficient processing.
What is Hadoop Streaming? Hadoop Streaming is a utility that comes with Hadoop
distribution. It allows you to create and run MapReduce jobs with any executable or script as the
mapper and/or the reducer. This means you can use any programming language that can read
from standard input and write to standard output for your MapReduce tasks.
1. Input: Hadoop Streaming reads input from HDFS or any other file system and provides
it to the mapper as lines of text.
2. Mapper: You can use any script or executable as a mapper. Hadoop Streaming feeds the
input lines to the mapper's standard input.
3. Shuffling and Sorting: The output from the mapper is sorted and partitioned by the
Hadoop framework.
4. Reducer: Similarly, you can use any script or executable as a reducer. The reducer reads
sorted input lines from its standard input and produces output, which is written to HDFS
or any other file system.
5. Output: The final output is stored in HDFS or the specified output directory.
Advantages of Hadoop Streaming:
• Language Flexibility: It allows developers to use languages like Python, Perl, Ruby,
etc., for writing MapReduce programs, extending Hadoop's usability beyond Java
developers.
• Rapid Prototyping: Developers can quickly prototype and test algorithms without the
need to compile and package Java code.
What is Hadoop Pipes? Hadoop Pipes is a C++ API to implement Hadoop MapReduce
applications. It enables the use of C++ to write MapReduce programs, allowing developers
proficient in C++ to leverage Hadoop's capabilities.
1. Mapper and Reducer: Developers write the mapper and reducer functions in C++.
2. Input: Hadoop Pipes reads input from HDFS or other file systems and provides it to the
mapper as key-value pairs.
3. Map and Reduce Operations: The developer specifies the map and reduce functions,
defining the logic for processing the input key-value pairs.
Advantages of Hadoop Pipes:
• Performance: Programs written in C++ can sometimes be more performant due to the
lower-level memory management and execution speed of C++.
• C++ Libraries: Developers can leverage existing C++ libraries and codebases, making it
easier to integrate with other systems and tools.
Both Hadoop Streaming and Hadoop Pipes provide flexibility in terms of programming
languages, enabling a broader range of developers to work with Hadoop and leverage its
powerful data processing capabilities.
Architecture: Hadoop Distributed File System (HDFS) is designed to store very large files
across multiple machines in a reliable and fault-tolerant manner. Its architecture consists of the
following components:
1. NameNode: The NameNode is the master server that manages the namespace and
regulates access to files by clients. It stores metadata about the files and directories, such
as the file structure tree and the mapping of file blocks to DataNodes.
2. DataNode: DataNodes are responsible for storing the actual data. They store data in the
form of blocks and send periodic heartbeats and block reports to the NameNode to
confirm that they are functioning correctly.
3. Block: HDFS divides large files into fixed-size blocks (typically 128 MB or 256 MB).
These blocks are stored across the DataNodes in the cluster.
Replication: HDFS replicates each block multiple times (usually three) and places these replicas
on different DataNodes across the cluster. Replication ensures fault tolerance. If a DataNode or
block becomes unavailable, the system can continue to function using the remaining replicas.
3. Replication: As mentioned earlier, HDFS replicates blocks for fault tolerance. The default
replication factor is 3, but it can be configured based on the cluster's requirements.
4. Fault Tolerance: HDFS achieves fault tolerance by replicating data blocks across multiple
nodes. If a DataNode or block becomes unavailable due to hardware failure or other issues, the
system can continue to operate using the replicated blocks.
5. High Write Throughput: HDFS is optimized for high throughput of data, making it suitable
for applications with large datasets. It achieves this through the parallelism of writing and
reading data across multiple nodes.
6. Scalability: HDFS is designed to scale horizontally by adding more nodes to the cluster. This
scalability allows Hadoop clusters to handle large and growing amounts of data.
7. Data Integrity: HDFS ensures data integrity by storing checksums of data with each block.
This checksum is verified by clients and DataNodes to ensure that data is not corrupted during
storage or transmission.
Hadoop provides Java APIs that developers can use to interact with the Hadoop ecosystem. The
Java interface in Hadoop includes various classes and interfaces that allow developers to create
MapReduce jobs, configure Hadoop clusters, and manipulate data stored in HDFS. Here's a brief
overview of key components in the Java interface:
1. org.apache.hadoop.mapreduce Package:
o Mapper: Interface for the mapper task in a MapReduce job.
o Reducer: Interface for the reducer task in a MapReduce job.
o Job: Represents a MapReduce job configuration.
o InputFormat: Specifies the input format of the job.
o OutputFormat: Specifies the output format of the job.
o Configuration: Represents Hadoop configuration properties.
2. org.apache.hadoop.fs Package:
o FileSystem: Interface representing a file system in Hadoop (HDFS, local file
system, etc.).
o Path: Represents a file or directory path in Hadoop.
3. org.apache.hadoop.io Package:
o Writable: Interface for custom Hadoop data types.
o WritableComparable: Interface for custom data types that are comparable and
writable.
Developers use these interfaces and classes to create custom MapReduce jobs, configure input
and output formats, and interact with HDFS. They can implement the Mapper and Reducer
interfaces to define their own map and reduce logic for processing data.
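As an illustrative sketch of these APIs (not code from the notes), the following small program
opens a file in the configured filesystem and prints it line by line; the path argument is
hypothetical:
java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // the configured filesystem, e.g. HDFS
        Path path = new Path(args[0]);              // e.g. /user/hadoop/input/sample.txt

        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);           // print each line of the file
            }
        }
    }
}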
1. Input Phase:
o Input data is read from one or more sources, such as HDFS files, HBase tables, or
other data storage systems.
o Input data is divided into input splits, which are processed by individual mapper
tasks.
2. Map Phase:
o Mapper tasks process the input splits and produce intermediate key-value pairs.
o The intermediate data is partitioned, sorted, and grouped by key before being sent
to the reducers.
Ensuring data integrity is crucial in any distributed storage and processing system like Hadoop.
Hadoop provides several mechanisms to maintain data integrity (a short checksum-inspection
sketch follows the list below):
1. Replication:
o HDFS stores multiple replicas of each block across different nodes. If a replica is
corrupted, Hadoop can use one of the other replicas to recover the lost data.
2. Checksums:
o HDFS uses checksums to validate the integrity of data blocks. Each block is
associated with a checksum, which is verified by both the client reading the data
and the DataNode storing the data. If a block's checksum doesn't match the
expected value, Hadoop knows the data is corrupted and can request it from
another node.
3. Write Pipelining:
o HDFS pipelines the data through several nodes during the writing process. Each
node in the pipeline verifies the checksums before passing the data to the next
node. If a node detects corruption, it can request the block from another replica.
4. Error Detection and Self-healing:
o Hadoop can detect corrupted blocks and automatically replace them with healthy
replicas from other nodes, ensuring the integrity of the stored data.
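Building on the checksum mechanism above, a small sketch (assumed, not from the notes) that
asks the filesystem for a file-level checksum; HDFS clients also verify block checksums
transparently on every read:
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);
        // Exposes the file-level checksum (for HDFS, an MD5 over the block CRC checksums).
        FileChecksum checksum = fs.getFileChecksum(file);
        System.out.println(file + " => " + checksum);
    }
}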
1. Compression:
o Hadoop supports various compression algorithms like Gzip, Snappy, and LZO.
Compressing data before storing it in HDFS can significantly reduce storage
requirements and improve the efficiency of data processing. You can specify the
compression codec when writing data to HDFS or when configuring MapReduce
jobs.
java
conf.set("mapreduce.map.output.compress", "true");
conf.set("mapreduce.map.output.compress.codec",
"org.apache.hadoop.io.compress.SnappyCodec");
Serialization:
• Hadoop uses its own serialization framework called Writable to serialize data efficiently.
Writable data types are Java objects optimized for Hadoop's data transfer. You can also
use Avro or Protocol Buffers for serialization. These serialization formats are more
efficient than Java's default serialization mechanism, especially in the context of large-
scale data processing.
java
// Writing Avro data to a file with the Avro specific API
import java.io.File;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;

DatumWriter<YourAvroRecord> datumWriter = new
SpecificDatumWriter<>(YourAvroRecord.class);
DataFileWriter<YourAvroRecord> dataFileWriter = new
DataFileWriter<>(datumWriter);
dataFileWriter.create(yourAvroRecord.getSchema(), new File("output.avro"));
dataFileWriter.append(yourAvroRecord);
dataFileWriter.close();
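For comparison with Avro, the Writable contract itself is small: a type only has to define how to
write and read its fields. The PageView class below is a hypothetical example, not part of Hadoop:
java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// A custom Writable holding a page-view count and a timestamp.
public class PageView implements Writable {

    private long timestamp;
    private int views;

    public PageView() { }                       // no-arg constructor required by Hadoop

    public PageView(long timestamp, int views) {
        this.timestamp = timestamp;
        this.views = views;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(timestamp);               // serialize fields in a fixed order
        out.writeInt(views);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        timestamp = in.readLong();              // deserialize in the same order
        views = in.readInt();
    }

    @Override
    public String toString() {
        return timestamp + "\t" + views;
    }
}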
By utilizing these mechanisms, Hadoop ensures that data integrity is maintained during storage
and processing. Additionally, compression and efficient serialization techniques optimize storage
and data transfer, contributing to the overall performance of Hadoop applications.
Apache Avro is a data serialization framework that provides efficient data interchange in
Hadoop. It enables the serialization of data structures in a language-independent way, making it
ideal for data stored in files. Avro uses JSON for defining data types and protocols, which makes
the data self-describing and supports complex data structures.
Key Concepts:
1. Schema Definition: Avro uses JSON to define schemas. Schemas define the data
structure, including types and their relationships. For example, you can define records,
enums, arrays, and more in Avro schemas.
json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "address", "type": "string"}
  ]
}
2. Serialization: Avro encodes data using the defined schema, producing compact binary
files. Avro data is self-describing, meaning that the schema is embedded in the data itself.
3. Deserialization: Avro can deserialize the data back into its original format using the
schema information contained within the data.
4. Code Generation: Avro can generate code in various programming languages from a
schema. This generated code helps in working with Avro data in a type-safe manner.
Avro is widely used in the Hadoop ecosystem due to its efficiency, schema evolution capabilities,
and language independence, making it a popular choice for serializing data in Hadoop
applications.
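As an illustrative counterpart to the write example shown earlier, the following sketch reads
"output.avro" back with Avro's generic API, recovering the schema from the file itself. The field
names follow the User schema above; this is an assumed example, not code from the notes:
java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class AvroReadExample {
    public static void main(String[] args) throws Exception {
        // No reader schema is supplied: the writer schema is embedded in the Avro file itself.
        DatumReader<GenericRecord> reader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> fileReader =
                 new DataFileReader<>(new File("output.avro"), reader)) {
            Schema schema = fileReader.getSchema();     // recovered from the file
            System.out.println("Schema: " + schema);
            for (GenericRecord user : fileReader) {
                System.out.println(user.get("name") + " is " + user.get("age"));
            }
        }
    }
}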
Apache Cassandra is a highly scalable, distributed NoSQL database that can handle large
amounts of data across many commodity servers. Integrating Cassandra with Hadoop provides
the ability to combine the advantages of a powerful database system with the extensive data
processing capabilities of the Hadoop ecosystem.
Integration Strategies:
1. Cassandra Hadoop Connector: Cassandra provides a Hadoop integration tool called the
Cassandra Hadoop Connector. It allows MapReduce jobs to read and write data to and
from Cassandra.
2. Cassandra as a Source or Sink: Cassandra can act as a data source or sink for Apache
Hadoop and Apache Spark jobs. You can configure Hadoop or Spark to read data from
Cassandra tables or write results back to Cassandra.
3. Cassandra Input/Output Formats: Cassandra supports Hadoop Input/Output formats,
allowing MapReduce jobs to directly read from and write to Cassandra tables.
Benefits of Integration:
• Data Processing: You can perform complex data processing tasks on data stored in
Cassandra using Hadoop's distributed processing capabilities.
• Data Aggregation: Aggregate data from multiple Cassandra nodes using Hadoop's
parallel processing, enabling large-scale data analysis.
• Data Export and Import: Use Hadoop to export data from Cassandra for backup or
analytical purposes. Similarly, you can import data into Cassandra after processing it
using Hadoop.
Integrating Cassandra and Hadoop allows businesses to leverage the best of both worlds:
Cassandra's real-time, high-performance database capabilities and Hadoop's extensive data
processing and analytics features. This integration enables robust, large-scale data applications
for a variety of use cases.
UNIT V
What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It
is an open-source project and is horizontally scalable.
HBase is a data model, similar to Google's Bigtable, designed to provide quick
random access to huge amounts of structured data. It leverages the fault tolerance
provided by the Hadoop Distributed File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to
data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the
Hadoop File System and provides read and write access.
Features of HBase
o Fundamentally, it is a platform for storing and retrieving data with random access.
o It doesn't care about datatypes (you can store an integer in one row and a string in
another for the same column).
o It doesn't enforce relationships within your data.
o It is designed to run on a cluster of computers, built using commodity hardware.
A minimal Java client sketch illustrating random reads and writes follows.
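The following is a hypothetical sketch using the standard HBase Java client API; the table
"users" and column family "info" are assumptions, not part of the original notes:
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random write: one cell in column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("bob"));
            table.put(put);

            // Random read of the same row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}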
Pig
Apache Pig is a high-level data flow platform for executing MapReduce programs of
Hadoop. The language used for Pig is Pig Latin.
Pig scripts get internally converted to MapReduce jobs and are executed on data
stored in HDFS. Apart from that, Pig can also execute its jobs in Apache Tez or Apache
Spark.
Pig can handle any type of data, i.e., structured, semi-structured or unstructured, and
stores the corresponding results in the Hadoop Distributed File
System (HDFS). Every task that can be achieved using Pig can also be achieved by writing
Java MapReduce code.
1) Ease of programming
Writing complex Java programs for MapReduce is quite tough for non-programmers.
Pig makes this process easy; in Pig, the queries are converted to MapReduce
internally.
2) Optimization opportunities
The way in which tasks are encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics rather than efficiency.
3) Extensibility
Users can write user-defined functions (UDFs) in which they implement their own logic to
execute over the data set.
4) Flexible
It can easily handle structured as well as unstructured data.
5) In-built operators
It contains various types of operators such as sort, filter, join, and ordering.
Unlike MapReduce, which does not allow nested data types, Pig provides nested data types
like tuple, bag, and map.
Before we take a look at the operators that Pig Latin provides, we first need to understand Pig's
data model. This includes Pig's data types, how it handles concepts such as missing data, and how
you can describe your data to Pig.
Types
Pig's data types can be divided into two categories: scalar types, which contain a single
value, and complex types, which contain other types.
Scalar Type
Pig's scalar types are simple types that appear in most programming languages. With the
exception of bytearray, they are all represented in Pig interfaces by java.lang classes, making
them easy to work with in UDFs:
int
An integer. Ints are represented in interfaces by java.lang.Integer. They store a four-byte
signed integer. Constant integers are expressed as integer numbers, for example, 42.
long
A long integer. Longs are represented in interfaces by java.lang.Long. They store an
eight-byte signed integer. Constant longs are expressed as integer numbers with an L
appended, for example, 5000000000L.
Complex Types
Pig has three complex data types: maps, tuples, and bags. All of these types can contain data of
any type, including other complex types. So it is possible to have a map where the value field is a
bag, which contains a tuple where one of the fields is a map.
Map
A map in Pig is a chararray to data element mapping, where that element can be any Pig type,
including a complex type. The chararray is called a key and is used as an index to find the
element, referred to as the value.
Because Pig does not know the type of the value, it will assume it is a bytearray. However, the
actual value might be something different. If you know what the actual type is (or what you want
it to be), you can cast it; see Casts. If you do not cast the value, Pig will make a best guess based
on how you use the value in your script. If the value is of a type other than bytearray, Pig will
figure that out at runtime and handle it. See Schemas for more information on how Pig handles
unknown types.
By default there is no requirement that all values in a map must be of the same type. It is legitimate
to have a map with two keys name and age, where the value for name is a chararray and the value
for age is an int. Beginning in Pig 0.9, a map can declare its values to all be of the same type.
This is useful if you know all values in the map will be of the same type, as it allows you to
avoid the casting, and Pig can avoid the runtime type-massaging referenced in the previous
paragraph.
Map constants are formed using brackets to delimit the map, a hash between keys and values,
and a comma between key-value pairs. For
example, ['name'#'bob', 'age'#55] will create a map with two
keys, “name” and “age”. The first value is a chararray, and the second is an integer.
Tuple
A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields,
with each field containing one data element. These elements can be of any type; they do not all
need to be the same type. A tuple is analogous to a row in SQL, with the fields being SQL
columns. Because tuples are ordered, it is possible to refer to the fields by position;
see Expressions in foreach for details. A tuple can, but is not required to, have a schema associated
with it that describes each field's type and provides a name for each field. This allows Pig to
check that the data in the tuple is what the user expects, and it allows the user to reference the
fields of the tuple by name.
Tuple constants use parentheses to indicate the tuple and commas to delimit fields in the tuple.
For example, ('bob', 55) describes a tuple constant with two fields.
Bag
A bag is an unordered collection of tuples. Because it has no order, it is not possible to reference
tuples in a bag by position. Like tuples, a bag can, but is not required to, have a schema associated
with it. In the case of a bag, the schema describes all tuples within the bag.
Bag constants are constructed using braces, with tuples in the bag separated by
commas. For example, {('bob', 55), ('sally', 52), ('john', 25)} constructs a
bag with three tuples, each with two fields.
Pig users often notice that Pig does not provide a list or set type that can store items of any type. It
is possible to mimic a set type using the bag, by wrapping the desired type in a tuple of one field.
For instance, if you want to store a set of integers, you can create a bag with a tuple with one field,
which is an int. This is a bit cumbersome, but it works.
Bag is the one type in Pig that is not required to fit into memory. As you will see later, because
bags are used to store collections when grouping, bags can become quite large. Pig has the ability
to spill bags to disk when necessary, keeping only partial sections of the bag in memory. The size
of the bag is limited to the amount of local disk available for spilling the bag.
Pig Latin
Pig Latin is a data flow language used by Apache Pig to analyze the data in
Hadoop. It is a textual language that abstracts the programming from the Java
MapReduce idiom into a higher-level notation.
Pig Latin statements are used to process the data. Each statement is an operator that accepts a
relation as an input and generates another relation as an output.
Convention Description
() The parentheses can enclose one or more items. They can also be used to indicate the tuple
data type.
Example - (10, xyz, (3,6,9))
[] The straight brackets can enclose one or more items. They can also be used to indicate the
map data type.
Example - [INNER | OUTER]
{} The curly brackets enclose two or more items. They can also be used to indicate the bag
data type.
Example - { block | nested_block }
... The horizontal ellipsis points indicate that you can repeat a portion of the code.
Example - cat path [path ...]
The last few chapters focused on Pig Latin the language. Now we will turn to the practical
matters of developing and testing your scripts. This chapter covers helpful debugging tools such
as describe and explain. It also covers ways to test your scripts. Information on how to make
your scripts perform better will be covered in the next chapter.
Development Tools
Pig provides several tools and diagnostic operators to help you develop your applications. In
this section we will explore these and also look at some tools others have written to make it
easier to develop Pig with standard editors and integrated development environments (IDEs).
Tool URL
Eclipse http://code.google.com/p/pig-eclipse
TextMate http://www.github.com/kevinweil/pig.tmbundle
Vim http://www.vim.org/scripts/script.php?script_id=2186
In addition to these syntax highlighting packages, Pig will also let you check the syntax of your
script without running it. If you add -c or -check to the command line, Pig will just parse and
run semantic checks on your script.
The -dryrun command-line option will also check your syntax, expand any macros and
imports, and perform parameter substitution.
describe
describe shows you the schema of a relation in your script. This can be very helpful as you
are developing your scripts. It is especially useful as you are learning Pig Latin and
understanding how various operators change the data. describe can be applied to any relation
in your script, and you can have multiple describes in a script:
--describe.pig
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends:float);
trimmed = foreach divs generate symbol, dividends;
grpd = group trimmed by symbol;
avgdiv = foreach grpd generate group, AVG(trimmed.dividends);
describe trimmed;
describe grpd;
describe avgdiv;
Data Types
Hive data types are categorized into numeric types, string types, miscellaneous types, and
complex types. A list of Hive data types is given below.
Integer Types: TINYINT, SMALLINT, INT, and BIGINT.
Decimal Types: FLOAT, DOUBLE, and DECIMAL.
Date/Time Types
TIMESTAMP
The TIMESTAMP type supports traditional UNIX timestamps with optional nanosecond precision.
DATES
The DATE value is used to specify a particular year, month and day, in the form
YYYY-MM-DD. However, it does not provide the time of day. The range of the DATE type lies
between 0000-01-01 and 9999-12-31.
String Types
STRING
The string is a sequence of characters. Its values can be enclosed within single quotes
(') or double quotes (").
Varchar
The varchar is a variable-length type whose length lies between 1 and 65535, which
specifies the maximum number of characters allowed in the character string.
CHAR
The char is a fixed-length type whose maximum length is fixed at 255 characters.
• Text File
• Sequence File
• RC File
• AVRO File
• ORC File
• Parquet File
Hive Text File Format
The Hive text file format is the default storage format. You can use the text format to
interchange data with other client applications. The text file format is very common in
most applications. Data is stored in lines, with each line being a record. Each line is
terminated by a newline character (\n).
The text format is a simple plain-text file format. You can use compression (e.g., BZIP2) on the
text file to reduce the storage space.
Create a text table by adding the storage option 'STORED AS TEXTFILE' at the end of a Hive
CREATE TABLE command, for example:
CREATE TABLE table_name (column_specs)
STORED AS TEXTFILE;
Sequence files are Hadoop flat files which store values in binary key-value pairs.
Sequence files are in binary format and are splittable. One of the main
advantages of using sequence files is the ability to merge two or more files into one file.
Create a sequence-file table by specifying 'STORED AS SEQUENCEFILE' at the end of a
CREATE TABLE command:
CREATE TABLE table_name (column_specs)
STORED AS SEQUENCEFILE;
AVRO is an open-source project that provides data serialization and data exchange
services for Hadoop. You can exchange data between the Hadoop ecosystem and programs
written in any programming language. Avro is one of the popular file formats in Big Data
Hadoop based applications.
Create an AVRO table by specifying 'STORED AS AVRO' at the end of a CREATE
TABLE command:
CREATE TABLE table_name (column_specs)
STORED AS AVRO;
Still, much of HiveQL will be familiar. This chapter and the ones that follow discuss the
features of HiveQL using representative examples. In some cases, we will briefly
mention details for completeness, then explore them more fully in later chapters.
This chapter starts with the so-called data definition language parts of HiveQL, which
are used for creating, altering, and dropping databases, tables, views, functions, and
indexes. We'll discuss databases and tables in this chapter, deferring the discussion of
views until Chapter 7, indexes until Chapter 8, and functions until Chapter 13.
We'll also discuss the SHOW and DESCRIBE commands for listing and describing items as
we go.
Subsequent chapters explore the data manipulation language parts of HiveQL that are
used to put data into Hive tables and to extract data to the filesystem, and how to explore
and manipulate data with queries, grouping, filtering, joining, etc.
Databases in Hive
The Hive concept of a database is essentially just
a catalog or namespace of tables. However, databases are very useful for larger clusters with
multiple teams and users, as a way of avoiding table name collisions. It's also common
to use databases to organize production tables into logical groups.
If you don’t specify a database, the default database is used.
CREATE DATABASE financials;
Hive will throw an error if financials already exists. You can suppress these
warnings with this variation:
CREATE DATABASE IF NOT EXISTS financials;
While normally you might like to be warned if a database of the same name already
exists, the IF NOT EXISTS clause is useful for scripts that should create a database on
the fly, if necessary, before proceeding.
You can also use the keyword SCHEMA instead of DATABASE in all the database-
related commands.
At any time, you can see the databases that already exist as follows:
SHOW DATABASES;
If you have a lot of databases, you can restrict the ones listed using
a regular expression, a concept we'll explain in LIKE and RLIKE, if it is new to you. The
following example lists only those databases that start with the letter h and end with any
other characters (the .* part):
SHOW DATABASES LIKE 'h.*';
Hive will create a directory for each database. Tables in that database will be stored in
subdirectories of the database directory. The exception is tables in the default database,
which doesn't have its own directory.
You can override this default location for the new directory as shown in this
example:
CREATE DATABASE financials LOCATION '/my/preferred/directory';