
ITE79 - BIG DATABASE

SYLLABUS:

UNIT I – Introduction to Big Data: Big Data – The Evolution of Big Data – Basics – Big Data
Analytics and its Importance – Challenges – Issues – Future of Big Data (12 hours)

UNIT II – Basic Big Data Analytic Methods and Modeling: Introduction to "R", analyzing and
exploring data with "R" – Modeling: Architecture – Hybrid Data Modeling – Data Computing
Modeling (12 hours)

UNIT III – Technology and Tools: MapReduce/Hadoop – NoSQL: Cassandra, HBase – Apache
Mahout – Tools (12 hours)

UNIT IV – Big Data Security: Big Data Security, Compliance, Auditing and Protection: Pragmatic
Steps to Securing Big Data, Classifying Data, Protecting Big Data Analytics, Big Data and
Compliance, The Intellectual Property Challenge – Big Data in Cyber Defense (12 hours)

UNIT V – Case Studies: MapReduce: Simplified Data Processing on Large Clusters – RDBMS to
NoSQL: Reviewing Some Next-Generation Non-Relational Databases – Analytics: The Real-World
Use of Big Data – New Analysis Practices for Big Data (12 hours)

TOTAL HOURS: 60

Text Books:
1. Frank J. Ohlhorst, "Big Data Analytics: Turning Big Data into Big Money", Wiley & SAS
Business Series, 2013.

Reference Books:
1. Paul C. Zikopoulos, Chris Eaton, Dirk deRoos, Thomas Deutsch, George Lapis,
"Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming
Data", McGraw-Hill, 2012.
2. O'Reilly Radar Team, "Planning for Big Data", O'Reilly Media, 2012.
3. "Big Data Now: Current Perspectives", O'Reilly Media, 2011.

UNIT 1
Introduction to Big Data: Big Data – The Evolution of Big data - Basics - Big Data Analytics and
its Importance – challenges- Issues- Future of Big Data

Course Objectives:
 To understand the concepts of Big Data
 To understand the importance of Big Data analytics
Course Outcomes:
 The students learn the evolution of Big Data
 The students are able to analyze various applications of Big Data

1. What is Big Data?

Big Data refers to collections of large datasets that cannot be processed using traditional computing
techniques. Big Data is not merely data; it has become a complete subject in its own right, involving
various tools, techniques and frameworks.

1.2 Sources of Big Data

Data needs to be stored in a wide variety of formats, and with the evolution and advancement of
technology the amount of data being generated is ever increasing. Sources of Big Data can be
broadly classified into six categories, described below.

Enterprise Data

There are large volumes of data in enterprises in different formats. Common formats include flat
files, emails, Word documents, spreadsheets, presentations, HTML pages/documents, PDF
documents, XMLs, legacy formats, etc. This data that is spread across the organization in different
formats is referred to as Enterprise Data.

Transactional Data

Every enterprise has some kind of applications which involve performing different kinds of
transactions like Web Applications, Mobile Applications, CRM Systems, and many more. To
support the transactions in these applications, there are usually one or more relational databases as a
backend infrastructure. This is mostly structured data and is referred to as Transactional Data.

Social Media

This is self-explanatory. There is a large amount of data getting generated on social networks like
Twitter, Facebook, etc. The social networks usually involve mostly unstructured data formats which
include text, images, audio, videos, etc. This category of data source is referred to as Social Media.

Activity Generated

There is a large amount of data being generated by machines, and it surpasses the data volume
generated by humans. This includes data from medical devices, sensor data, surveillance videos,
satellites, cell phone towers, industrial machinery, and other data generated mostly by machines.
This type of data is referred to as Activity Generated data.

Public Data

This data includes data that is publicly available like data published by governments, research data
published by research institutes, data from weather and meteorological departments, census data,
Wikipedia, sample open source data feeds, and other data which is freely available to the public.
This type of publicly accessible data is referred to as Public Data.

Archives

Organizations archive a lot of data which is either no longer required or is very rarely required. In
today's world, with hardware getting cheaper, no organization wants to discard any data; they want
to capture and store as much data as possible. Archived data also includes scanned documents,
scanned copies of agreements, records of ex-employees and completed projects, and banking
transactions older than what compliance regulations require to be kept active. This type of data,
which is accessed infrequently, is referred to as Archive Data.

1.3 Dimensions of Big Data or Characteristics of Big Data

Big Data can be described in multidimensional terms, in which four dimensions relate to its primary
aspects. These dimensions can be defined as follows:

1. Volume. Big Data comes in one size: large. Enterprises are awash with data, easily amassing
terabytes and even petabytes of information.

2. Variety. Big Data extends beyond structured data to include unstructured data of all varieties:
text, audio, video, click streams, log files, and more.

3. Veracity. The massive amounts of data collected for Big Data purposes can lead to statistical
errors and misinterpretation of the collected information. Purity of the information is critical for
value.

4. Velocity. Often time sensitive, Big Data must be used as it is streaming into the enterprise in order
to maximize its value to the business, but it must also still be available from the archival sources as
well.

These 4Vs of Big Data lay out the path to analytics, with each having intrinsic value in the process of
discovering value.

[Figure: The 4Vs of Big Data]

1.4 Evolution of Big Data

Data has always been around and there has always been a need for storage, processing, and
management of data, since the beginning of human civilization and human societies. However, the
amount and type of data captured, stored, processed, and managed depended then and even now on
various factors including the necessity felt by humans, available tools/technologies for storage,
processing, management, effort/cost, and ability to gain insights into the data, make decisions, and
so on.

Going back a few centuries, in the ancient days, humans used very primitive ways of capturing and
storing data, such as carving on stones, metal sheets, wood, etc. With later inventions and
advancements, humans started capturing data on paper, cloth, etc. As time progressed, the media for
capturing, storing, and managing data became punched cards, followed by magnetic drums, laser
discs, floppy disks, and magnetic tapes, and today we store data on various devices like USB drives,
compact discs, and hard drives.

As we can clearly see from this trend, the capacity of data storage has been increasing
exponentially, and today with the availability of the cloud infrastructure, potentially one can store
unlimited amounts of data. Today Terabytes and Petabytes of data is being generated, captured,
processed, stored, and managed.

The future of Big Data depends on Smart Data. Smart Data supports rapid integration of both
unstructured and semi-structured data. The self-describing properties of Smart Data are practical
necessities for the massive quantities, differentiated data types, and high volumes of Big Data,
because they facilitate handling all three structural categories of data:

 Structured data: relational data.

 Semi-structured data: XML data.

 Unstructured data: Word documents, PDFs, text, media logs.

1.5 Big Data Analytics and its Importance

1.5.1 Big Data Analytics

The definition of big data holds the key to understanding big data analytics. According to the Gartner
IT Glossary, Big Data is high-volume, high-velocity, and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced insight and decision
making.

Like conventional analytics and business intelligence solutions, big data mining and analytics helps
uncover hidden patterns, unknown correlations, and other useful business information. However, big
data tools can analyze high-volume, high-velocity, and high-variety information assets far better than

conventional tools and relational databases that struggle to capture, manage, and process big data
within a tolerable elapsed time and at an acceptable total cost of ownership.
Organizations are using new big data technologies and solutions such as Hadoop,
MapReduce, Hadoop Hive, Spark, Presto, Yarn, Pig, NoSQL databases, and more to support their
big data requirements.
Big data analytics helps organizations harness their data and use it to identify new opportunities.
That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier
customers.

1. Cost reduction.

Big data technologies such as Hadoop and cloud-based analytics bring significant cost
advantages when it comes to storing large amounts of data – plus they can identify more efficient
ways of doing business.

2. Faster, better decision making.

With the speed of Hadoop and in-memory analytics, combined with the ability to analyze new
sources of data, businesses are able to analyze information immediately – and make decisions
based on what they've learned.

3. Competitive Advantage

One of the major advantages of big data analytics is that it gives businesses access to data
that was previously unavailable or difficult to access. With increased access to data sources such
as social media streams and clickstream data, businesses can better target their marketing efforts
to customers, better predict demand for a certain product, and adapt marketing and advertising
messaging in real time. With these advantages, businesses are able to gain an edge on their
competitors and act more quickly and decisively than rival organizations.

4. New Business Opportunities

The final benefit of big data analytics tools is the possibility of exploring new business
opportunities. Entrepreneurs have taken advantage of big data technology to offer new services
in AdTech and MarketingTech. Mature companies can also take advantage of the data they
collect to offer add-on services or to create new product segments that offer additional value to
their current customers.
In addition to those benefits, big data analytics can pinpoint new or potential audiences that
have yet to be tapped by the enterprise. Finding whole new customer segments can lead to
tremendous new value.
These are just a few of the actionable insights made possible by available big data analytics
tools. Whether an organization is looking to boost sales and marketing results, uncover new
revenue opportunities, improve customer service, optimize operational efficiency, reduce risk,
improve security, or drive other business results, big data insights can help.
5. New products and services.
With the ability to gauge customer needs and satisfaction through analytics comes the power to
give customers what they want. Davenport points out that with big data analytics, more
companies are creating new products to meet customers' needs.

1.5.2 Big Data Challenges


The major challenges associated with big data are as follows:

 Capturing data
o Automatic identification and data capture (AIDC) refers to the methods of
automatically identifying objects, collecting data about them, and entering
that data directly into computer systems.
 Curation
o Curation is a field of endeavor involved with assembling, managing and presenting
some type of collection.
 Storage
o Storage for synchronous analytics: real-time analytics applications are typically run
on NoSQL databases, which are massively scalable and can be supported with
commodity hardware.
 Searching
o Search technologies specialize in addressing unstructured content sources,
helping to prepare, analyze and merge insight from human-generated
content with structured, machine-generated data.
 Sharing
o Share original data in a controlled way so that different groups within your
organization only see part of the whole.
 Transfer
o To capitalize on the tremendous business value inherent in big data and Hadoop, there
remains the challenge of big data transfer
 Analysis
o Analysis is the process of examining large data sets containing a variety of data types --
i.e., big data.
 Presentation
o It takes care that the data is sent in such a way that the receiver will understand the
information (data) and will be able to use the data.

To fulfill the above challenges, organizations normally take the help of enterprise servers.

1.6 Issues of Big Data

 Making tools easier to use. The Hadoop stack and NoSQL stores really do require programming
knowledge to unlock their power.

 Getting quicker answers across large data sets. Answers can already be obtained in "acceptable"
amounts of time; the goal is getting that 3-hour query down to 5 minutes or less. Apache
Impala (incubating) is a good example of work in this space.

 Integration with existing tools. A few companies are already working on
this, but no real standards are being developed for tight integration. Eventually,
how you query data should be seamless to the person querying it,
whether it is in a big data solution, an RDBMS, JSON/XML, etc.

 Better security models. There is almost no security in place for virtually all big data
tools. Once you get access, you get access to everything. Improvements are being
made, but it still isn't enterprise grade.

 Defining best practices for developing and using big data tools.

 Creating industry solutions that utilize big data. We already see this happening in a
couple of places, like utilities and healthcare, but it isn't yet widespread.

 Defining that use case that isn't analyzing web based information that all enterprises can
leverage. Right now, most big data use cases are centered around solving problems with
something involving the web.

 More mature software. Right now, Hadoop logs are riddled with errors, warnings, and
numerous other messages that are almost impossible to decipher, yet the system still
seems to work. There need to be improvements in better predicting and
avoiding problems when running MapReduce jobs and in giving human-readable
information to solve problems.

 Defining what big data actually is.

 Adoption. Truly widespread adoption of big data tools in the enterprise is still lacking;
most deployments are departmental solutions, technical solutions or proofs of concept.

1.6.1 Future of Big Data.

1. Data volumes will continue to grow. There's absolutely no question that we will continue
generating larger and larger volumes of data, especially considering that the number of handheld
devices and Internet-connected devices is expected to grow exponentially.

2. Ways to analyze data will improve. While SQL is still the standard, Spark is emerging as a
complementary tool for analysis and will continue to grow, according to Ovum.

3. More tools for analysis (without the analyst) will emerge. Microsoft
and Salesforce both recently announced features to let non-coders create apps to view
business data.

4. Prescriptive analytics will be built in to business analytics software. IDC predicts that half of
all business analytics software will include the intelligence where it's needed by 2020.

5. In addition, real-time streaming insights into data will be the hallmarks of data
winners going forward, according to Forrester. Users will want to be able to use data to make
decisions in real time with programs like Kafka and Spark.

6. Machine learning is a top strategic trend for 2016, according to Gartner. And Ovum predicts
that machine learning will be a necessary element for data preparation and predictive analysis in
businesses moving forward.

7. Big data will face huge challenges around privacy, especially with the new privacy regulation
by the European Union. Companies will be forced to address the 'elephant in the room' around
their privacy controls and procedures. Gartner predicts that by 2018, 50% of business ethics
violations will be related to data.

8. More companies will appoint a chief data officer. Forrester believes the CDO will see a rise in
prominence — in the short term. But certain types of businesses and even generational differences
will see less need for them in the future.

9. "Autonomous agents and things" will continue to be a huge trend, according to Gartner,
including robots, autonomous vehicles, virtual personal assistants, and smart advisers.

10. Big data staffing shortages will expand from analysts and scientists to include architects and
experts in data management according to IDC.

11. But the big data talent crunch may ease as companies employ new tactics. The International
Institute for Analytics predicts that companies will use recruiting and internal training to get their
personnel problems solved.

12. The data-as-a-service business model is on the horizon. Forrester suggests that after IBM's
acquisition of The Weather Channel, more businesses will attempt to monetize their
data.

13. Algorithm markets will also emerge. Forrester surmises that businesses will quickly learn that
they can purchase algorithms rather than program them and add their own data. Existing services
like Algorithmia, Data Xu, and Kaggle can be expected to grow and multiply.

14. Cognitive technology will be the new buzzword. For many businesses, the link between
cognitive computing and analytics will become synonymous in much the same way that
businesses now see similarities between analytics and big data.

15. "All companies are data businesses now," according to Forrester. More companies will attempt
to drive value and revenue from their data.

16. Businesses using data will see $430 billion in productivity benefits over their competition not
using data by 2020, according to International Institute for Analytics.

17. "Fast data" and "actionable data" will replace big data, according to some experts. The
argument is that big isn't necessarily better when it comes to data, and that businesses don't use
even a fraction of the data they have access to.

1.7 Big Data Use Cases

Big Data technologies can solve the business problems in a wide range of industries. Below are a
few use cases.

 Banking and Financial Services


o Fraud Detection to detect the possible fraud or suspicious transactions in Accounts, Credit
Cards, Debit Cards, and Insurance etc.

 Retail
o Targeting customers with different discounts, coupons, and promotions etc. based on
demographic data like gender, age group, location, occupation, dietary habits, buying patterns,
and other information which can be useful to differentiate/categorize the customers.

 Marketing
o Outbound marketing in particular can make use of customer demographic information like
gender, age group, location, occupation, and dietary habits, along with customer interests/preferences
usually expressed in the form of comments/feedback and on social media networks.
o Customer's communication preferences can be identified from various sources like polls,
reviews, comments/feedback, and social media etc. and can be used to target customers via
different channels like SMS, Email, Online Stores, Mobile Applications, and Retail Stores etc.

 Sentiment Analysis
o Organizations use the data from social media sites like Facebook, Twitter etc. to understand
what customers are saying about the company, its products, and services. This type of analysis
is also performed to understand which companies, brands, services, or technologies people are
talking about.

 Customer Service
o IT Services and BPO companies analyze the call records/logs to gain insights into customer
complaints and feedback, call center executive response/ability to resolve the ticket, and to
improve the overall quality of service.
o Call center data from telecommunications industries can be used to analyze the call
records/logs and optimize the price, and calling, messaging, and data plans etc.

Apart from these, Big Data technologies/solutions can solve the business problems in other
industries like Healthcare, Automobile, Aeronautical, Gaming, and Manufacturing etc.

1.8 Big Data Tools Classification

 Data Storage and Management

 Data Cleaning

 Data Mining

 Data Analysis

 Data Visualization

 Data Integration

 Data Languages

1.8.1 Data Storage and Management Tools

Hadoop
The name Hadoop has become synonymous with big data. It's an open-source software framework
for distributed storage of very large datasets on computer clusters, which means you can scale your
data up and down without having to worry about hardware failures. Hadoop provides massive
amounts of storage for any kind of data, enormous processing power and the ability to handle
virtually limitless concurrent tasks or jobs.

Cloudera
Cloudera is essentially a brand name for Hadoop with some extra services stuck on. They can help
your business build an enterprise data hub, to allow people in your organization better access to the
data you are storing. While it does have an open source element, Cloudera is mostly an enterprise
solution to help businesses manage their Hadoop ecosystem. Essentially, they do a lot of the hard
work of administering Hadoop for you. They will also deliver a certain amount of data security,
which is highly important if you're storing any sensitive or personal data.

MongoDB
MongoDB is the modern, start-up approach to databases. Think of it as an alternative to
relational databases. It's good for managing data that changes frequently or data that is unstructured
or semi-structured. Common use cases include storing data for mobile apps, product catalogs, real-
time personalization, content management and applications delivering a single view across multiple
systems.


Talend

Talend is another great open source company that offers a number of data products. Here we're
focusing on their Master Data Management (MDM) offering, which combines real-time data,
applications, and process integration with embedded data quality and stewardship. Because it's open
source, Talend is completely free, making it a good option no matter what stage of business you are
in. It also saves you having to build and maintain your own data management system – which is a
tremendously complex and difficult task.

1.8.2 Data Cleaning Tools

OpenRefine
OpenRefine (formerly GoogleRefine) is an open source tool that is dedicated to cleaning messy data.
You can explore huge data sets easily and quickly even if the data is a little unstructured. As far as
data software goes, OpenRefine is pretty user-friendly, though a good knowledge of data cleaning
principles certainly helps. The nice thing about OpenRefine is that it has a huge community with lots
of contributors meaning that the software is constantly getting better and better.

DataCleaner
DataCleaner recognises that data manipulation is a long and drawn-out task. Data visualization tools
can only read nicely structured, "clean" data sets. DataCleaner does the hard work for you and
transforms messy semi-structured data sets into clean, readable data sets that all of the visualization
companies can read. DataCleaner also offers data warehousing and data management services. The
company offers a 30-day free trial and then after that a monthly subscription fee.

1.8.3 Data Mining Tools

RapidMiner
With a client list that includes PayPal, Deloitte, eBay and Cisco, RapidMiner is a fantastic tool for
predictive analysis. It's powerful, easy to use and has a great open source community behind it. You
can even integrate your own specialized algorithms into RapidMiner through their APIs.

IBM SPSS Modeler


IBM SPSS Modeler offers a whole suite of solutions dedicated to data mining. Its five products
provide a range of advanced algorithms and techniques that include text analytics, entity analytics,
decision management and optimization.

SPSS Modeler is a heavy-duty solution that is well suited for the needs of big companies. It can run
on virtually any type of database and you can integrate it with other IBM SPSS products such as
SPSS collaboration and deployment services and the SPSS Analytic server.


Oracle data mining


Another big hitter in the data mining sphere is Oracle. As part of their Advanced Analytics Database
option, Oracle data mining allows its users to discover insights, make predictions and leverage their
Oracle data. You can build models to discover customer behavior, target best customers and develop
profiles. The Oracle Data Miner GUI enables data analysts, business analysts and data scientists to
work with data inside a database using a rather elegant drag and drop solution. It can also create
SQL and PL/SQL scripts for automation, scheduling and deployment throughout the enterprise.

Teradata
Teradata recognizes the fact that, although big data is awesome, if you don't actually know how to
analyze and use it, it's worthless. Imagine having millions upon millions of data points without the
skills to query them. That's where Teradata comes in. They provide end-to-end solutions and
services in data warehousing, big data and analytics and marketing applications. This all means that
you can truly become a data-driven business. Teradata also offers a whole host of services including
implementation, business consulting, training and support.

FramedData
If you're after a specific type of data mining, there are a bunch of startups which specialize in helping
businesses answer tough questions with data. If you're worried about user churn, we
recommend FramedData, a startup which analyzes your analytics and tells you which customers are
about to abandon your product.

Kaggle
If you're stuck on a data mining problem or want to try solving the world's toughest problems, check
out Kaggle. Kaggle is the world's largest data science community. Companies and researchers post
their data and statisticians and data miners from all over the world compete to produce the best
models.

1.8.4 Data Analysis Tools
Qubole
Qubole simplifies, speeds and scales big data analytics workloads against data stored on AWS,
Google, or Azure clouds. They take the hassle out of infrastructure wrangling. Once the IT policies
are in place, any number of data analysts can be set free to collaboratively "click to query" with the
power of Hive, Spark, Presto and many others in a growing list of data processing engines. Qubole is
an enterprise-level solution, and they offer a free trial you can sign up for. The flexibility
of the program really does set it apart from the rest, as well as being the most accessible of the
platforms.

BigML
BigML is attempting to simplify machine learning. They offer a powerful Machine Learning service
with an easy-to-use interface for you to import your data and get predictions out of it. You can even
use their models for predictive analytics. A good understanding of modeling is certainly helpful, but
not essential, if you want to get the most from BigML. They have a free version of the tool that
allows you to create tasks that are under 16 MB, as well as a pay-as-you-go plan and a virtual
private cloud that meets enterprise-grade requirements.

Statwing
Statwing takes data analysis to a new level providing everything from beautiful visuals to complex
analysis. They have a particularly cool blog post on NFL data! It's so simple to use that you can
actually get started with Statwing in under 5 minutes. This allows you to use unlimited datasets of up
to 50 MB in size each. There are other enterprise plans that give you the ability to upload bigger
datasets.

1.8.5 Data Visualization Tools

Tableau
Tableau is a data visualization tool with a primary focus on business intelligence. You can create
maps, bar charts, scatter plots and more without the need for programming. They recently released a
web connector that allows you to connect to a database or API, thus giving you the ability to get live
data into a visualisation. Exploring that tool should give you an idea of which of the other Tableau
products you'd rather pay for.

Silk
Silk is a much simpler data visualization and analytical tool than Tableau. It allows you to bring your
data to life by building interactive maps and charts with just a few clicks of the mouse. Silk also
allows you to collaborate on a visualisation with as many people as you want.

CartoDB
CartoDB is a data visualization tool that specialises in making maps. They make it easy for anyone
to visualize location data – without the need for any coding. CartoDB can manage a myriad of
data files and types, and they even have sample datasets that you can play around with while you're
getting the hang of it.

Chartio
Chartio allows you to combine data sources and execute queries in-browser. You can create
powerful dashboards in just a few clicks. Chartio's visual query language allows anyone to grab data
from anywhere without having to know SQL or other complicated modeling languages. They also let
you schedule PDF reports so you can export and email your dashboard as a PDF file to anyone you
want.

Plot.ly
If you want to build a graph, Plot.ly is the place to go. This handy platform allows you to
create stunning 2D and 3D charts (you really need to see it to believe it!), all without needing
programming knowledge. The free version allows you to create one private chart and unlimited public
charts, or you can upgrade to the enterprise packages to make unlimited private and public charts, as
well as giving you the option of vector exports and saving custom themes.

Datawrapper
Datawrapper is an open source tool that creates embeddable charts in minutes. Because it's open
source, it is constantly evolving, as anyone can contribute to it. They have an awesome chart
gallery where you can check out the kind of stuff people are doing with Datawrapper.

1.8.6 Data Integration Tools

Blockspring
Blockspring is a unique program in that it harnesses the power of services such as
IFTTT and Zapier within familiar platforms such as Excel and Google Sheets. You can connect to a
whole host of third-party programs by simply writing a Google Sheets formula. You can post Tweets
from a spreadsheet and see who your followers are following, as well as connecting to AWS,
Import.io and Tableau, to name a few. Blockspring is free to use, but they also have an organization
package that allows you to create and share private functions, add custom tags for easy search and
discovery, and set API tokens for your whole organization at once.

Pentaho
Pentaho offers big data integration with zero coding required. Using a simple drag and drop UI you
can integrate a number of tools with minimal coding. They also offer embedded analytics and
business analytics services. Pentaho is an enterprise solution: you can request a free trial of
the data integration product, after which a payment will be required.

1.8.7 Data Languages

R Language
R is a language for statistical computing and graphics. If the data mining and statistical software
listed above doesn't quite do what you want it to, learning R is the way forward. In fact, if you're
planning on being a data scientist, knowing R is a requirement.
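As a small illustration of the kind of statistical computing R is designed for, here is a minimal sketch
using base R functions on an invented vector of values:

# A small invented sample of daily page views.
views <- c(120, 135, 150, 90, 200, 175, 160)

# Basic descriptive statistics with base R.
print(mean(views))      # average value
print(sd(views))        # standard deviation
print(summary(views))   # min, quartiles, median, mean, max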

Python
Another language that is gaining popularity in the data community is Python. Created in the 1980s
and named after Monty Python's Flying Circus, it has consistently ranked among the top ten most
popular programming languages in the world. Many journalists use Python to write custom scrapers
when data collection tools fail to get the data that they need.

RegEx
RegEx, or Regular Expressions, is a notation for patterns of characters that can be used to manipulate
and change data. It is used mainly for pattern matching with strings, or string matching. In Import.io,
for example, you can use RegEx while extracting data to delete parts of a string or keep particular
parts of a string. It is an incredibly useful tool when doing data extraction, as you can get exactly what
you want when you extract data, meaning you don't need to rely on the data manipulation companies
mentioned above.
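As a brief, tool-neutral illustration, the sketch below uses base R's regular-expression functions on a
made-up string to keep or delete parts of it:

# A made-up string used only to illustrate pattern matching.
raw <- "Price: $1,299.99 (limited offer)"

# Keep only the part of the string that matches the price pattern.
price <- regmatches(raw, regexpr("[0-9,.]+", raw))
print(price)      # [1] "1,299.99"

# Delete everything that is not a digit or a dot.
cleaned <- gsub("[^0-9.]", "", raw)
print(cleaned)    # [1] "1299.99"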


XPath

XPath is a query language used for selecting certain nodes from an XML document. Whereas RegEx
manipulates and changes the data's makeup, XPath extracts the raw data ready for RegEx. XPath is
most commonly used in data extraction. Import.io actually creates XPaths automatically every time
you click on a piece of data – you just don't see them. It is also possible to insert your own XPath to
get data from drop-down menus and data that is in tabs on a webpage. Put simply, an XPath is a path,
a set of directions to a certain part of the HTML of a webpage.
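A minimal sketch of XPath in R, assuming the third-party XML package is installed (it is not part of
base R); the XML snippet here is invented for the example:

library(XML)

# A small, made-up XML document.
xml_text <- '<books>
               <book id="1"><title>Big Data</title></book>
               <book id="2"><title>R Basics</title></book>
             </books>'
doc <- xmlParse(xml_text, asText = TRUE)

# XPath expression: select the text of every <title> node under <book>.
titles <- xpathSApply(doc, "//book/title", xmlValue)
print(titles)    # [1] "Big Data" "R Basics"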


UNIT 2

Basic Big Data Analytic Methods and Modeling: Introduction to “R”, analyzing and exploring
data with “R”-Modeling: Architecture - Hybrid Data Modeling – Data Computing Modeling.

Course Objectives:
 To understand the concepts of Big Data and R programming
 To understand the importance of modeling architecture

Course Outcomes:
 The students can use the R programming tool for Big Data
 The students are able to analyze and visualize large data sets using R

2.1 Introduction to big data analytics

 Big data analytics is the process of examining large data sets containing a variety of data
types -- i.e., big data -- to uncover hidden patterns, unknown correlations, market trends,
customer preferences and other useful business information.

 The primary goal of big data analytics is to help companies make more informed business
decisions by enabling data scientists, predictive modelers and other analytics
professionals to analyze large volumes of transaction data, as well as other forms of data
that may be untapped by conventional business intelligence(BI) programs.

 That could include Web server logs and Internet clickstream data, social media content
and social network activity reports, text from customer emails and survey responses,
mobile-phone call detail records and machine data captured by sensors connected to
the Internet of Things.

 Big data can be analyzed with the software tools commonly used as part of advanced
analytics disciplines such as predictive analytics, data mining, text analytics and
statistical analysis.

 Many organizations looking to collect, process and analyze big data have turned to a
newer class of technologies that includes Hadoop and related tools such
as YARN,MapReduce, Spark, Hive and Pig as well as NoSQL databases.

 Those technologies form the core of an open source software framework that supports the
processing of large and diverse data sets across clustered systems.


Big Data Analytics

2.2 "R": A Programming Language


R is a programming language and software environment for statistical analysis, graphics
representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand, and is currently developed by the R Development Core
Team.

2.2.1 The R environment

R is an integrated suite of software facilities for data manipulation, calculation and graphical
display. Among other things it has

• An effective data handling and storage facility,

• A suite of operators for calculations on arrays, in particular matrices,

• A large, coherent, integrated collection of intermediate tools for data analysis,

• Graphical facilities for data analysis and display either directly at the computer or on hardcopy

• A well developed, simple and effective programming language (called 'S') which includes
conditionals, loops, user defined recursive functions and input and output facilities. (Indeed most
of the system supplied functions are themselves written in the S language.)

The term “environment” is intended to characterize it as a fully planned and coherent system,
rather than an incremental accretion of very specific and inflexible tools, as is frequently the case
with other data analysis software.

R is very much a vehicle for newly developing methods of interactive data analysis. It has
developed rapidly, and has been extended by a large collection of packages. However, most
programs written in R are essentially ephemeral, written for a single piece of data analysis.


2.2.1.1 Related software and documentation

R can be regarded as an implementation of the S language which was developed at Bell


Laboratories by Rick Becker, John Chambers and Allan Wilks, and also forms the basis of the S-
Plus systems.

The evolution of the S language is characterized by four books by John Chambers and coauthors.
For R, the basic reference is The New S Language: A Programming Environment for Data
Analysis and Graphics by Richard A. Becker, John M. Chambers and Allan R. Wilks. The new
features of the 1991 release of S are covered in Statistical Models in S edited by John M.
Chambers and Trevor J. Hastie. The formal methods and classes of the methods package are
based on those described in Programming with Data by John M. Chambers.

There are now a number of books which describe how to use R for data analysis and statistics,
and documentation for S/S-Plus can typically be used with R, keeping the differences between
the S implementations in mind.

2.2.2 R and statistics

The introduction to the R environment above did not mention statistics, yet many people use R as a
statistics system. We prefer to think of it as an environment within which many classical and
modern statistical techniques have been implemented. A few of these are built into the base R
environment, but many are supplied as packages. There are about 25 packages supplied with R
(called "standard" and "recommended" packages) and many more are available through the
CRAN family of Internet sites (via https://CRAN.R-project.org) and elsewhere.

There is an important difference in philosophy between S (and hence R) and the other main
statistical systems. In S a statistical analysis is normally done as a series of steps, with
intermediate results being stored in objects. Thus whereas SAS and SPSS will give copious
output from a regression or discriminant analysis, R will give minimal output and store the
results in a fit object for subsequent interrogation by further R functions.

2.2.3 R and the window system

The most convenient way to use R is at a graphics workstation running a windowing system.
This guide is aimed at users who have this facility. In particular we will occasionally refer to the
use of R on an X window system although the vast bulk of what is said applies generally to any
implementation of the R environment.

Most users will find it necessary to interact directly with the operating system on their computer
from time to time. In this guide, we mainly discuss interaction with the operating system on
UNIX machines. If you are running R under Windows or OS X you will need to make some
small adjustments


Setting up a workstation to take full advantage of the customizable features of R is a


straightforward if somewhat tedious procedure, and will not be considered further here. Users in
difficulty should seek local expert help.

2.2.4 Features of R
As stated earlier, R is a programming language and software environment for statistical
analysis, graphics representation and reporting. The following are the important features of R −

 R is a well-developed, simple and effective programming language which includes


conditionals, loops, user defined recursive functions and input and output facilities.
 R has an effective data handling and storage facility.
 R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
 R provides a large, coherent and integrated collection of tools for data analysis.
 R provides graphical facilities for data analysis and display, either directly at the
computer or as printed output.
In conclusion, R is the world's most widely used statistics programming language.

2.2.5 Example

# Print Hello World.

print("Hello World")

# Add two numbers.

print(23.9 + 11.6)
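When we execute the above code, it produces the following result −

[1] "Hello World"
[1] 35.5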

2.3 Environment Setup
2.3.1 Local Environment Setup

If you want to set up a local environment for R, you can follow the steps given below.

2.3.2 Windows Installation

 You can download the Windows installer version of R from R-3.2.2 for Windows (32/64
bit) and save it in a local directory.

 It is a Windows installer (.exe) with a name like "R-version-win.exe". You can just
double-click and run the installer, accepting the default settings. If your Windows is 32-bit,
it installs the 32-bit version; if your Windows is 64-bit, it installs
both the 32-bit and 64-bit versions.


 After installation you can locate the icon to run the program in a directory structure
"R\R-3.2.2\bin\i386\Rgui.exe" under the Windows Program Files. Clicking this icon
brings up the R GUI, which is the R console for doing R programming.

2.4 R - Basic Syntax

2.4.1 R Command Prompt


Once you have the R environment set up, it is easy to start the R command prompt by just
typing the following command at your command prompt −

$R

This will launch R interpreter and you will get a prompt > where you can start typing your
program as follows −

>myString<- "Hello, World!"

> print ( myString)

[1] "Hello, World!"

Here first statement defines a string variable myString, where we assign a string "Hello,
World!" and then next statement print() is being used to print the value stored in variable
myString.


2.4.1.1 R Script File


Usually, you will do your programming by writing your programs in script files and then
executing those scripts at your command prompt with the help of the R interpreter called Rscript. So
let's start by writing the following code in a text file called test.R, as under −

# My first program in R Programming

myString<- "Hello, World!"

print ( myString)

Save the above code in a file test.R and execute it at the Linux command prompt as given below.
Even if you are using Windows or another system, the syntax will remain the same.

$ Rscript test.R

When we run the above program, it produces the following result.

[1] "Hello, World!"

2.4.1.2 Comments
Comments are like helping text in your R program and they are ignored by the interpreter while
executing your actual program. Single comment is written using # in the beginning of the
statement as follows −

# My first program in R Programming

R does not support multi-line comments, but you can perform a trick which is something as
follows −

if(FALSE) {

   "This is a demo for multi-line comments and it should be put inside either a single
   or double quote"

}

myString <- "Hello, World!"

print(myString)

2.4.2 R - Data Types


Generally, while doing programming in any programming language, you need to use various
variables to store various information. Variables are nothing but reserved memory locations to
store values. This means that, when you create a variable you reserve some space in memory.

In contrast to other programming languages like C and Java, in R the variables are not declared
as being of a particular data type. Variables are assigned R objects, and the data type of the R object
becomes the data type of the variable. There are many types of R objects.

 Vectors
 Lists
 Matrices
 Arrays
 Factors
 Data Frames
The simplest of these objects is the vector object and there are six data types of these atomic
vectors, also termed as six classes of vectors. The other R-Objects are built upon the atomic
vectors.

Data Type   Example   Verify

Logical     TRUE, FALSE
            v <- TRUE
            print(class(v))
            [1] "logical"

Numeric     12.3, 5, 999
            v <- 23.5
            print(class(v))
            [1] "numeric"

Integer     2L, 34L, 0L
            v <- 2L
            print(class(v))
            [1] "integer"

Complex     3 + 2i
            v <- 2+5i
            print(class(v))
            [1] "complex"

Character   'a', "good", "TRUE", '23.4'
            v <- "TRUE"
            print(class(v))
            [1] "character"

Raw         "Hello" is stored as 48 65 6c 6c 6f
            v <- charToRaw("Hello")
            print(class(v))
            [1] "raw"

In R programming, the very basic data types are the R objects called vectors, which hold
elements of the classes shown above. Please note that in R the number of classes is not
confined to only the above six types. For example, we can use many atomic vectors and create an
array whose class will become array.

2.4.2.1 Vectors
When you want to create a vector with more than one element, you should use the c() function, which
combines the elements into a vector.

# Create a vector.

apple<- c('red','green',"yellow")

print(apple)

# Get the class of the vector.

print(class(apple))

When we execute the above code, it produces the following result −

[1] "red" "green" "yellow"


[1] "character"

2.4.2.2 Lists
A list is an R-object which can contain many different types of elements inside it like vectors,
functions and even another list inside it.

# Create a list.
list1 <- list(c(2,5,3),21.3,sin)

# Print the list.


print(list1)

When we execute the above code, it produces the following result −

[[1]]
[1] 2 5 3

[[2]]
[1] 21.3

[[3]]
function (x) .Primitive("sin")

2.4.2.3 Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the
matrix function.

# Create a matrix.
M = matrix( c('a','a','b','c','b','a'),nrow=2,ncol=3,byrow= TRUE)
print(M)

When we execute the above code, it produces the following result −

[,1] [,2] [,3]

[1,] "a" "a" "b"

[2,] "c" "b" "a"


2.4.2.4 Arrays
While matrices are confined to two dimensions, arrays can be of any number of dimensions.
The array function takes a dim attribute which creates the required number of dimensions. In the
example below we create an array with two elements, each of which is a 3x3 matrix.

# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)

When we execute the above code, it produces the following result −

,,1
[,1] [,2] [,3]
[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"

,,2

[,1] [,2] [,3]


[1,] "yellow" "green" "yellow"
[2,] "green" "yellow" "green"
[3,] "yellow" "green" "yellow"

2.4.2.5 Factors
Factors are R objects which are created using a vector. A factor stores the vector along with the
distinct values of the elements in the vector as labels. The labels are always character,
irrespective of whether the input vector is numeric, character, Boolean, etc. Factors are
useful in statistical modeling.

Factors are created using the factor() function. The nlevels() function gives the count of levels.

# Create a vector.
apple_colors<- c('green','green','yellow','red','red','red','green')

# Create a factor object.


factor_apple<-factor(apple_colors)

# Print the factor.



print(factor_apple)
print(nlevels(factor_apple))

When we execute the above code, it produces the following result −

[1] green  green  yellow red    red    red    green
Levels: green red yellow
# applying the nlevels function we can know the number of distinct values
[1] 3

2.4.2.6 Data Frames
Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain
different modes of data. The first column can be numeric while the second column can be
character and the third column can be logical. A data frame is a list of vectors of equal length.

Data Frames are created using the data.frame() function.

# Create the data frame.


BMI <- data.frame(
gender= c("Male","Male","Female"),
height= c(152,171.5,165),
weight= c(81,93,78),
Age=c(42,38,26)
)
print(BMI)

When we execute the above code, it produces the following result −

gender height weight Age

1 Male 152.0 81 42

2 Male 171.5 93 38

3 Female 165.0 78 26

2.4.3 R – Variables


A variable provides us with named storage that our programs can manipulate. A variable in R
can store an atomic vector, a group of atomic vectors or a combination of many R objects. A valid
variable name consists of letters, numbers and the dot or underscore characters, and must start
with a letter or with a dot that is not followed by a number.

Variable Name          Validity   Reason

var_name2.             Valid      Has letters, numbers, dot and underscore.

var_name%              Invalid    Has the character '%'. Only dot(.) and underscore are allowed.

2var_name              Invalid    Starts with a number.

.var_name, var.name    Valid      Can start with a dot(.), but the dot(.) should not be
                                  followed by a number.

.2var_name             Invalid    The starting dot is followed by a number, making it invalid.

_var_name              Invalid    Starts with _, which is not valid.

2.4.3.1 Variable Assignment

The variables can be assigned values using the leftward, rightward and equal-to operators. The
values of the variables can be printed using the print() or cat() function. The cat() function
combines multiple items into a continuous print output.

# Assignment using equal operator.


var.1 = c(0,1,2,3)

# Assignment using leftward operator.


var.2 <- c("learn","R")

# Assignment using rightward operator.


c(TRUE,1) -> var.3

print(var.1)
cat ("var.1 is ", var.1 ,"\n")


cat ("var.2 is ", var.2 ,"\n")


cat ("var.3 is ", var.3 ,"\n")

When we execute the above code, it produces the following result


[1] 0 1 2 3
var.1 is 0 1 2 3
var.2 is learn R
var.3 is 1 1
2.4.3.2 Data Type of a Variable

In R, a variable itself is not declared as being of any data type; rather, it gets the data type of the
R object assigned to it. So R is called a dynamically typed language, which means that we can
change the data type of the same variable again and again when using it in a program.
var_x<- "Hello"
cat("The class of var_x is ",class(var_x),"\n")

var_x<- 34.5
cat(" Now the class of var_x is ",class(var_x),"\n")

var_x<- 27L
cat(" Next the class of var_x becomes ",class(var_x),"\n")

2.4.4 R – Operators

An operator is a symbol that tells the interpreter to perform a specific mathematical or logical
manipulation. The R language is rich in built-in operators and provides the following types of
operators (a brief example follows the list below).

2.4.4.1 Types of Operators


We have the following types of operators in R programming −

 Arithmetic Operators
 Relational Operators
 Logical Operators
 Assignment Operators
 Miscellaneous Operators
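A minimal sketch illustrating each operator family listed above, using small invented vectors:

v <- c(2, 4, 6)
t <- c(1, 4, 3)

# Arithmetic operators work element-wise on vectors.
print(v + t)               # [1] 3 8 9
print(v * t)               # [1]  2 16 18

# Relational operators compare element-wise.
print(v > t)               # [1]  TRUE FALSE  TRUE

# Logical operators combine conditions element-wise.
print((v > t) & (v > 3))   # [1] FALSE FALSE  TRUE

# Assignment operators.
x <- 10                    # leftward assignment
20 -> y                    # rightward assignment
print(x + y)               # [1] 30

# Miscellaneous operators.
print(1:5)                 # colon operator builds a sequence: 1 2 3 4 5
print(4 %in% v)            # membership test: [1] TRUE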


2.4.4.2 R Functions

A function is a set of statements organized together to perform a specific task. R has a large
number of in-built functions and the user can create their own functions.

In R, a function is an object so the R interpreter is able to pass control to the function, along
with arguments that may be necessary for the function to accomplish the actions.

The basic syntax of an R function definition is as follows −

function_name <- function(arg_1, arg_2, ...) {

   Function body

}

2.4.4.3 Calling a Function
# Create a function to print squares of numbers in sequence.

new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}

# Call the function new.function supplying 6 as an argument.

new.function(6)
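When we execute the above code, it produces the following result −

[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36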

2.4.5 Strings

Following examples clarify the rules about creating a string in R.

a <- 'Start and end with single quote'


print(a)

b <- "Start and end with double quotes"


print(b)


c <- "single quote ' in between double quotes"


print(c)

d <- 'Double quotes " in between single quote'


print(d)
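When we execute the above code, it produces the following result −

[1] "Start and end with single quote"
[1] "Start and end with double quotes"
[1] "single quote ' in between double quotes"
[1] "Double quotes \" in between single quote"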


2.4.6 R- Lists

Lists are R objects which can contain elements of different types, such as numbers, strings, vectors
and even another list inside them. A list can also contain a matrix or a function as one of its
elements. A list is created using the list() function.

2.4.6.1 Creating a List


Following is an example to create a list containing strings, numbers, vectors and a logical value.

# Create a list containing strings, numbers, vectors and a logical value.


list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
print(list_data)

When we execute the above code, it produces the following result

[[1]]
[1] "Red"

[[2]]
[1] "Green"

[[3]]
[1] 21 32 11

[[4]]
[1] TRUE

[[5]]
[1] 51.23

[[6]]
[1] 119.1

2.5 R Packages


R packages are a collection of R functions, compiled code and sample data. They are stored
under a directory called "library" in the R environment. By default, R installs a set of packages
during installation.

More packages are added later, when they are needed for some specific purpose. When we start
the R console, only the default packages are available by default. Other packages which are
already installed have to be loaded explicitly to be used by the R program that is going to use
them.
Get the library locations containing R packages - .libPaths()

Get the list of all the packages installed - library()

Get all packages currently loaded in the R environment - search()

2.5.1 Installing a new package

There are two ways to add new R packages. One is installing directly from the CRAN directory
and another is downloading the package to your local system and installing it manually.

2.5.2 Install directly from CRAN


The following command gets the packages directly from CRAN webpage and installs the
package in the R environment. You may be prompted to choose a nearest mirror. Choose the
one appropriate to your location.

install.packages("Package Name")

# Install the package named "XML".


install.packages("XML")
2.5.3 Install a package manually
Go to the link R Packages to download the package needed. Save the package as a .zip file in a
suitable location in the local system. Now you can run the following command to install this
package in the R environment.

install.packages(file_name_with_path, repos = NULL, type = "source")

# Install the package named "XML"


install.packages("E:/XML_3.98-1.3.zip", repos = NULL, type = "source")


2.5.4 Load Package to Library


Before a package can be used in the code, it must be loaded into the current R environment. You
also need to load a package that was installed previously but is not available in the current
environment. A package is loaded using the following command

library("package Name", lib.loc = "path to library")

# Load the package named "XML"


library("XML")

2.6 R- Data Reshaping

Data Reshaping in R is about changing the way data is organized into rows and columns. Most of
the time data processing in R is done by taking the input data as a data frame. It is easy to extract
data from the rows and columns of a data frame, but there are situations when we need the data
frame in a format that is different from the format in which we received it. R has many functions
to split, merge and change the rows to columns and vice-versa in a data frame.

2.6.1 Joining Columns and Rows in a Data Frame


We can join multiple vectors to create a data frame using the cbind() function. Also, we can
combine two data frames by rows using the rbind() function.

# Create vector objects.


city<- c("Tampa","Seattle","Hartford","Denver")
state<- c("FL","WA","CT","CO")
zipcode<- c(33602,98104,06161,80294)

# Combine above three vectors into one data frame.


addresses<- cbind(city,state,zipcode)

# Print a header.
cat("# # # # The First data frame\n")

# Print the data frame.


print(addresses)

# Create another data frame with similar columns


new.address<- data.frame(
city = c("Lowry","Charlotte"),
state = c("CO","FL"),

zipcode = c("80230","33949"),
stringsAsFactors = FALSE
)

# Print a header.
cat("# # # The Second data frame\n")

# Print the data frame.


print(new.address)

# Combine rows form both the data frames.


all.addresses<- rbind(addresses,new.address)

# Print a header.
cat("# # # The combined data frame\n")

# Print the result.


print(all.addresses)

When we execute the above code, it produces the following result −

# # # # The First data frame


city state zipcode
[1,] "Tampa" "FL" "33602"
[2,] "Seattle" "WA" "98104"
[3,] "Hartford" "CT" "6161"
[4,] "Denver" "CO" "80294"

# # # The Second data frame


city state zipcode
1 Lowry CO 80230
2 Charlotte FL 33949

# # # The combined data frame


city state zipcode
1 Tampa FL 33602
2 Seattle WA 98104
3 Hartford CT 6161
4 Denver CO 80294
5 Lowry CO 80230


6 Charlotte FL 33949
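
Besides cbind() and rbind(), two data frames that share a column can also be joined with the
merge() function. The small sketch below uses made-up example values to illustrate the idea (the
population figures are placeholders).

# Merge two data frames on a common column.
df1 <- data.frame(city = c("Tampa", "Seattle", "Denver"), state = c("FL", "WA", "CO"))
df2 <- data.frame(city = c("Tampa", "Denver"), population = c(385000, 716000))

# merge() keeps the rows whose "city" value appears in both data frames.
merged <- merge(df1, df2, by = "city")
print(merged)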

2.6.2. R - Importing Files


In R, we can read data from files stored outside the R environment. We can also write data into
files which will be stored and accessed by the operating system. R can read and write into
various file formats like csv, excel, xml etc.

2.6.2.1 Getting and Setting the Working Directory


You can check which directory the R workspace is pointing to using the getwd() function. You
can also set a new working directory using the setwd() function.
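
A quick sketch (the path passed to setwd() below is only an example):

# Print the current working directory.
print(getwd())

# Set a new working directory.
setwd("/web/com")

# Verify the change.
print(getwd())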

2.6.3 Reading a CSV File


Following is a simple example of read.csv() function to read a CSV file available in your
current working directory −

data<- read.csv("input.csv")

print(data)

When we execute the above code, it produces the following result

id, name, salary, start_date, dept


1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 NA Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance

2.6.3.1 Analyzing the CSV File


By default the read.csv() function gives the output as a data frame. This can be easily checked
as follows. Also we can check the number of columns and rows.

data<- read.csv("input.csv")

print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
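
Since the result is an ordinary data frame, it can be queried further; for example, the record with
the maximum salary can be extracted as sketched below.

# Get the record(s) with the maximum salary from the CSV data read above.
data <- read.csv("input.csv")
retval <- subset(data, salary == max(salary))
print(retval)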


2.6.3.2 Reading the Excel File


The input.xlsx is read by using the read.xlsx() function as shown below. The result is stored as
a data frame in the R environment.
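
Note that read.xlsx() is not part of base R; it comes from an Excel-handling add-on package
(commonly the xlsx package). Assuming that package is used, it can be installed and loaded first:

# Install and load the package assumed to provide read.xlsx().
install.packages("xlsx")
library("xlsx")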

# Read the first worksheet in the file input.xlsx.

data<- read.xlsx("input.xlsx", sheetIndex = 1)

print(data)

When we execute the above code, it produces the following result −

id, name, salary, start_date, dept


1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 NA Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance

2.6.4 Reading the Binary File


The binary file created above stores all the data as continuous bytes. So we will read it by
choosing appropriate values of column names as well as the column values.
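
The file itself is created earlier in this material; for completeness, here is a sketch of how such a
binary file could be produced with writeBin() (the path and the three columns of the built-in
mtcars dataset are assumptions based on the reading code that follows).

# Sketch: create a binary file with 3 column names followed by 15 integer values.
write.filename <- file("/web/com/binmtcars.dat", "wb")
writeBin(c("cyl", "am", "gear"), write.filename)                     # column names
writeBin(as.integer(c(mtcars$cyl[1:5], mtcars$am[1:5], mtcars$gear[1:5])),
         write.filename)                                             # column values
close(write.filename)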

# Create a connection object to read the file in binary mode using "rb".
read.filename<- file("/web/com/binmtcars.dat", "rb")

# First read the column names. n = 3 as we have 3 columns.


column.names<- readBin(read.filename, character(), n = 3)

# Next read the column values. n = 18 as we have 3 column names and 15 values.
read.filename<- file("/web/com/binmtcars.dat", "rb")
bindata<- readBin(read.filename, integer(), n = 18)

# Print the data.


print(bindata)

# Read the values from the 4th to the 8th element, which represent "cyl".
cyldata = bindata[4:8]
print(cyldata)


# Read the values from the 9th to the 13th element, which represent "am".
amdata = bindata[9:13]
print(amdata)

# Read the values from the 14th to the 18th element, which represent "gear".
geardata = bindata[14:18]
print(geardata)

# Combine all the read values into a data frame.


finaldata = cbind(cyldata, amdata, geardata)
colnames(finaldata) = column.names
print(finaldata)

When we execute the above code, it produces the following result
[1] 7108963 1728081249 7496037 6 6 4
[7] 6 8 1 1 1 0
[13] 0 4 4 4 3 3

[1] 6 6 4 6 8

[1] 1 1 1 0 0

[1] 4 4 4 3 3

cyl am gear
[1,] 6 1 4
[2,] 6 1 4
[3,] 4 1 4
[4,] 6 0 3
[5,] 8 0 3

2.6.5 R – XML Files

XML (Extensible Markup Language) is a file format that shares both the file format and the data
on the World Wide Web, intranets, and elsewhere using standard ASCII text. Similar to HTML,
it contains markup tags. But unlike HTML, where the markup tags describe the structure of the
page, in XML the markup tags describe the meaning of the data contained in the file. You can
read an XML file in R using the "XML" package. This package can be installed using the
following command.

install.packages("XML")


2.6.5.1 Input Data


Create an XML file by copying the data below into a text editor like Notepad. Save the file with
a .xml extension, choosing the file type as All Files (*.*).

<RECORDS>
<EMPLOYEE>
<ID>1</ID>
<NAME>Rick</NAME>
<SALARY>623.3</SALARY>
<STARTDATE>1/1/2012</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>2</ID>
<NAME>Dan</NAME>
<SALARY>515.2</SALARY>
<STARTDATE>9/23/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>3</ID>
<NAME>Michelle</NAME>
<SALARY>611</SALARY>
<STARTDATE>11/15/2014</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>4</ID>
<NAME>Ryan</NAME>
<SALARY>729</SALARY>
<STARTDATE>5/11/2014</STARTDATE>
<DEPT>HR</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>5</ID>
<NAME>Gary</NAME>
<SALARY>843.25</SALARY>


<STARTDATE>3/27/2015</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>6</ID>
<NAME>Nina</NAME>
<SALARY>578</SALARY>
<STARTDATE>5/21/2013</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>7</ID>
<NAME>Simon</NAME>
<SALARY>632.8</SALARY>
<STARTDATE>7/30/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>8</ID>
<NAME>Guru</NAME>
<SALARY>722.5</SALARY>
<STARTDATE>6/17/2014</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>

</RECORDS>
2.6.5.2 Reading XML File
The xml file is read by R using the function xmlParse(). It is stored as a list in R.

# Load the package required to read XML files.


library("XML")

# Also load the other required package.


library("methods")

# Give the input file name to the function.


result<- xmlParse(file = "input.xml")


# Print the result.


print(result)
When we execute the above code, it prints the parsed XML document − the same eight employee
records shown above (ID, NAME, SALARY, STARTDATE and DEPT for each employee),
wrapped in their original tags.
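
The parsed records can also be pulled straight into a data frame using the xmlToDataFrame()
function from the same "XML" package, for example:

# Convert the employee records in the XML file into a data frame.
library("XML")
emp.data <- xmlToDataFrame("input.xml")
print(emp.data)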


2.7 Data analysis example with ggplot2 and dplyr

2.7.1 Load dataset


df.car_spec_data <- read.csv(url("http://www.sharpsightlabs.com/wp-content/uploads/2015/01/auto-snout_car-specifications_COMBINED.txt"))

df.car_spec_data$year <- as.character(df.car_spec_data$year)

2.7.2 Data Exploration with ggplot2 and dplyr

For our purposes here, data exploration is the application of data visualization and data
manipulation techniques to understand the properties of our dataset.

Let's start with a simple scatterplot of horsepower vs. top speed.

###########################################
# PLOT DATA (Preliminary Data Inspection) #
###########################################
#-------------------------
# Horsepower vs. Top Speed
#-------------------------
ggplot(data=df.car_spec_data, aes(x=horsepower_bhp, y=top_speed_mph)) +
geom_point(alpha=.4, size=4, color="#880011") +
ggtitle("Horsepower vs. Top Speed") +
labs(x="Horsepower, bhp", y="Top Speed,\n mph") +
theme.car_chart_SCATTER
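
The theme.car_chart_SCATTER object referenced above is a custom ggplot2 theme defined in
the original Sharp Sight Labs tutorial; if it is not defined in your session, the call can simply end
after labs(). Since this section also mentions dplyr, the sketch below summarises the same data
frame with dplyr (it only assumes the year and horsepower_bhp columns already used above).

library(dplyr)

# Average horsepower by model year, using the data frame loaded earlier.
df.car_spec_data %>%
  group_by(year) %>%
  summarise(avg_bhp = mean(horsepower_bhp, na.rm = TRUE)) %>%
  arrange(year)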


2.8 Modeling: Architecture

2.8.1 Data Modeling

 Data modeling is a useful technique for managing the workflow of various entities and
for sequencing that workflow so that a task completes successfully.
 For Hadoop and its big data model, we need a comprehensive study before implementing
any execution task and setting up the processing environment.
 Hadoop is mainly a collection of tools and techniques rather than a single technology, so
at each point in time we need a task execution environment and some projection plans as
well.
 The data modeling and logical workflow consist of an abstraction layer that is used to
manage data storage when the data is stored on physical drives in the Hadoop
Distributed File System.
 Because of the huge expansion of big data, we need a distributed and logically managed
system. Data modeling also helps us manage various data resources and creates a basic
layered data architecture in order to optimize data reuse and reduce execution failures.


2.8.2 Hybrid Data Modeling

 Apache Oozie is an inbuilt tool for managing MapReduce jobs and keeping them
synchronized, in order to maintain equilibrium among the tasks that job trackers assign
to task trackers.
 However, we still need a modeling scheme to manage and maintain the workflow of the
Hadoop framework, and for this we need a hybrid model for more flexibility.
 In spite of the many NoSQL databases that are used to resolve the problem of data
management for schema-on-read and schema-on-write, we still need a hybrid model in
order to improve the overall performance of SQL and NoSQL databases.
 As big data is changing a lot in terms of its execution approach, we need a new data
model and a new storage model (two separate models). We can fix these problems by
using the data migration technique to migrate big data (raw and unstructured data) into
NoSQL data stores.

2.8.3 Data Computing Model

 On top of the physical data model, we normally need to create data flows and compute
tasks for business requirements. With physical data modeling, it's possible to create a
computing model, which can present the logical path of computing data. This helps
computing tasks to be well designed and enables more efficient data reuse.
 Hadoop provides a new distributed data processing model, and its HBase database
provides an impressive solution for data replication, backup, scalability, and so on.
 Hadoop also provides the Map/Reduce computing framework to retrieve value from data
stored in a distributed system. Map/Reduce is a framework for parallel processing using
mappers, which divide a problem into smaller sub-problems, to feed reducers that
process the sub-problems and produce the final answer.


UNIT 3
Technology and Tools: MapReduce/Hadoop – NoSQL: Cassandra, HBase – Apache Mahout –
Tools.
Course Objectives:
 To understand the technology of MapReduce concepts
 To understand the importance of modeling architecture

Course Outcomes:
 The students can use the technology of MapReduce in big data
 The students will be able to develop and manage applications using NoSQL and
Cassandra
 They will learn how to develop various applications using these technologies
3.1 HADOOP
Doug Cutting, Mike Cafarella and team took the solution provided by Google and started
an Open Source Project called HADOOP in 2005 and Doug named it after his son's toy elephant.
Now Apache Hadoop is a registered trademark of the Apache Software Foundation.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in
parallel on different CPU nodes. In short, the Hadoop framework makes it possible to develop
applications that run on clusters of computers and perform complete statistical analysis of huge
amounts of data.


3.1.1 Hadoop Architecture


Hadoop framework includes following four modules:
 Hadoop Common: These are Java libraries and utilities required by other Hadoop
modules. These libraries provide filesystem and OS level abstractions and contain the
necessary Java files and scripts required to start Hadoop.
 Hadoop YARN: This is a framework for job scheduling and cluster resource
management.
 Hadoop Distributed File System (HDFS™): A distributed file system that provides
high-throughput access to application data.
 Hadoop MapReduce: This is YARN-based system for parallel processing of large data
sets.

We can use the following diagram to depict these four components available in the Hadoop framework.


3.1.2 How Does Hadoop Work?


3.1.2.1 Stage 1
 A user/application can submit a job to Hadoop (a Hadoop job client) for the required
process by specifying the following items:
 The location of the input and output files in the distributed file system.
 The java classes in the form of jar file containing the implementation of map and reduce
functions.
 The job configuration by setting different parameters specific to the job.

3.1.2.2 Stage 2
The Hadoop job client then submits the job (jar/executable etc) and configuration to the
JobTracker which then assumes the responsibility of distributing the software/configuration to
the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to
the job-client.

3.1.2.3 Stage 3
The TaskTrackers on different nodes execute the task as per MapReduce implementation
and output of the reduce function is stored into the output files on the file system.

3.1.3 Advantages of Hadoop
 Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines and, in
turn, utilizes the underlying parallelism of the CPU cores.

 Hadoop does not rely on hardware to provide fault-tolerance and high availability
(FTHA), rather Hadoop library itself has been designed to detect and handle failures at
the application layer.

 Servers can be added or removed from the cluster dynamically and Hadoop continues to
operate without interruption.


 Another big advantage of Hadoop is that apart from being open source, it is compatible
on all the platforms since it is Java based.

3.2 MAP REDUCE


3.2.1 MapReduce
 Hadoop MapReduce is a software framework for easily writing applications which
process big amounts of data in parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.

 The term MapReduce actually refers to the following two different tasks that Hadoop
programs perform:

 The Map Task: This is the first task, which takes input data and converts it into a set of
data, where individual elements are broken down into tuples (key/value pairs).

 The Reduce Task: This task takes the output from a map task as input and combines those
data tuples into a smaller set of tuples. The reduce task is always performed after the map
task.

 Typically both the input and the output are stored in a file-system. The framework takes
care of scheduling tasks, monitoring them and re-executes the failed tasks.

 The MapReduce framework consists of a single master JobTracker and one slave
TaskTracker per cluster-node. The master is responsible for resource management,
tracking resource consumption/availability, and scheduling the job's component tasks on
the slaves, monitoring them and re-executing the failed tasks. The slave TaskTrackers
execute the tasks as directed by the master and provide task-status information to the
master periodically.

 The JobTracker is a single point of failure for the Hadoop MapReduce service which
means if JobTracker goes down, all running jobs are halted.


3.2.2 What is MapReduce?
MapReduce is a processing technique and a program model for distributed computing
based on java. The MapReduce algorithm contains two important tasks, namely Map and
Reduce. Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs). Secondly, reduce task, which takes the
output from a map as an input and combines those data tuples into a smaller set of tuples. As the
sequence of the name MapReduce implies, the reduce task is always performed after the map
job.
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing primitives are
called mappers and reducers. Decomposing a data processing application into mappers and
reducers is sometimes nontrivial. But, once we write an application in the MapReduce form,
scaling the application to run over hundreds, thousands, or even tens of thousands of machines in
a cluster is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.

3.2.2.1 Why MapReduce?


Traditional Enterprise Systems normally have a centralized server to store and process
data. The following illustration depicts a schematic view of a traditional enterprise system. The
traditional model is certainly not suitable for processing huge volumes of scalable data, which
cannot be accommodated by standard database servers. Moreover, the centralized system creates
too much of a bottleneck while processing multiple files simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce
divides a task into small parts and assigns them to many computers. Later, the results are
collected at one place and integrated to form the result dataset.


3.2.2.2 How MapReduce Works?


The MapReduce algorithm contains two important tasks, namely Map and Reduce.
 The Map task takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).
 The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.
Let us now take a close look at each of the phases and try to understand their
significance.


 Input Phase − Here we have a Record Reader that translates each record in an input file
and sends the parsed data to the mapper in the form of key-value pairs.
 Map − Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.
 Intermediate Keys − The key-value pairs generated by the mapper are known as
intermediate keys.
 Combiner − A combiner is a type of local Reducer that groups similar data from the
map phase into identifiable sets. It takes the intermediate keys from the mapper as input
and applies a user-defined code to aggregate the values in the small scope of one mapper.
It is not a part of the main MapReduce algorithm; it is optional.
 Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads
the grouped key-value pairs onto the local machine, where the Reducer is running. The
individual key-value pairs are sorted by key into a larger data list. The data list groups
the equivalent keys together so that their values can be iterated easily in the Reducer task.
 Reducer − The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered, and
combined in a number of ways, and it requires a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final step.
 Output Phase − In the output phase, we have an output formatter that translates the final
key-value pairs from the Reducer function and writes them onto a file using a record writer.
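
To make these phases concrete, here is a tiny word-count sketch written in R (purely conceptual −
real MapReduce jobs are written against the Hadoop APIs, and the input lines below are made up
for illustration).

# Conceptual word count: map, shuffle/sort, reduce.
lines <- c("deer bear river", "car car river", "deer car bear")

# Map phase: emit a (key = word, value = 1) pair for every word.
mapped <- lapply(unlist(strsplit(lines, " ")), function(w) list(key = w, value = 1))

# Shuffle and sort: group the values by key.
grouped <- split(sapply(mapped, `[[`, "value"), sapply(mapped, `[[`, "key"))

# Reduce phase: sum the values for each key.
reduced <- sapply(grouped, sum)
print(reduced)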


Let us try to understand the two tasks Map & Reduce with the help of a small diagram −

3.2.2.3 MapReduce-Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter
receives around 500 million tweets per day, which is nearly 6,000 tweets per second. The
following illustration shows how Twitter manages its tweets with the help of MapReduce.

As shown in the illustration, the MapReduce algorithm performs the following actions −


 Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value
pairs.
 Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as
key-value pairs.
 Count − Generates a token counter per word.
 Aggregate Counters − Prepares an aggregate of similar counter values into small
manageable units.

3.2.3 The Algorithm
 Generally MapReduce paradigm is based on sending the computer to where the data
resides!
 MapReduce program executes in three stages, namely map stage, shuffle stage, and
reduce stage.
o Map stage: The map or mapper's job is to process the input data. Generally the
input data is in the form of file or directory and is stored in the Hadoop file
system (HDFS). The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of data.
o Reduce stage: This stage is the combination of the Shuffle stage and the Reduce
stage. The Reducer's job is to process the data that comes from the mapper. After
processing, it produces a new set of output, which will be stored in the HDFS.
 During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers
in the cluster.
 The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
 Most of the computing takes place on nodes with data on local disks that reduces the network
traffic.
 After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.


The MapReduce algorithm contains two important tasks, namely Map and Reduce.
 The map task is done by means of Mapper Class
 The reduce task is done by means of Reducer Class.
Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class
is used as input by Reducer class, which in turn searches matching pairs and reduces them.

MapReduce implements various mathematical algorithms to divide a task into small
parts and assign them to multiple systems. In technical terms, the MapReduce algorithm helps in
sending the Map & Reduce tasks to appropriate servers in a cluster.
These mathematical algorithms may include the following −
 Sorting
 Searching
 Indexing
 TF-IDF


3.2.3.1 Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce
implements sorting algorithm to automatically sort the output key-value pairs from the mapper
by their keys.
 Sorting methods are implemented in the mapper class itself.
 In the Shuffle and Sort phase, after tokenizing the values in the mapper class,
the Context class (user-defined class) collects the matching valued keys as a
collection.
 To collect similar key-value pairs (intermediate keys), the Mapper class takes
the help of RawComparator class to sort the key-value pairs.
 The set of intermediate key-value pairs for a given Reducer is automatically
sorted by Hadoop to form key-values (K2, {V2, V2, …}) before they are
presented to the Reducer.
3.2.3.2 Searching
Searching plays an important role in MapReduce algorithm. It helps in the combiner
phase (optional) and in the Reducer phase. Let us try to understand how Searching works with
the help of an example.
Example
The following example shows how MapReduce employs Searching algorithm to find out the
details of the employee who draws the highest salary in a given employee dataset.
 Let us assume we have employee data in four different files − A, B, C, and D. Let us also
assume there are duplicate employee records in all four files because of importing the
employee data from all database tables repeatedly. See the following illustration.


 The Map phase processes each input file and provides the employee data in key-value
pairs (<k, v> :<emp name, salary>). See the following illustration.

 The combiner phase (searching technique) will accept the input from the Map phase as
a key-value pair with employee name and salary. Using searching technique, the
combiner will check all the employee salary to find the highest salaried employee in
each file. See the following snippet.
<k: employee name, v: salary>
Max = the salary of the first employee (treated as the max salary)

if(v(next employee).salary > Max){
   Max = v(salary);
}
else{
   continue checking;
}
The expected result is as follows −

<satish, 26000> <gopal, 50000> <kiran, 45000> <manisha, 45000>

 Reducer phase − From each file, you will find the highest salaried employee. To avoid
redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same
algorithm is used in between the four <k, v> pairs, which are coming from four input
files.
The final output should be as follows − <gopal, 50000>
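
The reducer-side search can be mimicked in R over the combiner output shown above (the values
are taken directly from the example).

# Pick the highest salary from the combiner output of the four files.
salaries <- c(satish = 26000, gopal = 50000, kiran = 45000, manisha = 45000)
print(salaries[which.max(salaries)])   # gopal 50000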


3.2.3.3 Indexing
Normally indexing is used to point to a particular data and its address. It performs batch
indexing on the input files for a particular Mapper.
The indexing technique that is normally used in MapReduce is known as inverted
index. Search engines like Google and Bing use the inverted indexing technique. Let us try to
understand how Indexing works with the help of a simple example.
Example
The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file
names and their contents are in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
After applying the Indexing algorithm, we get the following output −
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2}
implies the term "is" appears in the files T[0], T[1], and T[2].
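
The same inverted index can be reproduced with a few lines of plain R (a sketch only, not
MapReduce code; the 0-based positions match the T[0], T[1], T[2] naming above).

# Build an inverted index for the three small documents.
docs <- c("it is what it is", "what is it", "it is a banana")
tokens <- strsplit(docs, " ")
terms <- sort(unique(unlist(tokens)))
inverted <- lapply(terms, function(t) which(sapply(tokens, function(d) t %in% d)) - 1)
names(inverted) <- terms
print(inverted)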

3.2.4 TF-IDF
TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse
Document Frequency. It is one of the common web analysis algorithms. Here, the term
'frequency' refers to the number of times a term appears in a document.
3.2.4.1 Term Frequency (TF)
It measures how frequently a particular term occurs in a document. It is calculated by the
number of times a word appears in a document divided by the total number of words in that
document.
TF(the) = (Number of times the term 'the' appears in a document) / (Total number of terms in
the document)


3.2.4.2 Inverse Document Frequency (IDF)


It measures the importance of a term. It is calculated by the number of documents in the
text database divided by the number of documents where a specific term appears.
While computing TF, all the terms are considered equally important. That means, TF
counts the term frequency for normal words like “is”, “a”, “what”, etc. Thus we need to know
the frequent terms while scaling up the rare ones, by computing the following −
IDF(the) = log(Total number of documents / Number of documents with the term 'the' in it).
The algorithm is explained below with the help of a small example.
Example
Consider a document containing 1000 words, wherein the word hive appears 50 times.
The TF for hive is then (50 / 1000) = 0.05.
Now, assume we have 10 million documents and the word hive appears in 1000 of these.
Then, the IDF is calculated as log(10,000,000 / 1,000) = 4.
The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
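
The same arithmetic can be checked in R (using log base 10, which matches the worked example
above).

# TF-IDF for the word "hive" in the example.
tf  <- 50 / 1000                    # term frequency = 0.05
idf <- log10(10000000 / 1000)       # inverse document frequency = 4
print(tf * idf)                     # TF-IDF weight = 0.2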

3.2.5 Terminology
 PayLoad - Applications implement the Map and the Reduce functions, and form the core
of the job.
 Mapper - Mapper maps the input key/value pairs to a set of intermediate key/value pair.
 NamedNode - Node that manages the Hadoop Distributed File System (HDFS).
 DataNode - Node where data is presented in advance before any processing takes place.
 MasterNode - Node where JobTracker runs and which accepts job requests from clients.
 SlaveNode - Node where Map and Reduce program runs.
 JobTracker - Schedules jobs and tracks the assigned jobs to the Task Tracker.
 Task Tracker - Tracks the task and reports status to JobTracker.
 Job - A program is an execution of a Mapper and Reducer across a dataset.
 Task - An execution of a Mapper or a Reducer on a slice of data.
 Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.


Example Scenario
Given below is the data regarding the electrical consumption of an organization. It
contains the monthly electrical consumption and the annual average for various years.

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg

1979 23 23 2 43 24 25 26 26 26 26 25 26 25

1980 26 27 28 28 28 30 31 31 31 30 30 30 29

1981 31 32 32 32 33 34 35 36 36 34 34 34 34

1984 39 38 39 39 39 41 42 43 40 39 38 38 40

1985 38 39 39 39 39 41 41 41 00 40 39 39 45

If the above data is given as input, we have to write applications to process it and
produce results such as finding the year of maximum usage, year of minimum usage, and so on.
This is a walkover for programmers with a finite number of records. They will simply write
the logic to produce the required output, and pass the data to the application written. But, think
of the data representing the electrical consumption of all the large-scale industries of a particular
state, since its formation. When we write applications to process such bulk data,
 They will take a lot of time to execute.
 There will be a heavy network traffic when we move data from source to network server
and so on.
To solve these problems, we have the MapReduce framework.

3.3 NoSQL
3.3.1 NoSQL Database
A NoSQL database (sometimes called Not Only SQL) is a database that provides a
mechanism to store and retrieve data other than the tabular relations used in relational
databases. These databases are schema-free, support easy replication, have simple APIs, are
eventually consistent, and can handle huge amounts of data.


The primary objective of a NoSQL database is to have


 simplicity of design,
 horizontal scaling, and
 finer control over availability.
NoSQL databases use different data structures compared to relational databases. This makes
some operations faster in NoSQL. The suitability of a given NoSQL database depends on the
problem it must solve.

3.3.2 NoSQL vs. Relational Database


The following table lists the points that differentiate a relational database from a NoSQL
database.

Relational Database NoSql Database

Supports powerful query language. Supports very simple query language.

It has a fixed schema. No fixed schema.

Follows ACID (Atomicity, Consistency, Isolation, and Durability). It is only “eventually consistent”.

Supports transactions. Does not support transactions.

Besides Cassandra, we have the following NoSQL databases that are quite popular:
 Apache HBase - HBase is an open source, non-relational, distributed database modeled
after Google's BigTable and is written in Java. It is developed as a part of Apache
Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for
Hadoop.
 MongoDB - MongoDB is a cross-platform document-oriented database system that
avoids using the traditional table-based relational database structure in favor of JSON-
like documents with dynamic schemas making the integration of data in certain types of
applications easier and faster.


3.3.3 NoSQL pros/cons


3.3.3.1 Advantages :
 High scalability
 Distributed Computing
 Lower cost
 Schema flexibility, semi-structure data
 No complicated Relationships
3.3.3.2 Disadvantages
 No standardization
 Limited query capabilities (so far)
 Eventual consistency is not intuitive to program for

3.3.4 Types of NoSQL Databases


3.3.4.1 Document Oriented Databases
Document oriented databases treat a document as a whole and avoid splitting a document
in its constituent name/value pairs. At a collection level, this allows for putting together a
diverse set of documents into a single collection. Document databases allow indexing of
documents on the basis of not only its primary identifier but also its properties. Different open-
source document databases are available today but the most prominent among the available
options are MongoDB and CouchDB. In fact, MongoDB has become one of the most popular
NoSQL databases.

3.3.4.2 Graph Based Databases


A graph database uses graph structures with nodes, edges, and properties to represent and
store data. By definition, a graph database is any storage system that provides index-free
adjacency. This means that every element contains a direct pointer to its adjacent element and
no index lookups are necessary. General graph databases that can store any graph are distinct
from specialized graph databases such as triple-stores and network databases. Indexes are used
for traversing the graph.


3.3.4.3 Column Based Databases


The column-oriented storage allows data to be stored effectively. It avoids consuming
space when storing nulls by simply not storing a column when a value doesn't exist for that
column. Each unit of data can be thought of as a set of key/value pairs, where the unit itself is
identified with the help of a primary identifier, often referred to as the primary key. Bigtable
and its clones tend to call this primary key the row-key.

3.3.4.4 Key Value Databases


The key of a key/value pair is a unique value in the set and can be easily looked up to
access the data. Key/value pairs are of varied types: some keep the data in memory and some
provide the capability to persist the data to disk. A simple, yet powerful, key/value store is
Oracle's Berkeley DB.

3.3.5 Popular NoSQL Databases


Let us summarize some popular NoSQL databases that fall in the above categories
respectively.
 Document Oriented Databases – MongoDB, CouchDB, Amazon SimpleDB, etc.
 Graph Based Databases – Neo4j, OrientDB, Facebook Open Graph, FlockDB, etc.
 Column Based Databases – HBase, Cassandra, Hypertable, etc.
 Key Value Databases – Membase, Redis, MemcacheDB, etc.

3.4 Cassandra
3.4.1 What is Apache Cassandra?
Apache Cassandra is an open source, distributed and decentralized storage system (database) for
managing very large amounts of structured data spread out across the world. It provides a highly
available service with no single point of failure.
Listed below are some of the notable points of Apache Cassandra:
 It is scalable, fault-tolerant, and consistent.
 It is a column-oriented database.


 Its distribution design is based on Amazon's Dynamo and its data model on Google's
Bigtable.
 Created at Facebook, it differs sharply from relational database management systems.
 Cassandra implements a Dynamo-style replication model with no single point of failure,
but adds a more powerful “column family” data model.
 Cassandra is being used by some of the biggest companies such as Facebook, Twitter,
Cisco, Rackspace, eBay, Netflix, and more.

3.4.2 Features of Cassandra


Cassandra has become so popular because of its outstanding technical features. Given below are
some of the features of Cassandra:

 Elastic scalability - Cassandra is highly scalable; it allows you to add more hardware to
accommodate more customers and more data as per requirement.
 Always on architecture - Cassandra has no single point of failure and it is continuously
available for business-critical applications that cannot afford a failure.
 Fast linear-scale performance - Cassandra is linearly scalable, i.e., it increases your
throughput as you increase the number of nodes in the cluster. Therefore it maintains a
quick response time.
 Flexible data storage - Cassandra accommodates all possible data formats including:
structured, semi-structured, and unstructured. It can dynamically accommodate changes
to your data structures according to your need.
 Easy data distribution - Cassandra provides the flexibility to distribute data where you
need by replicating data across multiple data centers.
 Transaction support - Cassandra supports properties like Atomicity, Consistency,
Isolation, and Durability (ACID).
 Fast writes - Cassandra was designed to run on cheap commodity hardware. It performs
blazingly fast writes and can store hundreds of terabytes of data, without sacrificing the
read efficiency.


3.4.3 History of Cassandra


 Cassandra was developed at Facebook for inbox search.
 It was open-sourced by Facebook in July 2008.
 Cassandra was accepted into Apache Incubator in March 2009.
 It was made an Apache top-level project since February 2010.

The design goal of Cassandra is to handle big data workloads across multiple nodes without
any single point of failure. Cassandra has peer-to-peer distributed system across its nodes, and
data is distributed among all the nodes in a cluster.
 All the nodes in a cluster play the same role. Each node is independent and at the same
time interconnected to other nodes.
 Each node in a cluster can accept read and write requests, regardless of where the data is
actually located in the cluster.
 When a node goes down, read/write requests can be served from other nodes in the
network.
3.4.4 Data Replication in Cassandra
In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of
data. If it is detected that some of the nodes responded with an out-of-date value, Cassandra will
return the most recent value to the client. After returning the most recent value, Cassandra
performs a read repair in the background to update the stale values.
The following figure shows a schematic view of how Cassandra uses data replication
among the nodes in a cluster to ensure no single point of failure.


Note − Cassandra uses the Gossip Protocol in the background to allow the nodes to
communicate with each other and detect any faulty nodes in the cluster.
Components of Cassandra
The key components of Cassandra are as follows −
 Node − It is the place where data is stored.
 Data center − It is a collection of related nodes.
 Cluster − A cluster is a component that contains one or more data centers.
 Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write
operation is written to the commit log.
 Mem-table − A mem-table is a memory-resident data structure. After commit log, the
data will be written to the mem-table. Sometimes, for a single-column family, there will
be multiple mem-tables.
 SSTable − It is a disk file to which the data is flushed from the mem-table when its
contents reach a threshold value.
 Bloom filter − These are nothing but quick, nondeterministic, algorithms for testing
whether an element is a member of a set. It is a special kind of cache. Bloom filters are
accessed after every query.

3.4.5 Cassandra Query Language


Users can access Cassandra through its nodes using Cassandra Query Language (CQL).
CQL treats the database (Keyspace) as a container of tables. Programmers use cqlsh (a prompt
for working with CQL) or separate application language drivers.
Clients approach any of the nodes for their read-write operations. That node
(coordinator) plays a proxy between the client and the nodes holding the data.

3.4.5.1 Write Operations


Every write activity of nodes is captured by the commit logs written in the nodes. Later
the data will be captured and stored in the mem-table. Whenever the mem-table is full, data
will be written into the SStable data file. All writes are automatically partitioned and replicated
throughout the cluster. Cassandra periodically consolidates the SSTables, discarding
unnecessary data.


3.4.5.2 Read Operations


During read operations, Cassandra gets values from the mem-table and checks the bloom
filter to find the appropriate SSTable that holds the required data.

3.4.6 Data Models of Cassandra and RDBMS


The following table lists down the points that differentiate the data model of Cassandra from
that of an RDBMS.
RDBMS Cassandra
RDBMS deals with structured data. Cassandra deals with unstructured data.
It has a fixed schema. Cassandra has a flexible schema.
In RDBMS, a table is an array of arrays (ROW x COLUMN). In Cassandra, a table is a list of “nested key-value pairs” (ROW x COLUMN key x COLUMN value).
Database is the outermost container that contains data corresponding to an application. Keyspace is the outermost container that contains data corresponding to an application.
Tables are the entities of a database. Tables or column families are the entity of a keyspace.
Row is an individual record in RDBMS. Row is a unit of replication in Cassandra.
Column represents the attributes of a relation. Column is a unit of storage in Cassandra.
RDBMS supports the concepts of foreign keys, joins. Relationships are represented using collections.

3.5 HBASE
3.5.1 What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system.
It is an open-source project and is horizontally scalable.
HBase is a data model that is similar to Google's Bigtable, designed to provide quick
random access to huge amounts of structured data. It leverages the fault tolerance provided by
the Hadoop File System (HDFS).


It is a part of the Hadoop ecosystem that provides random real-time read/write access to
data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File
System and provides read and write access.

3.5.2 HBase and HDFS


HDFS HBase
HDFS is a distributed file system suitable for storing large files. HBase is a database built on top of the HDFS.
HDFS does not support fast individual record lookups. HBase provides fast lookups for larger tables.
It provides high latency batch processing; no concept of batch processing. It provides low latency access to single rows from billions of records (Random access).
It provides only sequential access of data. HBase internally uses Hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.

3.5.3 Storage Mechanism in HBase


HBase is a column-oriented database and the tables in it are sorted by row. The table
schema defines only column families, which are the key value pairs. A table can have multiple
column families and each column family can have any number of columns. Subsequent column

values are stored contiguously on the disk. Each cell value of the table has a timestamp. In
short, in an HBase:
 Table is a collection of rows.
 Row is a collection of column families.
 Column family is a collection of columns.
 Column is a collection of key value pairs.
Given below is an example schema of table in HBase.
Rowid Column Family Column Family Column Family Column Family
col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3
1
2
3

3.5.4 Column Oriented and Row Oriented


Column-oriented databases are those that store data tables as sections of columns of data, rather
than as rows of data. In short, they have column families.
Row-Oriented Database Column-Oriented Database
It is suitable for Online Transaction Processing (OLTP). It is suitable for Online Analytical Processing (OLAP).
Such databases are designed for a small number of rows and columns. Column-oriented databases are designed for huge tables.
The following image shows column families in a column-oriented database:


3.5.5 HBase and RDBMS


HBase RDBMS
HBase is schema-less; it doesn't have the concept of a fixed columns schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of tables.
It is built for wide tables. HBase is horizontally scalable. It is thin and built for small tables. Hard to scale.
No transactions are there in HBase. RDBMS is transactional.
It has de-normalized data. It will have normalized data.
It is good for semi-structured as well as structured data. It is good for structured data.

3.5.6 Features of HBase


 HBase is linearly scalable.
 It has automatic failure support.
 It provides consistent reads and writes.
 It integrates with Hadoop, both as a source and a destination.
 It has easy java API for client.
 It provides data replication across clusters.

3.5.7 Where to Use HBase


 Apache HBase is used to have random, real-time read/write access to Big Data.
 It hosts very large tables on top of clusters of commodity hardware.
 Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable
acts upon Google File System; likewise, Apache HBase works on top of Hadoop and
HDFS.

3.5.8 Applications of HBase


 It is used whenever there is a need to write heavy applications.
 HBase is used whenever we need to provide fast random access to available data.
 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.


3.5.9 HBase History


Year Event
Nov 2006 Google released the paper on BigTable.
Feb 2007 Initial HBase prototype was created as a Hadoop contribution.
Oct 2007 The first usable HBase along with Hadoop 0.15.0 was released.
Jan 2008 HBase became the sub project of Hadoop.
Oct 2008 HBase 0.18.1 was released.
Jan 2009 HBase 0.19.0 was released.
Sept 2009 HBase 0.20.0 was released.
May 2010 HBase became Apache top-level project.

3.5.10 Architecture
In HBase, tables are split into regions and are served by the region servers. Regions are
vertically divided by column families into “Stores”. Stores are saved as files in HDFS. Shown
below is the architecture of HBase.
Note: The term 'store' is used for regions to explain the storage structure.

HBase has three major components: the client library, a master server, and region
servers. Region servers can be added or removed as per requirement.


3.5.10.1 MasterServer
The master server -
 Assigns regions to the region servers and takes the help of Apache ZooKeeper for this
task.
 Handles load balancing of the regions across region servers. It unloads the busy servers
and shifts the regions to less occupied servers.
 Maintains the state of the cluster by negotiating the load balancing.
 Is responsible for schema changes and other metadata operations such as creation of
tables and column families.
3.5.10.2 Regions
Regions are nothing but tables that are split up and spread across the region servers.
3.5.10.3 Region server
The region servers have regions that -
 Communicate with the client and handle data-related operations.
 Handle read and write requests for all the regions under it.
 Decide the size of the region by following the region size thresholds.

When we take a deeper look into the region server, it contains regions and stores as shown
below:


The store contains memory store and HFiles. Memstore is just like a cache memory.
Anything that is entered into the HBase is stored here initially. Later, the data is transferred and
saved in Hfiles as blocks and the memstore is flushed.
3.5.10.4 Zookeeper
 Zookeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization, etc.
 Zookeeper has ephemeral nodes representing different region servers. Master servers use
these nodes to discover available servers.
 In addition to availability, the nodes are also used to track server failures or network
partitions.
 Clients communicate with region servers via zookeeper.
 In pseudo and standalone modes, HBase itself will take care of zookeeper.

3.6 Apache Mahout


3.6.1 What is Apache Mahout?
A mahout is one who drives an elephant as its master. The name comes from its close
association with Apache Hadoop which uses an elephant as its logo.
Hadoop is an open-source framework from Apache that allows to store and process big data in
a distributed environment across clusters of computers using simple programming models.
Apache Mahout is an open source project that is primarily used for creating scalable machine
learning algorithms. It implements popular machine learning techniques such as:
 Recommendation
 Classification
 Clustering
Apache Mahout started as a sub-project of Apache's Lucene in 2008. In 2010, Mahout became
a top-level project of Apache.
3.6.2 Features of Mahout
The primitive features of Apache Mahout are listed below.
 The algorithms of Mahout are written on top of Hadoop, so it works well in distributed
environment. Mahout uses the Apache Hadoop library to scale effectively in the cloud.


 Mahout offers the coder a ready-to-use framework for doing data mining tasks on large
volumes of data.
 Mahout lets applications analyze large sets of data effectively and in quick time.
 Includes several MapReduce enabled clustering implementations such as k-means, fuzzy
k-means, Canopy, Dirichlet, and Mean-Shift.
 Supports Distributed Naive Bayes and Complementary Naive Bayes classification
implementations.
 Comes with distributed fitness function capabilities for evolutionary programming.
 Includes matrix and vector libraries.
3.6.3 Applications of Mahout
 Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use
Mahout internally.
 Foursquare helps you in finding out places, food, and entertainment available in a
particular area. It uses the recommender engine of Mahout.
 Twitter uses Mahout for user interest modelling.
 Yahoo! uses Mahout for pattern mining.
Apache Mahout is a highly scalable machine learning library that enables developers to use
optimized algorithms. Mahout implements popular machine learning techniques such as
recommendation, classification, and clustering. Therefore, it is prudent to have a brief section
on machine learning before we move further.

3.6.4 What is Machine Learning?


Machine learning is a branch of science that deals with programming the systems in such
a way that they automatically learn and improve with experience. Here, learning means
recognizing and understanding the input data and making wise decisions based on the supplied
data.
It is very difficult to cater to all the decisions based on all possible inputs. To tackle this
problem, algorithms are developed. These algorithms build knowledge from specific data and
past experience with the principles of statistics, probability theory, logic, combinatorial
optimization, search, reinforcement learning, and control theory.
The developed algorithms form the basis of various applications such as:


 Vision processing
 Language processing
 Forecasting (e.g., stock market trends)
 Pattern recognition
 Games
 Data mining
 Expert systems
 Robotics
There are several ways to implement machine learning techniques, however the most
commonly used ones are supervised and unsupervised learning.
3.6.4.1 Supervised Learning
Supervised learning deals with learning a function from available training data. A supervised
learning algorithm analyzes the training data and produces an inferred function, which can be
used for mapping new examples. Common examples of supervised learning include:
 classifying e-mails as spam,
 labeling webpages based on their content, and
 voice recognition.
There are many supervised learning algorithms such as neural networks, Support Vector
Machines (SVMs), and Naive Bayes classifiers. Mahout implements Naive Bayes classifier.
3.6.4.2 Unsupervised Learning
Unsupervised learning makes sense of unlabeled data without having any predefined dataset
for its training. Unsupervised learning is an extremely powerful tool for analyzing available data
and look for patterns and trends. It is most commonly used for clustering similar input into
logical groups. Common approaches to unsupervised learning include:
 k-means
 self-organizing maps, and
 hierarchical clustering
3.6.4.3 Recommendation
Recommendation is a popular technique that provides close recommendations based on user
information such as previous purchases, clicks, and ratings.


 Amazon uses this technique to display a list of recommended items that you might be
interested in, drawing information from your past actions. There are recommender
engines that work behind Amazon to capture user behavior and recommend selected
items based on your earlier actions.
 Facebook uses the recommender technique to identify and recommend the “people you
may know list”.

3.6.4.4 Classification
Classification, also known as categorization, is a machine learning technique that uses
known data to determine how the new data should be classified into a set of existing categories.
Classification is a form of supervised learning.
 Mail service providers such as Yahoo! and Gmail use this technique to decide whether a
new mail should be classified as a spam. The categorization algorithm trains itself by
analyzing user habits of marking certain mails as spams. Based on that, the classifier
decides whether a future mail should be deposited in your inbox or in the spams folder.
 iTunes application uses classification to prepare playlists.


3.6.4.5 Clustering
Clustering is used to form groups or clusters of similar data based on common
characteristics. Clustering is a form of unsupervised learning.
 Search engines such as Google and Yahoo! use clustering techniques to group data with
similar characteristics.
 Newsgroups use clustering techniques to group various articles based on related topics.
The clustering engine goes through the input data completely and based on the
characteristics of the data, it will decide under which cluster it should be grouped. Take a look
at the following example.
The Goals of Clustering
So, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But
how do we decide what constitutes a good clustering? It can be shown that there is no absolute “best”
criterion which would be independent of the final aim of the clustering. Consequently, it is the
user who must supply this criterion, in such a way that the result of the clustering will suit their
needs.
For instance, we could be interested in finding representatives for homogeneous groups (data
reduction), in finding “natural clusters” and describe their unknown properties (“natural” data
types), in finding useful and suitable groupings (“useful” data classes) or in finding unusual data
objects (outlier detection).


Possible Applications
Clustering algorithms can be applied in many fields, for instance:

 Marketing: finding groups of customers with similar behavior given a large database of
customer data containing their properties and past buying records;
 Biology: classification of plants and animals given their features;
 Libraries: book ordering;
 Insurance: identifying groups of motor insurance policy holders with a high average
claim cost; identifying frauds;
 City-planning: identifying groups of houses according to their house type, value and
geographical location;
 Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
 WWW: document classification; clustering weblog data to discover groups of similar access patterns.


UNIT 4
Big Data Security: Big Data Security, Compliance, Auditing and Protection: Pragmatic Steps to
Securing Big Data, Classifying Data, Protecting Big Data Analytics, Big Data and Compliance,
The Intellectual Property Challenge –Big Data in Cyber defense.

Course Objectives:
 To understand the concepts of Big data security and classification
 To understand the importance of intellectual property and cyber defense in Big data

Course Outcomes:
 The students can use the tools of Big Data
 The students will be able to provide security for Big Data
 They will learn how to handle security issues in Big Data

4.1 Security, Compliance, Auditing, and Protection

The sheer size of a Big Data repository brings with it a major security challenge, generating the
age-old question presented to IT: How can the data be protected? However, that is a trick
question—the answer has many caveats, which dictate how security must be imagined as well as
deployed. Proper security entails more than just keeping the bad guys out; it also means backing
up data and protecting data from corruption.

The first caveat is access. Data can be easily protected, but only if you eliminate access to the
data. That’s not a pragmatic solution, to say the least. The key is to control access, but even then,
knowing the who, what, when, and where of data access is only a start.

The second caveat is availability: controlling where the data are stored and how the data are
distributed. The more control you have, the better you are positioned to protect the data.

The third caveat is performance. Higher levels of encryption, complex security methodologies,
and additional security layers can all improve security. However, these security techniques all
carry a processing burden that can severely affect performance.

The fourth caveat is liability. Accessible data carry with them liability, such as the sensitivity of
the data, the legal requirements connected to the data, privacy issues, and intellectual property
concerns.


Adequate security in the Big Data realm becomes a strategic balancing act among these caveats
along with any additional issues the caveats create. Nonetheless, effective security is an
obtainable, if not perfect, goal. With planning, logic, and observation, security becomes
manageable and omnipresent, effectively protecting data while still offering access to authorized
users and systems.

4.2 PRAGMATIC STEPS TO SECURING BIG DATA

Securing the massive amounts of data that are inundating organizations can be addressed in
several ways. A starting point is to basically get rid of data that are no longer needed. If you do
not need certain information, it should be destroyed, because it represents a risk to the
organization. That risk grows every day for as long as the information is kept. Of course, there
are situations in which information cannot legally be destroyed; in that case, the information
should be securely archived by an offline method.

The real challenge may be determining whether the data are needed—a difficult task in the world
of Big Data, where value can be found in unexpected places. For example, getting rid of activity
logs may be a smart move from a security standpoint. After all, those seeking to compromise
networks may start by analyzing activity so they can come up with a way to monitor and
intercept traffic to break into a network. In a sense, those logs present a serious risk to an
organization, and to prevent the logs from being exposed, the best method may be to delete them
after their usefulness ends.

However, those logs could be used to determine scale, use, and efficiency of large data systems,
an analytical process that falls right under the umbrella of Big Data analytics. Here a catch-22 is
created: Logs are a risk, but analyzing those logs properly can mitigate risks as well. Should you
keep or dispose of the data in these cases?

There is no easy answer to that dilemma, and it becomes a case of choosing the lesser of two
evils. If the data have intrinsic value for analytics, they must be kept, but that does not mean they
need to be kept on a system that is connected to the Internet or other systems. The data can be
archived, retrieved for processing, and then returned to the archive.

4.3 CLASSIFYING DATA

Protecting data becomes much easier if the data are classified—that is, the data should be divided
into appropriate groupings for management purposes. A classification system does not have to be
very sophisticated or complicated to enable the security process, and it can be limited to a few
different groups or categories to keep things simple for processing and monitoring.

With data classification in mind, it is essential to realize that all data are not created equal. For
example, internal e-mails between two colleagues should not be secured or treated the same way
as financial reports, human resources (HR) information, or customer data.

Understanding the classifications and the value of the data sets is not a one-task job; the life-
cycle management of data may need to be shared by several departments or teams in an
enterprise. For example, you may want to divide the responsibilities among technical, security,
and business organizations. Although it may sound complex, it really isn’t all that hard to
educate the various corporate shareholders to understand the value of data and where their
responsibilities lie.

Classification can become a powerful tool for determining the sensitivity of data. A simple
approach may just include classifications such as financial, HR, sales, inventory, and
communications, each of which is self-explanatory and offers insight into the sensitivity of the
data.

Once organizations better understand their data, they can take important steps to segregate the
information, which will make the deployment of security measures like encryption and
monitoring more manageable. The more data are placed into silos at higher levels, the easier it
becomes to protect and control them. Smaller sample sizes are easier to protect and can be
monitored separately for specific necessary controls.
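As a rough sketch of what such a classification scheme can look like in code, the following plain Java example tags business categories with a sensitivity level so that later controls (encryption, monitoring, segregation into silos) can be applied per class. The category names and routing rules are assumptions made only for illustration.

import java.util.*;

// A minimal sketch of tagging records with a sensitivity class; the categories
// and rules below are illustrative assumptions, not a prescribed scheme.
public class DataClassifier {
    enum Sensitivity { PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED }

    // Map a business category to a sensitivity level.
    static Sensitivity classify(String category) {
        switch (category.toLowerCase()) {
            case "financial":
            case "hr":            return Sensitivity.RESTRICTED;
            case "customer":      return Sensitivity.CONFIDENTIAL;
            case "sales":
            case "inventory":     return Sensitivity.INTERNAL;
            default:              return Sensitivity.PUBLIC;
        }
    }

    public static void main(String[] args) {
        List<String> categories = Arrays.asList("financial", "hr", "sales", "communications");
        for (String c : categories) {
            // Higher sensitivity data would be routed to an encrypted, monitored silo.
            System.out.println(c + " -> " + classify(c));
        }
    }
}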

4.4 PROTECTING BIG DATA ANALYTICS

It is sad to report that protecting data is an often forgotten obligation in the data center, an
afterthought that falls behind current needs. The launch of Big Data initiatives is no exception,
and protection is too often treated as something to bolt on later.
than most other data center technologies, making it the perfect storm for a data protection
disaster.

The real cause of concern is the fact that Big Data contains all of the things you don’t want to see
when you are trying to protect data. Big Data can contain very unique sample sets—for example,
data from devices that monitor physical elements (e.g., traffic, movement, soil pH, rain, wind) on
a frequent schedule, surveillance cameras, or any other type of data that are accumulated
frequently and in real time. All of the data are unique to the moment, and if they are lost, they are
impossible to recreate.

That uniqueness also means you cannot leverage time-saving backup preparation and security
technologies, such as deduplication; this greatly increases the capacity requirements for backup
subsystems, slows down security scanning, makes it harder to detect data corruption, and
complicates archiving.

There is also the issue of the large size and number of files often found in Big Data analytic
environments. In order for a backup application and associated appliances or hardware to churn
through a large number of files, bandwidth to the backup systems and/or the backup appliance
must be large, and the receiving devices must be able to ingest data at the rate that the data can
be delivered, which means that significant CPU processing power is necessary to churn through
billions of files.

There is more to backup than just processing files. Big Data normally includes a database
component, which cannot be overlooked. Analytic information is often processed into an Oracle,
NoSQL, or Hadoop environment of some type, so real-time (or live) protection of that

environment may be required. A database component shifts the backup ideology from a massive
number of small files to be backed up to a small number of massive files to be backed up. That
changes the dynamics of how backups need to be processed.

Big Data often presents the worst-case scenario for most backup appliances, in which the
workload mix consists of billions of small files and a small number of large files. Finding a
backup solution that can ingest this mixed workload of data at full speed and that can scale to
massive capacities may be the biggest challenge in the Big Data backup market.

4.5 BIG DATA AND COMPLIANCE

Compliance issues are becoming a big concern in the data center, and these issues have a major
effect on how Big Data is protected, stored, accessed, and archived. Whether Big Data is going
to reside in the data warehouse or in some other more scalable data store remains unresolved for
most of the industry; it is an evolving paradigm. However, one thing is certain: Big Data is not
easily handled by the relational databases that the typical database administrator is used to
working with in the traditional enterprise database server environment. This means it is harder to
understand how compliance affects the data.

Big Data is transforming the storage and access paradigms to an emerging new world of
horizontally scaling, unstructured databases, which are better at solving some old business
problems through analytics. More important, this new world of file types and data is prompting
analysis professionals to think of new problems to solve, some of which have never been
attempted before. With that in mind, it becomes easy to see that a rebalancing of the database
landscape is about to commence, and data architects will finally embrace the fact that relational
databases are no longer the only tool in the tool kit.

This has everything to do with compliance. New data types and methodologies are still expected
to meet the legislative requirements placed on businesses by compliance laws. There will be no
excuses accepted and no passes given if a new data methodology breaks the law.

Preventing compliance from becoming the next Big Data nightmare is going to be the job of
security professionals. They will have to ask themselves some important questions and take into
account the growing mass of data, which are becoming increasingly unstructured and are
accessed from a distributed cloud of users and applications looking to slice and dice them in a
million and one ways. How will security professionals be sure they are keeping tabs on the
regulated information in all that mix?

Many organizations still have to grasp the importance of such areas as payment card industry and
personal health information compliance and are failing to take the necessary steps because the
Big Data elements are moving through the enterprise with other basic data. The trend seems to
be that as businesses jump into Big Data, they forget to worry about very specific pieces of
information that may be mixed into their large data stores, exposing them to compliance issues.

Health care probably provides the best example for those charged with compliance as they
examine how Big Data creation, storage, and flow work in their organizations. The move to

electronic health record systems, driven by the Health Insurance Portability and Accountability
Act (HIPAA) and other legislation, is causing a dramatic increase in the accumulation, access,
and inter-enterprise exchange of personal identifying information. That has already created a Big
Data problem for the largest health care providers and payers, and it must be solved to maintain
compliance.

The concepts of Big Data are as applicable to health care as they are to other businesses. The
types of data are as varied and vast as the devices collecting the data, and while the concept of
collecting and analyzing the unstructured data is not new, recently developed technologies make
it quicker and easier than ever to store, analyze, and manipulate these massive data sets.

Health care deals with these massive data sets using Big Data stores, which can span tens of
thousands of computers to enable enterprises, researchers, and governments to develop
innovative products, make important discoveries, and generate new revenue streams. The rapid
evolution of Big Data has forced vendors and architects to focus primarily on the storage,
performance, and availability elements, while security—which is often thought to diminish
performance—has largely been an afterthought.

In the medical industry, the primary problem is that unsecured Big Data stores are filled with
content that is collected and analyzed in real time and is often extraordinarily sensitive:
intellectual property, personal identifying information, and other confidential information. The
disclosure of this type of data, by either attack or human error, can be devastating to a company
and its reputation.

However, because this unstructured Big Data doesn’t fit into traditional, structured, SQL-based
relational databases, NoSQL, a new type of data management approach, has evolved. These
nonrelational data stores can store, manage, and manipulate terabytes, petabytes, and even
exabytes of data in real time.

No longer scattered in multiple federated databases throughout the enterprise, Big Data
consolidates information in a single massive database stored in distributed clusters and can be
easily deployed in the cloud to save costs and ease management. Companies may also move Big
Data to the cloud for disaster recovery, replication, load balancing, storage, and other purposes.

Unfortunately, most of the data stores in use today—including Hadoop, Cassandra, and
MongoDB—do not incorporate sufficient data security tools to provide enterprises with the
peace of mind that confidential data will remain safe and secure at all times. The need for
security and privacy of enterprise data is not a new concept. However, the development of Big
Data changes the situation in many ways. To date, those charged with network security have
spent a great deal of time and money on perimeter-based security mechanisms such as firewalls,
but perimeter enforcement cannot prevent unauthorized access to data once a criminal or a
hacker has entered the network.

Add to this the fact that most Big Data platforms provide little to no data-level security along
with the alarming truth that Big Data centralizes most critical, sensitive, and proprietary data in a
single logical data store, and it’s clear that Big Data requires big security.


The lessons learned by the health care industry show that there is a way to keep Big Data secure
and in compliance. A combination of technologies has been assembled to meet four important
goals:

1. Control access by process, not job function. Server and network administrators, cloud
administrators, and other employees often have access to more information than their jobs
require because the systems simply lack the appropriate access controls. Just because a user has
operating system–level access to a specific server does not mean that he or she needs, or should
have, access to the Big Data stored on that server.
2. Secure the data at rest. Most consumers today would not conduct an online transaction
without seeing the familiar padlock symbol or at least a certification notice designating that
particular transaction as encrypted and secure. So why wouldn’t you require the same data to be
protected at rest in a Big Data store? All Big Data, especially sensitive information, should
remain encrypted, whether it is stored on a disk, on a server, or in the cloud and regardless of
whether the cloud is inside or outside the walls of your organization (a minimal encryption sketch follows this list).
3. Protect the cryptographic keys and store them separately from the data. Cryptographic
keys are the gateway to the encrypted data. If the keys are left unprotected, the data are easily
compromised. Organizations—often those that have cobbled together their own encryption and
key management solution—will sometimes leave the key exposed within the configuration file or
on the very server that stores the encrypted data. This leads to the frightening reality that any
user with access to the server, authorized or not, can access the key and the data. In addition, that
key may be used for any number of other servers. Storing the cryptographic keys on a separate,
hardened server, either on the premises or in the cloud, is the best practice for keeping data safe
and an important step in regulatory compliance. The bottom line is to treat key security with as
much, if not greater, rigor than the data set itself.
4. Create trusted applications and stacks to protect data from rogue users. You may encrypt
your data to control access, but what about the user who has access to the configuration files that
define the access controls to those data? Encrypting more than just the data and hardening the
security of your overall environment—including applications, services, and configurations—
gives you peace of mind that your sensitive information is protected from malicious users and
rogue employees.
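The following plain Java sketch illustrates goals 2 and 3 above: a record is encrypted with AES-GCM before it is written to a data store, with the key treated as something that would normally be fetched from a separate, hardened key manager. Here the key is generated in-process only to keep the example self-contained, and the record content is invented.

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;
import java.util.Base64;

// A minimal sketch of encrypting data at rest with AES-GCM.
public class EncryptAtRest {
    public static void main(String[] args) throws Exception {
        // In a real deployment this key comes from an external key manager,
        // never from the server that stores the encrypted data.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        SecretKey key = keyGen.generateKey();

        byte[] iv = new byte[12];                 // unique IV per record
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal("patient-id=42, diagnosis=...".getBytes("UTF-8"));

        // Store the IV alongside the ciphertext; the key stays with the key manager.
        System.out.println("Encrypted record: " + Base64.getEncoder().encodeToString(ciphertext));

        // Decryption on an authorized read path uses the same key and IV.
        Cipher decipher = Cipher.getInstance("AES/GCM/NoPadding");
        decipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        System.out.println("Decrypted record: " + new String(decipher.doFinal(ciphertext), "UTF-8"));
    }
}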

There is still time to create and deploy appropriate security rules and compliance objectives. The
health care industry has helped to lay some of the groundwork. However, the slow development
of laws and regulations works in favor of those trying to get ahead on Big Data. Currently, many
of the laws and regulations have not addressed the unique challenges of data warehousing. Many
of the regulations do not address the rules for protecting data from different customers at
different levels.

For example, if a database has credit card data and health care data, do the PCI Data Security
Standard (PCI DSS) and HIPAA apply to the entire data store or only to the parts of the data store
that have their types of data? The answer is highly dependent on your interpretation of the
requirements and the way you have implemented the technology.

Similarly, social media applications that are collecting tons of unregulated yet potentially
sensitive data may not yet be a compliance concern. But they are still a security problem that if
not properly addressed now may be regulated in the future. Social networks are accumulating

massive amounts of unstructured data (a primary fuel for Big Data), but they are not yet
regulated, so this is not a compliance concern, although it remains a security concern.

Security professionals concerned about how things like Hadoop and NoSQL deployments are
going to affect their compliance efforts should take a deep breath and remember that the general
principles of data security still apply. The first principle is knowing where the data reside. With
the newer database solutions, there are automated ways of detecting data and triaging systems
that appear to have data they shouldn’t.

Once you begin to map and understand the data, opportunities should become evident that will
lead to automating and monitoring compliance and security through data warehouse
technologies. Automation offers the ability to decrease compliance and security costs and still
provide the higher levels of assurance, which validates where the data are and where they are
going.

Of course, automation does not solve every problem for security, compliance, and backup. There
are still some very basic rules that should be used to enable security while not derailing the value
of Big Data:

 Ensure that security does not impede performance or availability. Big Data is all
about handling volume while providing results, being able to deal with the velocity and
variety of data, and allowing organizations to capture, analyze, store, or move data in real
time. Security controls that limit any of these processes are a nonstarter for organizations
serious about Big Data.
 Pick the right encryption scheme. Some data security solutions encrypt at the file level
or lower, such as including specific data values, documents, or rows and columns. Those
methodologies can be cumbersome, especially for key management. File level or internal
file encryption can also render data unusable because many applications cannot analyze
encrypted data. Likewise, encryption at the operating system level, but without advanced
key management and process-based access controls, can leave Big Data woefully
insecure. To maintain the high levels of performance required to analyze Big Data,
consider a transparent data encryption solution optimized for Big Data.
 Ensure that the security solution can evolve with your changing
requirements. Vendor lock-in is becoming a major concern for many enterprises.
Organizations do not want to be held captive to a sole source for security, whether it is a
single-server vendor, a network vendor, a cloud provider, or a platform. The flexibility to
migrate between cloud providers and models based on changing business needs is a
requirement, and this is no different with Big Data technologies. When evaluating
security, you should consider a solution that is platform-agnostic and can work with any
Big Data file system or database, including Hadoop, Cassandra, and MongoDB.

4.6 THE INTELLECTUAL PROPERTY CHALLENGE

One of the biggest issues around Big Data is the concept of intellectual property (IP). First we
must understand what IP is, in its most basic form. There are many definitions available, but
basically, intellectual property refers to creations of the human mind, such as inventions, literary

and artistic works, and symbols, names, images, and designs used in commerce. Although this is
a rather broad description, it conveys the essence of IP.

With Big Data consolidating all sorts of private, public, corporate, and government data into a
large data store, there are bound to be pieces of IP in the mix: simple elements, such as
photographs, to more complex elements, such as patent applications or engineering diagrams.
That information has to be properly protected, which may prove to be difficult, since Big Data
analytics is designed to find nuggets of information and report on them.

Here is a little background: Between 1985 and 2010, the number of patents granted worldwide
rose from slightly less than 400,000 to more than 900,000. That’s an increase of more than 125
percent over one generation (25 years). Patents are filed and backed with IP rights (IPRs).

Technology is obviously pushing this growth forward, so it only makes sense that Big Data will
be used to look at IP and IP rights to determine opportunity. This should create a major concern
for companies looking to protect IP and should also be a catalyst to take action. Fortunately,
protecting IP in the realm of Big Data follows many of the same rules that organizations have
already come to embrace, so IP protection should already be part of the culture in any enterprise.

The same concepts just have to be expanded into the realm of Big Data. Some basic rules are as
follows:

 Understand what IP is and know what you have to protect. If all employees
understand what needs to be protected, they can better understand how to protect it and
whom to protect it from. Doing that requires that those charged with IP security in IT
(usually a computer security officer, or CSO) must communicate on an ongoing basis
with the executives who oversee intellectual capital. This may require meeting at least
quarterly with the chief executive, operating, and information officers and representatives
from HR, marketing, sales, legal services, production, and research and development
(R&D). Corporate leaders will be the foundation for protecting IP.
 Prioritize protection. CSOs with extensive experience normally recommend doing a risk
and cost-benefit analysis. This may require you to create a map of your company’s assets
and determine what information, if lost, would hurt your company the most. Then
consider which of those assets are most at risk of being stolen. Putting these two factors
together should help you figure out where to best allocate your protective efforts.
 Label. Confidential information should be labeled appropriately. If company data are
proprietary, note that on every log-in screen. This may sound trivial, but in court you may
have to prove that someone who was not authorized to take information had been
informed repeatedly. Your argument won’t stand up if you can’t demonstrate that you
made this clear.
 Lock it up. Physical as well as digital protection schemes are a must. Rooms that store
sensitive data should be locked. This applies to everything from the server farm to the file
room. Keep track of who has the keys, always use complex passwords, and limit
employee access to important databases.
 Educate employees. Awareness training can be effective for plugging and preventing IP
leaks, but it must be targeted to the information that a specific group of employees needs
to guard. Talk in specific terms about something that engineers or scientists have invested

a lot of time in, and they will pay attention. Humans are often the weakest link in the
defense chain. This is why an IP protection effort that counts on firewalls and copyrights
but ignores employee awareness and training is doomed to fail.
 Know your tools. A growing variety of software tools are available for tracking
documents and other IP stores. The category of data loss protection (or data leakage
prevention) grew quickly in the middle of the first decade of this century and now shows
signs of consolidation into other security tool sets. Those tools can locate sensitive
documents and keep track of how they are being used and by whom.
 Use a holistic approach. You must take a panoramic view of security. If someone is
scanning the internal network, your internal intrusion detection system goes off, and
someone from IT calls the employee who is doing the scanning and says, "Stop doing
that." The employee offers a plausible explanation, and that’s the end of it. Later the
night watchman sees an employee carrying out protected documents, whose explanation,
when stopped, is "Oops, I didn’t realize that got into my briefcase." Over time, the HR
group, the audit group, the individual’s colleagues, and others all notice isolated
incidents, but no one puts them together and realizes that all these breaches were
perpetrated by the same person. This is why communication gaps between infosecurity
and corporate security groups can be so harmful. IP protection requires connections and
communication among all the corporate functions. The legal department has to play a role
in IP protection, and so does HR, IT, R&D, engineering, and graphic design. Think
holistically, both to protect and to detect.
 Use a counterintelligence mind-set. If you were spying on your own company, how
would you do it? Thinking through such tactics will lead you to consider protecting
phone lists, shredding the papers in the recycling bins, convening an internal council to
approve your R&D scientists’ publications, and coming up with other worthwhile ideas
for your particular business.

These guidelines can be applied to almost any information security paradigm that is
geared toward protecting IP. The same guidelines can be used when designing IP
protection for a Big Data platform.

Cyber Defense

 Cyber attacks involve advanced and sophisticated techniques to infiltrate corporate
networks and enterprise systems.

 Types of attacks include advanced malware, zero day attacks and advanced persistent
threats.

 Advance warning about attackers and intelligence about the threat landscape is
considered by many security leaders to be essential features in security technologies.

 The purpose of the Big Data Analytics in Cyber Defense study sponsored by Teradata
and conducted by Ponemon Institute is to learn about organizations’ cyber security

defenses and the use of big data analytics to become more efficient in recognizing the
patterns that represent network threats.

 Big data analytics in security involves the ability to gather massive amounts of digital
information to analyze, visualize and draw insights that can make it possible to predict
and stop cyber attacks.

 The study looks at the awareness among IT and IT security practitioners about the new
data management and analytic technologies now available to help organizations become
more proactive and intelligent about detecting and stopping threats.

 In this study, we surveyed 706 IT and IT security practitioners in financial services,
manufacturing and government with an average of 10 years of experience.

 All respondents are familiar with their organization’s defense against cyber security
attacks and have some level of responsibility for managing the cyber security activities
within their organization.

Big Data to Defend Against Cyber Security Threats

While the theft's full damage is still unknown, the multipronged heist is another indicator that
cyberattacks are wreaking increasingly greater damage. In Ponemon Institute's upcoming 2013
Cost of Cyber Crime Study, the firm reports this year's average annualized cost of cybercrime
was $7.2 million per company polled in its study — a 30 percent increase in mean value over last
year. The report also says successful cyberattacks increased 20 percent over last year, with each
company surveyed experiencing 1.4 successful attacks per week.
"We used to make statements, such as 'I have a firewall; I'm protected,' or 'I have antivirus
software; I'm protected,'" says Todd Pedersen, a cybersecurity lead for CSC. "Now,
the conversation is less about preventing an attack, threat or exposure, and more about how
quickly you can detect that an attack is happening."

Big Data-guided defenses

There's a growing demand for security information and event management (SIEM) technologies
and services, which gather and analyze security event big data that is used to manage
threats. Increasing numbers of regulations and mandates generated throughout the globe also are
pushing the adoption of SIEM technologies and services.

"Both governments and industries are introducing more and more regulations and mandates that
require the use of better data protection and security controls to help guard systems, information
and individuals," says Matthew O'Brien, a global cybersecurity expert for CSC.


In the United States, the Federal Information Security Management Act, Health Insurance
Portability and Accountability Act, Sarbanes-Oxley Act, and the Department of Homeland
Security's Critical Infrastructure Protection guidelines, to name a few, all have requirements tied
to collecting and logging information, events and activities that occur within an organization's
environment — requirements that SIEM-related technologies and services help organizations
meet.
For example, every second, more than 300,000 events generated by CSC and its customers run
through CSC's Global Security Operations Centers.
"SIEM gives us the ability to take this massive amount of data and bring it all back to a central
place, where it's combined with the other information we get from numerous security
technologies," says Pedersen. "That gives us the ability to detect things that no individual
technology in and of itself would have picked up, and create a picture to analyze, investigate and
find security-related issues."
New levels of awareness

This SIEM capability also has become critical as organized crime, along with some nations'
armed forces and intelligence services, moves center stage in the cyberarena,
launching weapons-grade cyberattacks and advanced persistent threats.

At times these threats are global; at other times, attackers aim for specific industries. Ponemon's
report says, "The average annualized cost of cybercrime appears to vary by industry segment,
where organizations in defense, financial services, and energy and utilities experience
substantially higher cybercrime costs than organizations in retail, media and consumer products."

"SIEM helps us create an environment that allows us to use a broad range of tools, some of
which we select for a specific customer environment, and yet accrue data in a common
environment and use that common environment for correlation and analysis," says Pedersen.

Increasing enterprise system complexity also creates a driver for SIEM. Today's organizations
are adding greater numbers of connections, also known as endpoints, to their systems, either due
to incorporating mobile devices, the bring-your-own-device trend, expanding supply chains, or a
desire to link their IT systems with their industrial control systems.

"The number of integration points with other technologies and the processes that support them
today can be overwhelming," says O'Brien. "As we ask our systems to do more, they
also become more vulnerable, which means we need a level of awareness that wasn't required
before."


UNIT 5
Case Studies: MapReduce: Simplified Data Processing on Large Clusters- RDBMS to NoSQL:
Reviewing Some Next-Generation Non-Relational Database's - Analytics: The real-world use of big data
- New Analysis Practices for Big Data.

Course Objectives:
 The students are to understand the concepts of RDBMS to NoSQL
 To understand real time applications of Big data
Course Outcomes:
 The students can use the tools of Big Data
 The students will learn how to process data on large clusters
 The students will be able to turn Big Data into big money

MapReduce:

MapReduce is a framework using which we can write applications to process huge
amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.

MapReduce is a processing technique and a programming model for distributed computing
based on Java. The MapReduce algorithm contains two important tasks, namely Map and
Reduce. Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs). The reduce task then takes
the output from a map as its input and combines those data tuples into a smaller set of
tuples. As the name MapReduce implies, the reduce task is always
performed after the map job.

The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing primitives are
called mappers and reducers. Decomposing a data processing application into mappers and
reducers is sometimes nontrivial. But, once we write an application in the MapReduce form,
scaling the application to run over hundreds, thousands, or even tens of thousands of
machines in a cluster is merely a configuration change. This simple scalability is what has
attracted many programmers to use the MapReduce model.
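Before looking at the Hadoop terminology and the full example program later in this unit, the following plain Java sketch shows the map, shuffle, and reduce flow in memory. It is only an illustration of the key/value idea described above, not Hadoop code, and the input lines are invented.

import java.util.*;

// An in-memory illustration of map -> group-by-key -> reduce (word count).
public class MapReduceIdea {
    public static void main(String[] args) {
        List<String> input = Arrays.asList("big data is big", "data is valuable");

        // Map: break each record into (word, 1) tuples.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : input) {
            for (String word : line.split("\\s+")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle: group the intermediate values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }

        // Reduce: combine the values of each key into a smaller result.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = e.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(e.getKey() + "\t" + sum);
        }
    }
}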


Terminology used in MapReduce:

 PayLoad - Applications implement the Map and the Reduce functions, and form the
core of the job.

 Mapper - Mapper maps the input key/value pairs to a set of intermediate key/value
pair.

 NameNode - Node that manages the Hadoop Distributed File System (HDFS).

 DataNode - Node where data is presented in advance before any processing takes
place.

 MasterNode - Node where JobTracker runs and which accepts job requests from
clients.

 SlaveNode - Node where Map and Reduce program runs.

 JobTracker - Schedules jobs and tracks the assign jobs to Task tracker.

 Task Tracker - Tracks the task and reports status to JobTracker.

 Job - A program is an execution of a Mapper and Reducer across a dataset.

 Task - An execution of a Mapper or a Reducer on a slice of data.

 Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.

Working of Map reduce (Algorithm):

 Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.

 MapReduce program executes in three stages, namely map stage, shuffle stage, and
reduce stage.

o Map stage : The map or mapper’s job is to process the input data. Generally
the input data is in the form of file or directory and is stored in the Hadoop

file system (HDFS). The input file is passed to the mapper function line by
line. The mapper processes the data and creates several small chunks of data.

o Reduce stage : This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from
the mapper. After processing, it produces a new set of output, which will be
stored in the HDFS.

 During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.

 The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the nodes.

 Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.

 After completion of the given tasks, the cluster collects and reduces the data to form
an appropriate result, and sends it back to the Hadoop server.

Example Scenario


Given below is the data regarding the electrical consumption of an organization. It contains
the monthly electrical consumption and the annual average for various years.

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg

1979 23 23 2 43 24 25 26 26 26 26 25 26 25

1980 26 27 28 28 28 30 31 31 31 30 30 30 29

1981 31 32 32 32 33 34 35 36 36 34 34 34 34

1984 39 38 39 39 39 41 42 43 40 39 38 38 40

1985 38 39 39 39 39 41 41 41 00 40 39 39 45

If the above data is given as input, we have to write applications to process it and produce
results such as finding the year of maximum usage, year of minimum usage, and so on. This
is a walkover for programmers with a finite number of records. They will simply write the
logic to produce the required output, and pass the data to the application written.

But think of the data representing the electrical consumption of all the large-scale industries
of a particular state since its formation.

When we write applications to process such bulk data,

 They will take a lot of time to execute.


 There will be heavy network traffic when we move data from source to network
server and so on.
To solve these problems, we have the MapReduce framework.

Input Data


The above data is saved as sample.txt and given as input. The input file looks as shown
below.

1979   23   23   2    43   24   25   26   26   26   26   25   26   25
1980   26   27   28   28   28   30   31   31   31   30   30   30   29
1981   31   32   32   32   33   34   35   36   36   34   34   34   34
1984   39   38   39   39   39   41   42   43   40   39   38   38   40
1985   38   39   39   39   39   41   41   41   00   40   39   39   45

Example Program
Given below is the program applied to the sample data using the MapReduce framework.

package hadoop;

import java.util.*;
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class ProcessUnits
{
   //Mapper class
   public static class E_EMapper extends MapReduceBase implements
   Mapper<LongWritable,  /*Input key Type */
          Text,          /*Input value Type*/
          Text,          /*Output key Type*/
          IntWritable>   /*Output value Type*/
   {
      //Map function: emits (year, last column of the record) pairs
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException
      {
         String line = value.toString();
         String lasttoken = null;
         StringTokenizer s = new StringTokenizer(line, "\t");
         String year = s.nextToken();

         while (s.hasMoreTokens())
         {
            lasttoken = s.nextToken();
         }
         int avgprice = Integer.parseInt(lasttoken);
         output.collect(new Text(year), new IntWritable(avgprice));
      }
   }

   //Reducer class
   public static class E_EReduce extends MapReduceBase implements
   Reducer<Text, IntWritable, Text, IntWritable>
   {
      //Reduce function: emits the years whose value exceeds the threshold of 30 units
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException
      {
         int maxavg = 30;
         int val = Integer.MIN_VALUE;

         while (values.hasNext())
         {
            if ((val = values.next().get()) > maxavg)
            {
               output.collect(key, new IntWritable(val));
            }
         }
      }
   }

   //Main function: configures the job and submits it to the cluster
   public static void main(String args[]) throws Exception
   {
      JobConf conf = new JobConf(ProcessUnits.class);

      conf.setJobName("max_eletricityunits");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(E_EMapper.class);
      conf.setCombinerClass(E_EReduce.class);
      conf.setReducerClass(E_EReduce.class);
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
   }
}

Save the above program as ProcessUnits.java. The compilation and execution of the
program is explained below.

Compilation and Execution of Process Units Program


Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).

Follow the steps given below to compile and execute the above program.

Step 1
The following command is to create a directory to store the compiled java classes.

$ mkdir units

Step 2
Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce
program. Visit the following link
http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/1.2.1 to download the jar.
Let us assume the downloaded folder is /home/hadoop/.


Step 3
The following commands are used for compiling the ProcessUnits.javaprogram and
creating a jar for the program.

$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java

$ jar -cvf units.jar -C units/ .

Step 4
The following command is used to create an input directory in HDFS.

$HADOOP_HOME/bin/hadoop fs -mkdir input_dir

Step 5
The following command is used to copy the input file named sample.txt in the input
directory of HDFS.

$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir

Step 6
The following command is used to verify the files in the input directory.

$HADOOP_HOME/bin/hadoop fs -ls input_dir/

Step 7
The following command is used to run the ProcessUnits application by taking the input files
from the input directory.

$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir

Wait for a while until the file is executed. After execution, as shown below, the output will
contain the number of input splits, the number of Map tasks, the number of reducer tasks,
etc.

14/10/31 06:02:52 INFO mapreduce.Job: Job job_1414748220717_0002 completed successfully
14/10/31 06:02:52 INFO mapreduce.Job: Counters: 49

File System Counters

FILE:Number of bytes read=61


FILE:Number of bytes written=279400
FILE:Number of read operations=0
FILE:Number of large read operations=0
FILE:Number of write operations=0
HDFS:Number of bytes read=546
HDFS:Number of bytes written=40
HDFS:Number of read operations=9
HDFS:Number of large read operations=0
HDFS:Number of write operations=2
Job Counters

Launched map tasks=2


Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=146137
Total time spent by all reduces in occupied slots (ms)=441
Total time spent by all map tasks (ms)=14613
Total time spent by all reduce tasks (ms)=44120
Total vcore-seconds taken by all map tasks=146137
Total vcore-seconds taken by all reduce tasks=44120


Total megabyte-seconds taken by all map tasks=149644288
Total megabyte-seconds taken by all reduce tasks=45178880

Map-Reduce Framework

Map input records=5


Map output records=5
Map output bytes=45
Map output materialized bytes=67
Input split bytes=208
Combine input records=5
Combine output records=5
Reduce input groups=5
Reduce shuffle bytes=6
Reduce input records=5
Reduce output records=5
Spilled Records=10
Shuffled Maps=2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=948
CPU time spent (ms)=5160
Physical memory (bytes) snapshot=47749120
Virtual memory (bytes) snapshot=2899349504
Total committed heap usage (bytes)=277684224

File Output Format Counters
Bytes Written=40

Step 8
The following command is used to verify the resultant files in the output folder.

$HADOOP_HOME/bin/hadoop fs -ls output_dir/


Step 9
The following command is used to see the output in Part-00000 file. This file is generated
by HDFS.

$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000

Below is the output generated by the MapReduce program.

1981   34
1984   40
1985   45

Step 10
The following command is used to copy the output folder from HDFS to the local file
system for analyzing.

$HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop

Important Commands
All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command.
Running the Hadoop script without any arguments prints the description for all commands.

Usage: hadoop [--config confdir] COMMAND

The following table lists the options available and their description.

Options Description

namenode -format Formats the DFS filesystem.

secondarynamenode Runs the DFS secondary namenode.

namenode Runs the DFS namenode.

datanode Runs a DFS datanode.

dfsadmin Runs a DFS admin client.

mradmin Runs a Map-Reduce admin client.

fsck Runs a DFS filesystem checking utility.

fs Runs a generic filesystem user client.

balancer Runs a cluster balancing utility.

oiv Applies the offline fsimage viewer to an fsimage.

fetchdt Fetches a delegation token from the NameNode.

jobtracker Runs the MapReduce job Tracker node.

pipes Runs a Pipes job.

tasktracker Runs a MapReduce task Tracker node.

historyserver Runs job history servers as a standalone daemon.

job Manipulates the MapReduce jobs.

queue Gets information regarding JobQueues.

version Prints the version.

jar <jar> Runs a jar file.

distcp<srcurl><desturl> Copies file or directories recursively.

distcp2 <srcurl><desturl> DistCp version 2.

archive -archiveName NAME -p <parent path> <src>* <dest> Creates a hadoop archive.

classpath Prints the class path needed to get the Hadoop jar and
the required libraries.

daemonlog Get/Set the log level for each daemon

How to Interact with MapReduce Jobs


Usage: hadoop job [GENERIC_OPTIONS]

The following are the Generic Options available in a Hadoop job.

GENERIC_OPTIONS Description

-submit <job-file> Submits the job.

-status <job-id> Prints the map and reduce completion percentage and all
job counters.

-counter <job-id> <group-name> <counter-name> Prints the counter value.

-kill <job-id> Kills the job.

-events <job-id> <from-event-#> <#-of-events> Prints the events' details received by jobtracker for the
given range.

-history [all] <jobOutputDir> Prints job details, failed and killed tip details. More details
about the job, such as successful tasks and task attempts made for each task, can be
viewed by specifying the [all] option.

-list [all] Displays all jobs. -list displays only jobs which are yet to
complete.

-kill-task <task-id> Kills the task. Killed tasks are NOT counted against failed
attempts.

-fail-task <task-id> Fails the task. Failed tasks are counted against failed
attempts.

-set-priority <job-id><priority> Changes the priority of the job. Allowed priority values
are VERY_HIGH, HIGH, NORMAL, LOW,
VERY_LOW

To see the status of job


$ $HADOOP_HOME/bin/hadoop job -status <JOB-ID>
e.g.
$ $HADOOP_HOME/bin/hadoop job -status job_201310191043_0004


To see the history of job output-dir


$ $HADOOP_HOME/bin/hadoop job -history <DIR-NAME>
e.g.
$ $HADOOP_HOME/bin/hadoop job -history /user/expert/output

To kill the job


$ $HADOOP_HOME/bin/hadoop job -kill <JOB-ID>
e.g.
$ $HADOOP_HOME/bin/hadoop job -kill job_201310191043_0004

RDBMS to NoSQL:

What is RDBMS?
RDBMS stands for Relational Database Management System. RDBMS is the basis
for SQL, and for all modern database systems like MS SQL Server, IBM DB2, Oracle,
MySQL, and Microsoft Access.

Challenges of RDBMS
 RDBMS assumes a well-defined structure of data and assumes that the data is largely
uniform.
 It needs the schema of your application and its properties (columns, types, etc.) to be
defined up-front before building the application. This does not match well with the agile
development approaches for highly dynamic applications.
 As the data starts to grow larger, you have to scale your database vertically, i.e. adding
more capacity to the existing servers.

A NoSQL (originally referring to “non SQL” or “non relational”) database provides a
mechanism for storage and retrieval of data that is modeled in means other than the tabular
relations used in relational databases (RDBMS). It encompasses a wide variety of different
database technologies that were developed in response to a rise in the volume of data stored
about users, objects and products, the frequency with which this data is accessed, and
performance and processing needs. Generally, NoSQL databases are structured in a key-value
pair, graph database, document-oriented or column-oriented structure.

As an example, consider that you have a blogging application that stores user blogs.
Now suppose that you have to incorporate some new features in your application such as
users liking these blog posts or commenting on them or liking these comments. With a

typical RDBMS implementation, this will need a complete overhaul of your existing
database design. However, if you use NoSQL in such scenarios, you can easily modify your
data structure to match these agile requirements. With NoSQL you can directly start inserting
this new data in your existing structure without creating any new pre-defined columns or pre-
defined structure.
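A sketch of the blogging scenario above using the MongoDB Java driver is shown below. It assumes the driver is on the classpath and a MongoDB server is running locally; the database, collection, and field names are invented for illustration.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.Arrays;

// Inserting schema-less blog documents; later documents can simply carry
// extra fields (likes, comments) without any schema change.
public class BlogPosts {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> posts =
                    client.getDatabase("blog").getCollection("posts");

            // Original document: just a title and a body.
            Document post = new Document("title", "Why NoSQL?")
                    .append("body", "Schema-less storage for agile applications.");
            posts.insertOne(post);

            // New requirement: likes and comments are simply new fields on new
            // documents; no ALTER TABLE and no predefined columns are needed.
            Document richerPost = new Document("title", "Adding features")
                    .append("body", "Likes and comments arrive without a schema change.")
                    .append("likes", 3)
                    .append("comments", Arrays.asList(
                            new Document("user", "alice").append("text", "Nice post!")));
            posts.insertOne(richerPost);
        }
    }
}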

Benefits of NoSQL over RDBMS

Schema Less:
NoSQL databases being schema-less do not define any strict data structure.

Dynamic and Agile:


NoSQL databases have good tendency to grow dynamically with changing requirements. It
can handle structured, semi-structured and unstructured data.

Scales Horizontally:
In contrast to SQL databases which scale vertically, NoSQL scales horizontally by adding
more servers and using concepts of sharding and replication. This behavior of NoSQL fits
with the cloud computing services such as Amazon Web Services (AWS) which allows you
to handle virtual servers which can be expanded horizontally on demand.

Better Performance:
All the NoSQL databases claim to deliver better and faster performance as compared to
traditional RDBMS implementations.

Talking about the limitations, since NoSQL is an entire set of databases (and not a single
database), the limitations differ from database to database. Some of these databases do not
support ACID transactions while some of them might be lacking in reliability. But each one
of them has their own strengths due to which they are well suited for specific requirements.

Types of NoSQL Databases

Document Oriented Databases:


Document oriented databases treat a document as a whole and avoid splitting a document in
its constituent name/value pairs. At a collection level, this allows for putting together a
diverse set of documents into a single collection. Document databases allow indexing of
documents on the basis of not only its primary identifier but also its properties. Different
open-source document databases are available today but the most prominent among the
available options are MongoDB and CouchDB. In fact, MongoDB has become one of the
most popular NoSQL databases.


Graph Based Databases:


A graph database uses graph structures with nodes, edges, and properties to represent and
store data. By definition, a graph database is any storage system that provides index-free
adjacency. This means that every element contains a direct pointer to its adjacent element
and no index lookups are necessary. General graph databases that can store any graph are
distinct from specialized graph databases such as triple-stores and network databases.
Indexes are not needed for traversing the graph.
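The following plain Java sketch illustrates index-free adjacency: each node keeps direct references to its neighbours, so a traversal simply follows pointers. The node names and edges are invented; real graph databases such as Neo4j add persistence, property indexing, and a query language on top of this idea.

import java.util.*;

// A minimal graph where traversal follows direct pointers between nodes.
public class TinyGraph {
    static class Node {
        final String name;
        final List<Node> neighbours = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        Node alice = new Node("alice");
        Node bob = new Node("bob");
        Node carol = new Node("carol");
        alice.neighbours.add(bob);      // alice -> bob
        bob.neighbours.add(carol);      // bob -> carol

        // Traverse "friends of friends" of alice by following pointers.
        for (Node friend : alice.neighbours) {
            for (Node fof : friend.neighbours) {
                System.out.println(alice.name + " -> " + friend.name + " -> " + fof.name);
            }
        }
    }
}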

Column Based Databases:


The column-oriented storage allows data to be stored effectively. It avoids consuming space
when storing nulls by simply not storing a column when a value doesn’t exist for that
column. Each unit of data can be thought of as a set of key/value pairs, where the unit itself
is identified with the help of a primary identifier, often referred to as the primary key.
Bigtable and its clones tend to call this primary key the row-key.
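A minimal sketch of the column-oriented model in plain Java is shown below: each row-key maps only to the columns that actually hold values, so a missing value simply is not stored. The row-keys and column names are invented for illustration.

import java.util.*;

// row-key -> (column name -> value); absent columns are simply not stored.
public class TinyColumnStore {
    public static void main(String[] args) {
        Map<String, Map<String, String>> table = new TreeMap<>();

        Map<String, String> row1 = new TreeMap<>();
        row1.put("name", "Alice");
        row1.put("email", "alice@example.com");
        row1.put("city", "Chennai");
        table.put("user:1001", row1);

        Map<String, String> row2 = new TreeMap<>();
        row2.put("name", "Bob");          // no "email" column for this row
        row2.put("city", "Puducherry");
        table.put("user:1002", row2);

        for (Map.Entry<String, Map<String, String>> row : table.entrySet()) {
            System.out.println(row.getKey() + " -> " + row.getValue());
        }
    }
}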

KeyValue Databases:
The key of a key/value pair is a unique value in the set and can be easily looked up to access
the data. Key/value pairs are of varied types: some keep the data in memory and some
provide the capability to persist the data to disk. A simple, yet powerful, key/value store is
Oracle’s Berkeley DB.
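The following plain Java sketch illustrates the key/value model: every value is reached by a single lookup on its unique key. The keys and values are invented; real stores such as Berkeley DB or Redis add persistence, replication, and expiry on top of this access pattern.

import java.util.concurrent.ConcurrentHashMap;

// A minimal in-memory key/value store showing the basic access model.
public class TinyKeyValueStore {
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    public void put(String key, String value) { store.put(key, value); }
    public String get(String key)             { return store.get(key); }

    public static void main(String[] args) {
        TinyKeyValueStore kv = new TinyKeyValueStore();
        kv.put("session:42", "{\"user\":\"alice\",\"cart\":[\"book\",\"pen\"]}");
        System.out.println(kv.get("session:42"));   // direct lookup by key
    }
}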

Popular NoSQL Databases:


Let us summarize some popular NoSQL databases that fall into the above categories
respectively.

 Document Oriented Databases – MongoDB, CouchDB, Amazon SimpleDB, etc.
 Graph Based Databases – Neo4j, OrientDB, Facebook Open Graph, FlockDB, etc.
 Column Based Databases – HBase, Cassandra, Hypertable, etc.
 Key Value Databases – Membase, Redis, MemcacheDB, etc.


TWO MARKS

UNIT-I

1. What is big data?

Big data means really a big data; it is a collection of large datasets that cannot be processed using traditional
computing techniques. Big data is not merely a data; rather it has become a complete subject, which involves
various tools, techniques and frameworks.

2. What are the sources of big data?


 Enterprise Data
 Transactional Data
 Social media
 Activity generated

3. What are the Dimensions of Big Data or Characteristics of Big Data?

 Volume. Big Data comes in one size: large. Enterprises are awash with data, easily amassing
terabytes and even petabytes of information.
 Variety. Big Data extends beyond structured data to include unstructured data of all varieties: text,
audio, video, click streams, log files, and more.
 Veracity. The massive amounts of data collected for Big Data purposes can lead to statistical errors
and misinterpretation of the collected information. Purity of the information is critical for value.
 Velocity. Often time sensitive, Big Data must be used as it is streaming into the enterprise in order to
maximize its value to the business, but it must also still be available from the archival sources as
well.

4. What is Big Data analytics?

Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn,
leads to smarter business moves, more efficient operations, higher profits and happier customers.

5.Advantages of Big data?

 Cost reduction
 Faster, better decision making
 Competitive Advantage
 New business opportunities
 New products and services


6. What are the Big Data Challenges?

 Capturing data
 Storage
 Curation
 Searching
 Sharing
 Transfer
 Analysis
 Presentation.

7. What are the Issues of Big Data?

 Making tools easier to use. Hadoop stack and NoSQLs really do require programming knowledge to
unlock their power.
 Getting quicker answers across large data sets. We can already get them in "acceptable" amounts of time;
it's about getting that 3-hour query down to 5 minutes or less. Apache Impala (incubating) is a good
example of work in this space.

8. List out the Future of Big Data.

 Data volumes will continue to grow. There's absolutely no question that we will continue generating
larger and larger volumes of data, especially considering that the number of handheld devices and
Internet-connected devices is expected to grow exponentially.
 Ways to analyse data will improve. While SQL is still the standard, Spark is emerging as a
complementary tool for analysis and will continue to grow, according to Ovum.
 More tools for analysis (without the analyst) will emerge. Microsoft and Salesforce
both recently announced features to let non-coders create apps to view business data.
 Prescriptive analytics will be built in to business analytics software.

9.Types of Big data tools and Classification

 Data Storage and Management


 Data Cleaning
 Data Mining
 Data Analysis
 Data Visualization
 Data Integration
 Data Languages

10. What is hadoop?

The name Hadoop has become synonymous with big data. It's an open-source software framework for
distributed storage of very large datasets on computer clusters. All that means you can scale your data up and
down without having to worry about hardware failures. Hadoop provides massive amounts of storage for any
kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.


11.What is cloudera?

Cloudera is essentially a brand name for Hadoop with some extra services stuck on. They can help your
business build an enterprise data hub, to allow people in your organization better access to the data you are
storing

12. Types of Data visualization tools?

 Tableau
 Silk
 Cartodb
 Chartio
 Data wrapper

13. What is MongoDB?

MongoDB is the modern, start-up approach to databases. Think of it as an alternative to relational
databases. It's good for managing data that changes frequently or data that is unstructured or
semi-structured.

14.What is R Language?

R is a language for statistical computing and graphics. If the data mining and statistical software listed
above doesn't quite do what you want it to, learning R is the way forward. In fact, if you're planning
on being a data scientist, knowing R is a requirement.
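A tiny illustration of the statistical computing and graphics R is meant for; the marks vector below is made up.

# Basic statistical computing and graphics in R (sample data is invented)
marks <- c(56, 72, 88, 64, 91, 77)

print(mean(marks))     # average
print(sd(marks))       # standard deviation
print(summary(marks))  # five-number summary

# Graphics: a histogram of the sample
hist(marks, main = "Distribution of marks", xlab = "Marks")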

UNIT2

1. What is big data analytics?

Big data analytics is the process of examining large data sets containing a variety of data types -- i.e.,
big data -- to uncover hidden patterns, unknown correlations, market trends, customer preferences
and other useful business information.

2. List out the Features of R-Environment

• An effective data handling and storage facility,

• A suite of operators for calculations on arrays, in particular matrices,

• A large, coherent, integrated collection of intermediate tools for data analysis,

• Graphical facilities for data analysis and display either directly at the computer or on hardcopy

• A well developed, simple and effective programming language (called 'S') which includes
conditionals, loops, user defined recursive functions and input and output facilities. (Indeed most of
the system supplied functions are themselves written in the S language.)

3.Define Environment?


The term “environment” is intended to characterize it as a fully planned and coherent system, rather
than an incremental accretion of very specific and inflexible tools, as is frequently the case with
other data analysis software.

4.List some features of R?

• R is a well-developed, simple and effective programming language which includes


conditionals, loops, user defined recursive functions and input and output facilities.
• R has an effective data handling and storage facility,
• R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
• R provides a large, coherent and integrated collection of tools for data analysis.
• R provides graphical facilities for data analysis and display either directly at the
computer or for printing on paper.

5.What are different types of R-objects?

• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames
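A minimal R sketch that creates one object of each type listed above; all values are illustrative.

v  <- c(2, 4, 6)                          # Vector
l  <- list(1, "a", TRUE, c(5, 6))         # List: elements of mixed types
m  <- matrix(1:6, nrow = 2, ncol = 3)     # Matrix: two-dimensional
a  <- array(1:24, dim = c(2, 3, 4))       # Array: any number of dimensions
f  <- factor(c("high", "low", "high"))    # Factor: stores distinct levels as labels
df <- data.frame(id = 1:3,                # Data frame: columns of different modes
                 name = c("A", "B", "C"),
                 passed = c(TRUE, FALSE, TRUE))

print(class(v)); print(class(m)); print(class(f)); print(class(df))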

6.Define List.

A list is an R-object which can contain many different types of elements inside it like vectors,
functions and even another list inside it.

7.What are Factors?

Factors are the R-objects which are created using a vector. It stores the vector along with the distinct
values of the elements in the vector as labels. The labels are always character irrespective of whether
the input vector is numeric, character or Boolean. They are useful in statistical modeling. Factors are
created using the factor() function, and the nlevels() function gives the count of levels.
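A short illustrative example of factor() and nlevels(); the colour values are made up.

colours <- c("green", "blue", "green", "red", "blue")
f <- factor(colours)      # the distinct values become the levels

print(f)                  # the data plus: Levels: blue green red
print(levels(f))          # "blue" "green" "red"
print(nlevels(f))         # 3 -- the count of levels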

8.What are Data Frames?

Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain
different modes of data. The first column can be numeric while the second column can be character
and the third column can be logical. It is a list of vectors of equal length. Data Frames are created
using the data.frame() function.
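A small illustrative data frame whose columns have different modes; the employee values are invented.

emp <- data.frame(
  emp_id   = c(1, 2, 3),                    # numeric column
  emp_name = c("Rick", "Dan", "Michelle"),  # character column
  is_perm  = c(TRUE, FALSE, TRUE)           # logical column
)

print(emp)
str(emp)                 # structure: one mode per column
print(emp$emp_name)      # extract a single column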

9.What are R Functions?

A function is a set of statements organized together to perform a specific task. R has a large number
of in-built functions and the user can create their own functions.
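A minimal sketch showing a built-in function call and a user-defined function.

# Built-in function
print(seq(1, 10, by = 2))

# User-defined function: returns the square of whatever is passed in
square <- function(x) {
  return(x^2)
}

print(square(4))       # 16
print(square(1:5))     # 1 4 9 16 25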


10.What is R- Data Reshaping?

Data Reshaping in R is about changing the way data is organized into rows and columns. Most of the
time, data processing in R is done by taking the input data as a data frame. It is easy to extract data
from the rows and columns of a data frame, but there are situations when we need the data frame in a
format that is different from the format in which we received it. R has many functions to split, merge
and change the rows to columns and vice-versa in a data frame.
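A brief sketch of typical reshaping calls (cbind() and rbind() to join columns and rows, merge() to join two data frames), using made-up data.

city  <- c("Tampa", "Seattle", "Hartford")
state <- c("FL", "WA", "CT")
addresses <- as.data.frame(cbind(city, state))   # bind vectors column-wise

extra <- data.frame(city = "Denver", state = "CO")
all_addresses <- rbind(addresses, extra)          # add a row
print(all_addresses)

# merge() joins two data frames on their common column(s)
sales <- data.frame(city = c("Tampa", "Denver"), revenue = c(120, 95))
print(merge(all_addresses, sales, by = "city"))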

11. What are R – XML Files?

XML is a file format which shares both the file format and the data on the World Wide Web,
intranets, and elsewhere using standard ASCII text. It stands for Extensible Markup Language
(XML). Similar to HTML, it contains markup tags.
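A hedged sketch of reading such a file in R, assuming the add-on XML package is installed; the file content below is made up for illustration.

# install.packages("XML")   # assumed to be available
library(XML)

# Create a tiny XML file just for this example
writeLines('<records>
  <employee><id>1</id><name>Rick</name></employee>
  <employee><id>2</id><name>Dan</name></employee>
</records>', "input.xml")

doc  <- xmlParse("input.xml")     # parse the markup into a document tree
root <- xmlRoot(doc)
print(xmlSize(root))              # number of <employee> nodes

df <- xmlToDataFrame("input.xml") # flatten the uniform records into a data frame
print(df)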
12. What is Data Modeling?

• Data modeling is a useful technique to manage a workflow for various entities and for
making a sequential workflow in order to have a successful completion of a task.
• For Hadoop and its big data model, we need to have a comprehensive study before
implementing any execution task and setting up any progressive environment.

13.What are operators in R ?

An operator is a symbol that tells the compiler to perform specific mathematical or logical
manipulations. R language is rich in built-in operators and provides the following types of operators.

14.List different Types of Operators in R?

We have the following types of operators in R programming −

• Arithmetic Operators

• Relational Operators

• Logical Operators

• Assignment Operators

• Miscellaneous Operators
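A quick sketch touching each operator family listed above; the vectors are illustrative.

a <- c(2, 4, 6)          # assignment operator (<- ; = and -> also work)
b <- c(1, 4, 8)

print(a + b)             # arithmetic: 3 8 14
print(a %% b)            # arithmetic: element-wise modulus
print(a > b)             # relational: TRUE FALSE FALSE
print(a == b)            # relational: FALSE TRUE FALSE
print((a > 2) & (b > 2)) # logical: element-wise AND

# miscellaneous operators
print(1:5)               # colon operator builds a sequence
print(4 %in% a)          # TRUE -- membership test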

UNIT-III

1.What is hadoop?

Hadoop runs applications using the MapReduce algorithm, where the data is processed in
parallel on different CPU nodes. In short, the Hadoop framework is capable enough to develop
applications capable of running on clusters of computers, and they could perform complete statistical
analysis for huge amounts of data.
2. Describe Hadoop architecture?

3. Write a short note on the advantages of Hadoop?


 Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA),
rather Hadoop library itself has been designed to detect and handle failures at the application
layer.

 Servers can be added or removed from the cluster dynamically and Hadoop continues to
operate without interruption.
4. Define MapReduce in Hadoop?

Hadoop MapReduce is a software framework for easily writing applications which process
big amounts of data in parallel on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner.
5. Explain the framework of MapReduce?
The MapReduce framework consists of a single master JobTracker and one slave
TaskTracker per cluster-node. The master is responsible for resource management, tracking resource
consumption/availability and scheduling the jobs' component tasks on the slaves, monitoring them
and re-executing the failed tasks. The slave TaskTrackers execute the tasks as directed by the master
and provide task-status information to the master periodically.

6. Explain the working of the MapReduce algorithm?


The MapReduce algorithm contains two important tasks, namely Map and Reduce.


 The Map task takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).
 The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples. The reduce task is always
performed after the map job.
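The classic illustration is word count. Below is only a conceptual sketch in base R of the Map step (emit (word, 1) pairs) and the Reduce step (sum the values per key); it imitates the idea but is not Hadoop code, and the input lines are made up.

docs <- c("big data is big", "data is data")   # toy input records

# Map: break each record into (key = word, value = 1) pairs
pairs <- do.call(rbind, lapply(docs, function(line) {
  words <- unlist(strsplit(line, " "))
  data.frame(key = words, value = 1)
}))

# Shuffle + Reduce: group the pairs by key and combine their values
counts <- tapply(pairs$value, pairs$key, sum)
print(counts)   # big 2, data 3, is 2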

7. What is the MapReduce algorithm?


The MapReduce algorithm contains two important tasks, namely Map and Reduce.
 The map task is done by means of Mapper Class
 The reduce task is done by means of Reducer Class.
Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is
used as input by Reducer class, which in turn searches matching pairs and reduces them.
8.What is sorting?
Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce
implements sorting algorithm to automatically sort the output key-value pairs from the mapper by
their keys.
9.What is searching?
Searching plays an important role in MapReduce algorithm. It helps in the combiner phase
(optional) and in the Reducer phase. Let us try to understand how Searching works with the help of
an example.

10.What do you mean by indexing?


Normally indexing is used to point to a particular data and its address. It performs batch
indexing on the input files for a particular Mapper. The indexing technique that is normally used in
MapReduce is known as the inverted index. Search engines like Google and Bing use the inverted
indexing technique.
11.What do you mean by TF-IDF?
TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse
Document Frequency. It is one of the common web analysis algorithms. Here, the term 'frequency'
refers to the number of times a term appears in a document.

11.What is called as Inverse Document Frequency?


It measures the importance of a term. It is calculated by the number of documents in the text
database divided by the number of documents where a specific term appears.
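A small numeric sketch in R of how TF and IDF combine for one term over a made-up three-document corpus; taking the logarithm of the ratio described above is the conventional scaling.

docs <- list(d1 = c("big", "data", "big"),
             d2 = c("data", "analytics"),
             d3 = c("big", "clusters"))
N    <- length(docs)
term <- "big"

tf  <- sapply(docs, function(d) sum(d == term) / length(d))  # term frequency per document
df_ <- sum(sapply(docs, function(d) term %in% d))            # documents containing the term
idf <- log10(N / df_)                                        # inverse document frequency

print(round(tf * idf, 4))   # TF-IDF of "big" in each document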

12.What is called as an NOSQL database?


A NoSQL database (sometimes called Not Only SQL) is a database that provides a
mechanism to store and retrieve data other than the tabular relations used in relational databases.
These databases are schema-free, support easy replication, have a simple API, are eventually
consistent, and can handle huge amounts of data.

13.What are the objectives of an NOSQL database?


The primary objective of a NoSQL database is to have
 simplicity of design,
 horizontal scaling, and
 finer control over availability.
NoSQL databases use different data structures compared to relational databases. This makes
some operations faster in NoSQL. The suitability of a given NoSQL database depends on the
problem it must solve.
14. What is Apache Cassandra and list some of its features?
Apache Cassandra is an open source, distributed and decentralized/distributed storage
system (database), for managing very large amounts of structured data spread out across the world.
It provides highly available service with no single point of failure.
Listed below are some of the notable points of Apache Cassandra:
 It is scalable, fault-tolerant, and consistent.
 It is a column-oriented database.

15.What are the key components of Cassandra?


The key components of Cassandra are as follows −
 Node − It is the place where data is stored.
 Data center − It is a collection of related nodes.
 Cluster − A cluster is a component that contains one or more data centers.
 Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write
operation is written to the commit log.

16. What are the features of HBase?

 HBase is linearly scalable.


 It has automatic failure support.

 It provides consistent read and writes.


 It integrates with Hadoop, both as a source and a destination.
 It has an easy Java API for clients.
 It provides data replication across clusters.
17.What is an Zookeeper and mention its features?
 Zookeeper is an open-source project that provides services like maintaining configuration
information, naming, providing distributed synchronization, etc.
 Zookeeper has ephemeral nodes representing different region servers. Master servers use
these nodes to discover available servers.

UNIT-IV

1.What are the different ways of data protection?


 The first caveat is access. Data can be easily protected, but only if you eliminate access to the
data. That's not a pragmatic solution, to say the least. The key is to control access, but even
then, knowing the who, what, when, and where of data access is only a start.
 The second caveat is availability: controlling where the data are stored and how the data are
distributed. The more control you have, the better you are positioned to protect the data.
 The third caveat is performance. Higher levels of encryption, complex security
methodologies, and additional security layers can all improve security.
2.What are the pragmatic ways to secure big data?
Securing the massive amounts of data that are inundating organizations can be addressed in
several ways. A starting point is to basically get rid of data that are no longer needed; keeping
information that has no value only creates unnecessary security risk, and that risk grows every day
for as long as the information is kept. Of course, there are situations in which information cannot
legally be destroyed; in that case, the information should be securely archived by an offline method.
3.What do you mean by classifying data?
Classification can become a powerful tool for determining the sensitivity of data. A simple
approach may just include classifications such as financial, HR, sales, inventory, and
communications, each of which is self-explanatory and offers insight into the sensitivity of the data.
4.What are the compliance issues in bigdata?

Compliance issues are becoming a big concern in the data center, and these issues have a
major effect on how Big Data is protected, stored, accessed, and archived. Whether Big Data is
going to reside in the data warehouse or in some other more scalable data store remains unresolved
for most of the industry; it is an evolving paradigm.

5.How is big data applicable to health care industry?

 The concepts of Big Data are as applicable to health care as they are to other businesses.
Health care deals with these massive data sets using Big Data stores, which can span tens of
thousands of computers to enable enterprises, researchers, and governments to develop
innovative products, make important discoveries, and generate new revenue streams.
 In the medical industry, the primary problem is that unsecured Big Data stores are filled with
content that is collected and analyzed in real time and is often extraordinarily sensitive:
intellectual property, personal identifying information, and other confidential information.

6.What are the data stores used in big data?

The data stores used in big data are Hadoop, Cassandra, and MongoDB.

7.What are the major goals of big data security?

 Control access by process, not job function.


 Secure the data at rest.
 Protect the cryptographic keys and store them separately from the data.
 Create trusted applications and stacks to protect data from rogue users.

8.List out the basic rules to enable security?

 Ensure that security does not impede performance or availability.


 Pick the right encryption scheme.
 Ensure that the security solution can evolve with your changing requirements.

9.What is an IP?

One of the biggest issues around Big Data is the concept of intellectual property
(IP).Intellectual property refers to creations of the human mind, such as inventions, literary and
artistic works, and symbols, names, images, and designs used in commerce. Although this is a rather
broad description, it conveys the essence of IP.


10.What do you mean by cyber defense?

 Cyber attacks involve advanced and sophisticated techniques to infiltrate corporate


networks and enterprise systems.
 Types of attacks include advanced malware, zero day attacks and advanced persistent
threats.
 Advance warning about attackers and intelligence about the threat landscape is
considered by many security leaders to be essential features in security technologies.
11.Purpose of big data analytics in cyber defense?
The purpose of the Big Data Analytics in Cyber Defense study sponsored by Teradata
and conducted by Ponemon Institute is to learn about organizations’ cyber security
defenses and the use of big data analytics to become more efficient in recognizing the
patterns that represent network threats.
12.What do you mean by SIEM?
SIEM helps us create an environment that allows us to use a broad range of tools, some of
which we select for a specific customer environment, and yet accrue data in a common
environment and use that common environment for correlation and analysis.
13.What do you mean by end points?
Increasing enterprise system complexity also creates a driver for SIEM. Today's
organizations are adding greater numbers of connections, also known as endpoints, to their
systems due to the bring-your-own-device trend, expanding supply chains, or a desire to link their IT
systems with their industrial systems.
UNIT 5
1.What is mapreduce?

MapReduce is a framework using which we can write applications to process huge amounts of data,
in parallel, on large clusters of commodity hardware in a reliable manner.

2.Terminology used in Mapreduce:

•PayLoad
•NameNode
•DataNode
•MasterNode


•SlaveNode
•JobTracker
•Task Tracker
•Job
•Task
•task attempt.
3.What is RDBMS?

RDBMS stands for Relational Database Management System. RDBMS is the basis for SQL, and for
all modern database systems like MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft
Access.

4.Challenges of RDBMS

•RDBMS assumes a well-defined structure of data and assumes that the data is largely uniform.

•It needs the schema of your application and its properties (columns, types, etc.) to be defined up-
front before building the application. This does not match well with the agile development
approaches for highly dynamic applications.

•As the data starts to grow larger, you have to scale your database vertically, i.e. adding more
capacity to the existing servers.

5.What are the Benefits of NoSQL over RDBMS?

 Schema Less
 Dynamic and Agile
 Scales Horizontally
 Better Performance

6.What are theTypes of NoSQL Databases ?

 Document oriented database


 Graph based database
 Column based
 Key value database

7. What are the popular NoSQL Databases?

• Document Oriented Databases – MongoDB, CouchDB, Amazon SimpleDB, etc.
• Graph Based Databases – Neo4j, OrientDB, Facebook Open Graph, FlockDB, etc.
• Column Based Databases – HBase, Cassandra, Hypertable, etc.
• Key Value Databases – Membase, Redis, MemcacheDB, etc.

8. What is a NoSQL Database?

A NoSQL (originally referring to "non SQL" or "non relational") database provides a mechanism
for storage and retrieval of data that is modeled in means other than the tabular relations used in
relational databases (RDBMS).

IT-E79BIG DATABASES Part B

Academic Year 2025 – 2026


Question Bank
UNIT I
1. Discuss on the four dimensions which are related to the primary aspects of Big Data. (NOV 19) (or)
Elucidate the multidimensional terms related to Bigdata (Jan 23)

2. Why Big Data is important? Explain. (NOV 19)

3. What is Big Data? Summarize the evolution of big data. (MAR 21)(or) Describe the evolution of
Bigdata(Jan 23)

4. Explain characteristics of big data and discuss the importance of big data analytics in various
business domains. (MAR 21)

5. Discuss in detail about the evolution of Big data. (SEP 21)

6. Interpret the challenges and issues involved in Big data. (SEP 21)(or) List and explain the
characteristics, challenges and issues in bigdata(Dec 2023)
7. Discuss the use of Big data analytics and its importance with suitable real world example. (MAR 22)

8. Narrate about basics and future Big data in detail.(MAR 22)

9. Illustrate the need for bigdata and briefly discuss the applications of Bigdata(Dec 2023)(or) what is
bigdata analytics? Explain four V’s of bigdata. Briefly discuss applications of Bigdata. (Dec 2024).

10. What are the benefits of bigdata? Discuss challenges under bigdata. How big data analytics can be
useful in the development of smart cities(Dec 2024).

UNIT –2
1. Describe a Data Frame in R with its basic function.(NOV 19)
2. Review the hybrid data modeling approach in detail. (MAR 21)
3. Describe about how to handle Big data analytics with R programming. (SEP 21)
4. Explain in detail 'about data computing modelling(SEP 21)(or) Explain R modeling
architecture(Jan 23)
5. Explain the concept of analyzing and exploring data with R language. (MAR 22)(or) List
the steps to explore data in R(Jan 23)


6. Illustrate about Hybrid data modelling with an example(MAR 22)


7.8. Outline the advantages, disadvantages and applications of R programming Languages(Dec
2023)
9. Discuss data analysis and data visualization methods in R programming with suitable example
programs(Dec 2023)
10. How analytical tools have evolved from graphical user interface to point solutions to data
visualization tools (Dec 2024).
11. Explain in detail the modeling Architecture and its types with appropriate diagram (Dec 2024).
UNIT –3
1. Explain about the basic parameters of mapper and reducer function. (NOV 19)

2. Compare RDBMS with Hadoop MapReduce(NOV 19)

3. Describe core architecture of Hadoop with suitable block diagram. Discuss the role of
each component in detail. (MAR 21) (or) Illustrate the architecture of Hadoop with
suitable block diagram. Discuss the role of each component in detail (Dec 2023)
4. Give the features of column oriented database management system, explain the
storage management of HBASE with example (MAR 21)
5. Describe the major components of Cassandra Data Model.(or) Write a detailed note
on Apache Cassandra (Jan 23)
6. Specify the similarities and differences between Hadoop, HBASE, and Cassandra

7. Outline the features of Hadoop and explain the functionalities of Hadoop(SEP 21).

8. How is Hadoop streaming suited to text processing? Explain. (Dec 2024)

9. Describe HBASE and write down the advantages of storing data with HBASE. (SEP 21)

10. Describe about Map-reduce framework in detail. (MAR 22)

11. Highlight the features of Apache Mahout in detail,(MAR 22)(or) Describe the feature of Apache
Mahout(Jan 23)

12. Discuss the NoSQL data stores and their characteristics features(Dec 2023) (or) Explain in detail
about an open source NoSQL Distributed database. (Dec 2024).


UNIT –4
1. Elucidate compliance issues and its major effect on Big Data. (NOV 19)

2. Discuss about the following:

(a) Pragmatic Steps to Secure Big Data.

(b) Protecting Big Data Analytics. (NOV 19)

(or) Describe bigdata compliance and list the basic rules that enable security in bigdata(Dec 2023)

3. Brief the role of data classification to determine the sensitivity of data . Create
appropriate security rules and compliance objectives for health care industry. (MAR
21) (or) Explain in detail about the bigdata security and its compliance (Dec 2024).

4. State the relation of big data analytics to cyber security. Give detailed description of
how business can utilize big data analytics to address cyber security threats. (MAR 21)
5. Illustrate about classifying data and protecting Big data analytics, (SEP 21).

6. Discuss in detail about Big data in cyber defence. (SEP 21) (or) How Big data helps in cyber
defence(Jan 23)
7. Explain about the pragmatic steps to securing Big data in detail. (MAR 22)(or) Discuss
the pragmatic steps to secure Bigdata(Jan 23)(or) Why bigdata security is essential? Explain in detail
about bigdata security(Dec 2023)
8. Paraphrase about Intellectual Challenges in Big data. (MAR 22)(or) Write in detail about
the intellectual property challenge and the use of Bigdata in cyber defence. (Dec 2024).
UNIT –5
1. (a) Analyze the SimpleDB Data Model and Architecture.
(b) "Big data is dependent upon a scalable and extensible information foundation" (NOV 19)

2. Examine the new analysis practices for big data. (NOV 19) (or) Discuss the new analysis practices for
bigdata (Jan 23) (or) Outline the new analysis practices for Bigdata in detail with a case study. (Dec 2023)

3. Analyze the impact of using Map Reduce technique on 'Count of URL access frequency' in
large clusters. (MAR 21) (or) Describe in detail the mapper class, reducer class and scaling
out with an example (Dec 2024).
4. Discuss the role of big data in education industry to improve the operational effectiveness
and working of educational institutes. (MAR 21)

5. Describe about RDBMS to NoSQL reviewing next generation non-relational databases. (SEP 21) (or)
Discuss briefly about the evolution from RDBMS to NoSQL (Dec 2024).

6. Define Big data Analytics, Explain in detail about new analysis practices for Big data.(SEP 21).

7. Paraphrase about simplified data processing on large clusters. (MAR 22) (or) Explicate simplified
method of data processing on large clusters (Jan 23) (or) Demonstrate the implementation of data processing
on large clusters using map reduce with example (Dec 2023)
8. Explain in detail about the real-world use of Big data with an example. (MAR 22)

5477194
B.Tech. DEGREE EXAMINATION, JANUARY 2023.

Seventh Semester

Information Technology

Elective: BIG DATABASES

(2013-14 Regulations)

Time : Three hours        Maximum : 75 marks

PART A - (10 x 2 = 20 marks)

Answer ALL questions.

1. How Big Data differs from the conventional Data?
2. List the importance of Big Data in Data Analytics.
3. What is "R"?
4. What do you mean by Hybrid data modelling?
5. List out the advantages of NoSQL.
6. For which type of applications Apache HBASE is suitable?
7. What is big data compliance?
8. What is Intellectual property protection and brief its significance in big data?
9. List any two new generation non-relational databases.
10. Brief any one of the real-world uses of big data.

PART B - (5 x 11 = 55 marks)

Answer ALL questions.

UNIT I
11. Elucidate the multidimensional terms related to Big Data.
Or
12. Describe the evolution of Big Data.

UNIT II
13. List the steps to explore data in "R".
Or
14. Explain "R" modelling architecture.

UNIT III
15. Describe the features of Apache Mahout.
Or
16. Write a detailed note on Apache Cassandra.

UNIT IV
17. Discuss the pragmatic steps to secure Big Data.
Or
18. How Big Data helps in Cyber defence.

UNIT V
19. Explicate simplified method of data processing on large clusters.
Or
20. Discuss the new analysis practices for big data.
5477194
B.Tech. DEGREE EXAMINATION, MARCH 2022.

Seventh Semester

Information Technology

Elective: BIG DATABASE

Time: Three hours        Maximum: 75 marks

SECTION A - (10 x 2 = 20 marks)

Answer ALL the questions.
All questions carry equal marks.

1. Define Big data platform.
2. List the challenges of Big data.
3. What is the use of vector object in R programming?
4. State the advantages of Big data analytics.
5. Point out the features of Hadoop software.
6. Differentiate Cassandra and HBASE.
7. How can we protect big data?
8. Mention the advantages of Big data analytics.
9. Annotate data processing.
10. Summarize the real-world use of big data.

SECTION B - (5 x 11 = 55 marks)

Answer ALL questions, ONE question from each Unit.
All questions carry equal marks.

UNIT I
11. Discuss the use of Big data analytics and its importance with suitable real world example. (11)
Or
12. Narrate about basics and future Big data in detail. (11)

UNIT II
13. Explain the concept of analysing and exploring data with R language. (11)
Or
14. Illustrate about Hybrid data modelling with an example. (11)

UNIT III
15. Describe about MapReduce framework in detail. (11)
Or
16. Highlight the features of Apache Mahout in detail. (11)

UNIT IV
17. Explain about the pragmatic steps to securing Big data in detail. (11)
Or
18. Paraphrase about Intellectual Property Challenges in Big data. (11)

UNIT V
19. Describe about RDBMS to NoSQL reviewing next generation non-relational databases. (11)
Or
20. Define Big data Analytics, Explain in detail about new analysis practices for Big data. (11)

5477194
B.Tech. DEGREE EXAMINATION, SEPTEMBER 2021.

Seventh Semester

Information Technology

BIG DATABASES

Time: Three hours        Maximum: 75 marks

SECTION A - (10 x 2 = 20 marks)

Answer ALL questions.
All questions carry equal marks.

1. Define Big data.
2. Point out the importance of Big data analytics.
3. Write the basic syntax of an R function definition.
4. List out the significance of hybrid data modelling.
5. Summarize Apache Mahout.
6. Differentiate between SQL and NoSQL.
7. Annotate Big data security.
8. How Big data is important for Compliance?
9. Comprehend cluster in Big data.
10. What are the benefits of MapReduce?

SECTION B - (5 x 11 = 55 marks)

Answer ALL questions, ONE question from each Unit.
All questions carry equal marks.

UNIT I
11. Discuss in detail about the evolution of Big data. (11)
Or
12. Interpret the challenges and issues involved in Big data. (11)

UNIT II
13. Describe about how to handle Big data analytics with R programming. (11)
Or
14. Explain in detail about data computing modelling. (11)

UNIT III
15. Outline the features of Hadoop and explain the functionalities of Hadoop. (11)
Or
16. Describe HBASE and write down the advantages of storing data with HBASE. (11)

UNIT IV
17. Illustrate about classifying data and protecting Big data analytics. (11)
Or
18. Discuss in detail about Big data in cyber defence. (11)

UNIT V
19. Paraphrase about simplified data processing on large clusters. (11)
Or
20. Explain in detail about the real-world use of Big data with an example. (11)
5477194
B.Tech. DEGREE EXAMINATION, MARCH 2021.

Seventh Semester

Information Technology

BIG DATABASES

Time: Three hours        Maximum: 75 marks

Answer ALL questions.

PART A - (10 x 2 = 20 marks)

1. List the four dimensions related to primary aspects of Big Data.
2. Specify the top challenges facing big data.
3. Mention the features of R Programming.
4. Write the differences between hybrid data modeling and data computing modeling.
5. Specify the components of Hadoop.
6. HBase does not support a structured query language like SQL. Why?
7. Identify the best practices that can be adhered to protect big data.
8. State the role of big data in anomaly-based intrusion detection.
9. List the differences between NoSQL and relational databases.
10. Narrate the important tasks of Map Reduce technique.

PART B - (5 x 11 = 55 marks)

11. What is Big Data? Summarize the evolution of big data. (11)
Or
12. Explain characteristics of big data and discuss the importance of big data analytics in various business domains. (11)

13. Elaborate the responsibility of R tool in analyzing and exploring the big data. (11)
Or
14. Review the hybrid data modeling approach in detail. (11)

15. Describe core architecture of Hadoop with suitable block diagram. Discuss the role of each component in detail. (11)
Or
16. Give the features of column-oriented database management system. Explore the storage mechanism of HBase with an example. (11)

17. Brief the role of data classification to determine the sensitivity of data. Create appropriate security rules and compliance objectives for health care industry. (11)
Or
18. State the relation of big data analytics to cyber security. Give detailed description of how business can utilize big data analytics to address cyber security threats. (11)

19. Analyze the impact of using Map Reduce technique on 'Count of URL access frequency' in large clusters. (11)
Or
20. Discuss the role of big data in education industry to improve the operational effectiveness and working of educational institutes. (11)
5477194
B.Tech. DEGREE EXAMINATION, NOVEMBER 2019.

Seventh Semester

Information Technology

Elective - BIG DATABASES

(2013-14 Onwards Regulations)

Time : Three hours        Maximum: 75 marks

PART A - (10 x 2 = 20 marks)

Answer ALL the questions.
All questions carry equal marks.

1. Define Big Data.
2. "Data and Data Analytics are becoming more complex" - Comment.
3. What is exploratory data analysis in R?
4. How do you assign a variable in R?
5. Can reducers communicate with each other? - Justify.
6. Mention the operational commands in HBASE.
7. Write the basic rules used for enabling the Big Data security.
8. How the data can be protected?
9. What are master data structures?
10. State the characteristics of NoSQL Databases.

PART B - (5 x 11 = 55 marks)

Answer ALL questions, choosing ONE from each Unit.

UNIT I
11. (a) Discuss on the four dimensions which are related to the primary aspects of Big Data. (6)
    (b) Why Big Data is important? Explain. (5)
Or
12. What are the issues and challenges associated with Big Data? Explain. (11)

UNIT II
13. Describe a Data Frame in R with its basic function. (11)
Or
14. Narrate the different data types/objects in R with suitable illustrations. (11)

UNIT III
15. (a) Explain about the basic parameters of mapper and reducer function. (7)
    (b) Compare RDBMS with Hadoop MapReduce. (4)
Or
16. (a) Describe the major components of Cassandra Data Model. (6)
    (b) Specify the similarities and differences between Hadoop, HBASE, and Cassandra. (5)

UNIT IV
17. Elucidate compliance issues and its major effect on Big Data. (11)
Or
18. Discuss about the following:
    (a) Pragmatic Steps to Secure Big Data. (6)
    (b) Protecting Big Data Analytics. (5)

UNIT V
19. (a) Analyze the SimpleDB Data Model and Architecture. (8)
    (b) "Big data is dependent upon a scalable and extensible information foundation" - Comment. (3)
Or
20. Examine the new analysis practices for big data. (11)
