Big Database Overall Notes
SYLLABUS:
Unit I   Introduction to Big Data: Big Data – The Evolution of Big Data – Basics – Big Data Analytics and its Importance – Challenges – Issues – Future of Big Data (12 hours)
Unit II  Basic Big Data Analytic Methods and Modeling: Introduction to “R”, analyzing and exploring data with “R” – Modeling: Architecture – Hybrid Data Modeling – Data Computing Modeling (12 hours)
Unit IV  Big Data Security: Big Data Security, Compliance, Auditing and Protection: Pragmatic Steps to Securing Big Data, Classifying Data, Protecting Big Data Analytics, Big Data and Compliance, The Intellectual Property Challenge – Big Data in Cyber Defense (12 hours)
TOTAL HOURS: 60
Text Books:
1. Frank J. Ohlhorst, "Big Data Analytics: Turning Big Data into Big Money", Wiley & SAS Business Series, 2013.
Reference Books:
1. Paul C. Zikopoulos, Chris Eaton, Dirk deRoos, Thomas Deutsch, George Lapis,
“Understanding Big Data Analytics for Enterprise Class Hadoop and Streaming
Data”, The McGraw Hill, 2012.
2. “Planning for Big Data”, O’Reilly Radar Team, 2012.
3. “Big Data Now Current Perspectives”, O’Reilly Media, 2011.
UNIT 1
Introduction to Big Data: Big Data – The Evolution of Big data - Basics - Big Data Analytics and
its Importance – challenges- Issues- Future of Big Data
Course Objectives:
To understand the concepts of Big Data
To understand the importance of big data analytics
Course Outcomes:
The students learn the evolution of big data
The students will be able to analyze various applications using Big Data
Big Data refers to collections of large datasets that cannot be processed using traditional computing techniques. Big Data is not merely data; it has become a complete subject, which involves various tools, techniques and frameworks.
There is a need to store data in a wide variety of formats. With the evolution and advancement of technology, the amount of data that is being generated is ever increasing. Sources of
Big Data can be broadly classified into six different categories as shown below.
Enterprise Data
There are large volumes of data in enterprises in different formats. Common formats include flat
files, emails, Word documents, spreadsheets, presentations, HTML pages/documents, PDF
documents, XMLs, legacy formats, etc. This data that is spread across the organization in different
formats is referred to as Enterprise Data.
Transactional Data
Every enterprise has some kind of applications which involve performing different kinds of
transactions like Web Applications, Mobile Applications, CRM Systems, and many more. To
support the transactions in these applications, there are usually one or more relational databases as a
backend infrastructure. This is mostly structured data and is referred to as Transactional Data.
Social Media
This is self-explanatory. There is a large amount of data getting generated on social networks like
Twitter, Facebook, etc. The social networks usually involve mostly unstructured data formats which
include text, images, audio, videos, etc. This category of data source is referred to as Social Media.
Activity Generated
There is a large amount of data being generated by machines which surpasses the data volume
generated by humans. These include data from medical devices, sensor data, surveillance videos,
satellites, cell phone towers, industrial machinery, and other data generated mostly by machines.
These types of data are referred to as Activity Generated data.
Public Data
This data includes data that is publicly available like data published by governments, research data
published by research institutes, data from weather and meteorological departments, census data,
Wikipedia, sample open source data feeds, and other data which is freely available to the public.
This type of publicly accessible data is referred to as Public Data.
Archives
Organizations archive a lot of data which is either not required anymore or is very rarely required. In
today's world, with hardware getting cheaper, no organization wants to discard any data; they want
to capture and store as much data as possible. Other data that is archived includes scanned
documents, scanned copies of agreements, records of ex-employees/completed projects, banking
transactions older than the compliance regulations. This type of data, which is less frequently
accessed, is referred to as Archive Data.
1.3 Dimensions of Big Data or Characteristics of Big Data
Big Data can be described in multidimensional terms, in which four dimensions relate to the primary aspects of Big Data. These dimensions can be defined as follows:
1. Volume. Big Data comes in one size: large. Enterprises are awash with data, easily amassing
terabytes and even petabytes of information.
2. Variety. Big Data extends beyond structured data to include unstructured data of all varieties:
text, audio, video, click streams, log files, and more.
3. Veracity. The massive amounts of data collected for Big Data purposes can lead to statistical
errors and misinterpretation of the collected information. Purity of the information is critical for
value.
4. Velocity. Often time sensitive, Big Data must be used as it is streaming into the enterprise in order
to maximize its value to the business, but it must also still be available from the archival sources as
well.
These 4 Vs of Big Data lay out the path to analytics, with each having intrinsic value in the process of discovering value.
1.4 Evolution of Big Data
Data has always been around and there has always been a need for storage, processing, and
management of data, since the beginning of human civilization and human societies. However, the
amount and type of data captured, stored, processed, and managed depended then and even now on
various factors including the necessity felt by humans, available tools/technologies for storage,
processing, management, effort/cost, and ability to gain insights into the data, make decisions, and
so on.
Going back a few centuries, in the ancient days, humans used very primitive ways of capturing/storing data like carving on stones, metal sheets, wood, etc. Then, with new inventions and advancements over the centuries, humans started capturing data on paper, cloth, etc. As time
progressed, the medium of capturing/storage/management became punching cards followed by
magnetic drums, laser disks, floppy disks, magnetic tapes, and finally today we are storing data on
various devices like USB Drives, Compact Discs, Hard Drives, etc.
As we can clearly see from this trend, the capacity of data storage has been increasing
exponentially, and today with the availability of the cloud infrastructure, potentially one can store
unlimited amounts of data. Today Terabytes and Petabytes of data is being generated, captured,
processed, stored, and managed.
The future of Big Data depends on Smart Data. Smart Data supports rapid integration of either unstructured or semi-structured data. The self-describing properties of Smart Data are practically necessities for the massive quantities, differentiated data types, and high volumes of Big Data.
The definition of big data holds the key to understanding big data analysis. According to the Gartner
IT Glossary, Big Data is high-volume, high-velocity, and high-variety information assets that
demand cost effective, innovative forms of information processing for enhanced insight and decision
making.
Like conventional analytics and business intelligence solutions, big data mining and analytics helps
uncover hidden patterns, unknown correlations, and other useful business information. However, big
data tools can analyze high-volume, high-velocity, and high-variety information assets far better than
conventional tools and relational databases that struggle to capture, manage, and process big data
within a tolerable elapsed time and at an acceptable total cost of ownership.
Organizations are using new big data technologies and solutions such as Hadoop,
MapReduce, Hadoop Hive, Spark, Presto, Yarn, Pig, NoSQL databases, and more to support their
big data requirements.
Big data analytics helps organizations harness their data and use it to identify new opportunities.
That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier
customers.
1. Cost reduction.
Big data technologies such as Hadoop and cloud-based analytics bring significant cost
advantages when it comes to storing large amounts of data – plus they can identify more efficient
ways of doing business.
2. Faster, better decision making.
With the speed of Hadoop and in-memory analytics, combined with the ability to analyze new sources of data, businesses are able to analyze information immediately – and make decisions based on what they've learned.
3. Competitive Advantage
One of the major advantages of big data analytics is that it gives businesses access to data
that was previously unavailable or difficult to access. With increased access to data sources such
as social media streams and clickstream data, businesses can better target their marketing efforts
to customers, better predict demand for a certain product, and adapt marketing and advertising
messaging in real-time. With these advantages, businesses are able to gain an edge on their competitors and act more quickly and decisively than rival organizations.
4. New business opportunities.
The final benefit of big data analytics tools is the possibility of exploring new business
opportunities. Entrepreneurs have taken advantage of big data technology to offer new services
in AdTech and MarketingTech. Mature companies can also take advantage of the data they
collect to offer add-on services or to create new product segments that offer additional value to
their current customers.
In addition to those benefits, big data analytics can pinpoint new or potential audiences that
have yet to be tapped by the enterprise. Finding whole new customer segments can lead to
tremendous new value.
These are just a few of the actionable insights made possible by available big data analytics
tools. Whether an organization is looking to boost sales and marketing results, uncover new
revenue opportunities, improve customer service, optimize operational efficiency, reduce risk,
improve security, or drive other business results, big data insights can help.
5. New products and services.
With the ability to gauge customer needs and satisfaction through analytics comes the power to give customers what they want. Davenport points out that with big data analytics, more companies are creating new products to meet customers' needs.
Capturing data
o Automatic identification and data capture (AIDC) refers to the methods of
automatically identifying objects, collecting data about them, and entering
that data directly into computer systems.
Curation
o Curation is a field of endeavor involved with assembling, managing and presenting
some type of collection.
Storage
o Storage for synchronous analytics. Real-time analytics applications are typically run
on databases like NoSQL, which are massively scalable and can be supported with
commodity hardware.
Searching
o Search technologies specialize in addressing unstructured content sources, helping customers to prepare, analyze and merge insight from human-generated content with structured, machine-generated data.
Sharing
o Share original data in a controlled way so that different groups within your
organization only see part of the whole.
Transfer
o To capitalize on the tremendous business value inherent in big data and Hadoop, there
remains the challenge of big data transfer
Analysis
o It is the process of examining large data sets containing a variety of data types -- i.e., big data.
Presentation
o It takes care that the data is sent in such a way that the receiver will understand the
information (data) and will be able to use the data.
To fulfill the above challenges, organizations normally take the help of enterprise servers.
Making tools easier to use. The Hadoop stack and NoSQL databases really do require programming knowledge to unlock their power.
Getting quicker answers across large data sets. We can get the answers; it is about getting them in "acceptable" amounts of time – getting that 3-hour query down to 5 minutes or less. Apache Impala (incubating) is a good example of work in this space.
Integration with existing tools. There are a few companies out there already working on this, but I'm not seeing any real standards being developed for tight integration. I think that eventually how you query data should be seamless to the person querying it, whether it is in a big data solution, an RDBMS, JSON/XML, etc.
Better security models. There is almost no security in place for virtually all big data
tools. Once you get access, you get access to everything. Improvements are being
made, but it still isn't enterprise grade.
Defining best practices for developing and using big data tools.
Creating industry solutions that utilize big data. We already see this happening in a couple of places, like utilities and healthcare, but it isn't yet widespread.
Defining a use case, beyond analyzing web-based information, that all enterprises can leverage. Right now, most big data use cases are centered around solving problems that involve the web.
More mature software. Right now, Hadoop logs are riddled with errors, warnings, and
numerous other things that are almost impossible to decipher. Yet, the damn thing still
seems to magically work. There needs to be improvements in better predicting and
avoiding problems when running map-reduce jobs and giving human readable
information to solve problems.
Defining what big data actually is.
Adoption. I'm still not seeing truly widespread adoption of big data tools in the enterprise. There are a lot of departmental solutions, technical solutions or proofs of concept.
1.6.1 Future of Big Data.
1. Data volumes will continue to grow. There's absolutely no question that we will continue
generating larger and larger volumes of data, especially considering that the number of handheld
devices and Internet-connected devices is expected to grow exponentially.
2. Ways to analyze data will improve. While SQL is still the standard, Spark is emerging as a
complementary tool for analysis and will continue to grow, according to Ovum.
3. More tools for analysis (without the analyst) will emerge. Microsoft and Salesforce both recently announced features to let non-coders create apps to view business data.
4. Prescriptive analytics will be built in to business analytics software. IDC predicts that half of
all business analytics software will include the intelligence where it's needed by 2020.
5. In addition, real-time streaming insights into data will be the hallmarks of data
winners going forward, according to Forrester. Users will want to be able to use data to make
decisions in real time with programs like Kafka and Spark.
6. Machine learning is a top strategic trend for 2016, according to Gartner. And Ovum predicts
that machine learning will be a necessary element for data preparation and predictive analysis in
businesses moving forward.
7. Big data will face huge challenges around privacy, especially with the new privacy regulation
by the European Union. Companies will be forced to address the 'elephant in the room' around
their privacy controls and procedures. Gartner predicts that by 2018, 50% of business ethics
violations will be related to data.
8. More companies will appoint a chief data officer. Forrester believes the CDO will see a rise in
prominence — in the short term. But certain types of businesses and even generational differences
will see less need for them in the future.
9. “Autonomous agents and things” will continue to be a huge trend, according to Gartner,
including robots, autonomous vehicles, virtual personal assistants, and smart advisers.
10. Big data staffing shortages will expand from analysts and scientists to include architects and
experts in data management according to IDC.
11. But the big data talent crunch may ease as companies employ new tactics. The International
Institute for Analytics predicts that companies will use recruiting and internal training to get their
personnel problems solved.
12. The data-as-a-service business model is on the horizon. Forrester suggests that after IBM's acquisition of The Weather Channel, more businesses will attempt to monetize their data.
13. Algorithm markets will also emerge. Forrester surmises that businesses will quickly learn that
they can purchase algorithms rather than program them and add their own data. Existing services
like Algorithmia, Data Xu, and Kaggle can be expected to grow and multiply.
14. Cognitive technology will be the new buzzword. For many businesses, the link between
cognitive computing and analytics will become synonymous in much the same way that
businesses now see similarities between analytics and big data.
15. "All companies are data businesses now," according to Forrester. More companies will attempt
to drive value and revenue from their data.
16. Businesses using data will see $430 billion in productivity benefits over their competition not
using data by 2020, according to International Institute for Analytics.
17. “Fast data” and “actionable data” will replace big data, according to some experts. The
argument is that big isn't necessarily better when it comes to data, and that businesses use only a fraction of the data they have access to.
Big Data technologies can solve the business problems in a wide range of industries. Below are a
few use cases.
Retail
o Targeting customers with different discounts, coupons, and promotions etc. based on
demographic data like gender, age group, location, occupation, dietary habits, buying patterns,
and other information which can be useful to differentiate/categorize the customers.
Marketing
o Specifically outbound marketing can make use of customer demographic information like
gender, age group, location, occupation, and dietary habits, customer interests/preferences
usually expressed in the form of comments/feedback and on social media networks.
o Customer's communication preferences can be identified from various sources like polls,
reviews, comments/feedback, and social media etc. and can be used to target customers via
different channels like SMS, Email, Online Stores, Mobile Applications, and Retail Stores etc.
Sentiment Analysis
o Organizations use the data from social media sites like Facebook, Twitter etc. to understand
what customers are saying about the company, its products, and services. This type of analysis
is also performed to understand which companies, brands, services, or technologies people are
talking about.
Customer Service
o IT Services and BPO companies analyze the call records/logs to gain insights into customer
complaints and feedback, call center executive response/ability to resolve the ticket, and to
improve the overall quality of service.
o Call center data from telecommunications industries can be used to analyze the call
records/logs and optimize the price, and calling, messaging, and data plans etc.
Apart from these, Big Data technologies/solutions can solve the business problems in other
industries like Healthcare, Automobile, Aeronautical, Gaming, and Manufacturing etc.
Data Storage and Management
Data Cleaning
Data Mining
Data Analysis
Data Visualization
Data Integration
Data Languages
Hadoop
The name Hadoop has become synonymous with big data. It is an open-source software framework for distributed storage of very large datasets on computer clusters. All of that means you can scale your data up and down without having to worry about hardware failures. Hadoop provides massive
amounts of storage for any kind of data, enormous processing power and the ability to handle
virtually limitless concurrent tasks or jobs.
Cloudera
Cloudera is essentially a brand name for Hadoop with some extra services stuck on. They can help
your business build an enterprise data hub, to allow people in your organization better access to the
data you are storing. While it does have an open source element, Cloudera is mostly an enterprise solution to help businesses manage their Hadoop ecosystem. Essentially, they do a lot of the hard work of administering Hadoop for you. They will also deliver a certain amount of data security, which is highly important if you're storing any sensitive or personal data.
MongoDB
MongoDB is the modern, start-up approach to databases. Think of them as an alternative to
relational databases. It's good for managing data that changes frequently or data that is unstructured
or semi-structured. Common use cases include storing data for mobile apps, product catalogs, real-
time personalization, content management and applications delivering a single view across multiple
systems.
Talend
Talend is another great open source company that offers a number of data products. Here we're focusing on their Master Data Management (MDM) offering, which combines real-time data, applications, and process integration with embedded data quality and stewardship. Because it's open source, Talend is completely free, making it a good option no matter what stage of business you are in. And it saves you having to build and maintain your own data management system – which is a
tremendously complex and difficult task.
OpenRefine
OpenRefine (formerly Google Refine) is an open source tool that is dedicated to cleaning messy data. You can explore huge data sets easily and quickly even if the data is a little unstructured. As far as data software goes, OpenRefine is pretty user-friendly, though a good knowledge of data cleaning
principles certainly helps. The nice thing about OpenRefine is that it has a huge community with lots
of contributors meaning that the software is constantly getting better and better.
DataCleaner
DataCleaner recognises that data manipulation is a long and drawn out task. Data visualization tools
can only read nicely structured, "clean" data sets. DataCleaner does the hard work for you and
transforms messy semi-structured data sets into clean readable data sets that all of the visualization
companies can read. DataCleaner also offers data warehousing and data management services. The
company offers a 30-day free trial and then after that a monthly subscription fee.
RapidMiner
With a client list that includes PayPal, Deloitte, eBay and Cisco, RapidMiner is a fantastic tool for predictive analysis. It's powerful, easy to use and has a great open source community behind it. You
can even integrate your own specialized algorithms into RapidMiner through their APIs.
SPSS Modeler
SPSS Modeler is a heavy-duty solution that is well suited for the needs of big companies. It can run
on virtually any type of database and you can integrate it with other IBM SPSS products such as
SPSS collaboration and deployment services and the SPSS Analytic server.
Teradata
Teradata recognizes the fact that, although big data is awesome, if you don't actually know how to analyze and use it, it's worthless. Imagine having millions upon millions of data points without the skills to query them. That's where Teradata comes in. They provide end-to-end solutions and
services in data warehousing, big data and analytics and marketing applications. This all means that
you can truly become a data-driven business. Teradata also offers a whole host of services including
implementation, business consulting, training and support.
FramedData
If you're after a specific type of data mining, there are a bunch of startups which specialize in helping businesses answer tough questions with data. If you're worried about user churn, we recommend FramedData, a startup which analyzes your analytics and tells you which customers are about to abandon your product.
Kaggle
If you're stuck on a data mining problem or want to try solving the world's toughest problems, check out Kaggle. Kaggle is the world's largest data science community. Companies and researchers post
their data and statisticians and data miners from all over the world compete to produce the best
models.
1.7.4 Data Analysis Tool
Qubole
Qubole simplifies, speeds and scales big data analytics workloads against data stored on AWS,
Google, or Azure clouds. They take the hassle out of infrastructure wrangling. Once the IT policies
are in place, any number of data analysts can be set free to collaboratively "click to query" with the power of Hive, Spark, Presto and many others in a growing list of data processing engines. Qubole is an enterprise-level solution, and they offer a free trial. The flexibility of the program really sets it apart from the rest, and it is among the most accessible of the platforms.
BigML
BigML is attempting to simplify machine learning. They offer a powerful Machine Learning service
with an easy-to-use interface for you to import your data and get predictions out of it. You can even
use their models for predictive analytics. A good understanding of modeling is certainly helpful, but
not essential, if you want to get the most from BigML. They have a free version of the tool that
allows you to create tasks that are under 16mb as well as having a pay as you go plan and a virtual
private cloud that meet enterprise-grade requirements.
Statwing
Statwing takes data analysis to a new level providing everything from beautiful visuals to complex
analysis. They have a particularly cool blog post on NFL data! It's so simple to use that you can actually get started with Statwing in under 5 minutes. This allows you to use unlimited datasets of up
to 50mb in size each. There are other enterprise plans that give you the ability to upload bigger
datasets.
1.7.5 Data Visualization Tool
Tableau
Tableau is a data visualization tool with a primary focus on business intelligence. You can create
maps, bar charts, scatter plots and more without the need for programming. They recently released a
web connector that allows you to connect to a database or API thus giving you the ability to get live
data in a visualisation. Exploring that tool should give you an idea of which of the other Tableau
products you'd rather pay for.
Silk
Silk is a much simpler data visualization and analytical tool than Tableau. It allows you to bring your
data to life by building interactive maps and charts with just a few clicks of the mouse. Silk also
allows you to collaborate on a visualisation with as many people as you want.
CartoDB
CartoDB is a data visualization tool that specialises in making maps. They make it easy for anyone to visualize location data – without the need for any coding. CartoDB can manage a myriad of data files and types, and they even have sample datasets that you can play around with while you're getting the hang of it.
Chartio
Chartio allows you to combine data sources and execute queries in-browser. You can create
powerful dashboards in just a few clicks. Chartio's visual query language allows anyone to grab data
from anywhere without having to know SQL or other complicated model languages. They also let
you schedule PDF reports so you can export and email your dashboard as a PDF file to anyone you
want.
Plot.ly
If you want to build a graph, Plot.ly is the place to go. This handy platform allows you to create stunning 2D and 3D charts (you really need to see it to believe it!). Again, all without needing programming knowledge. The free version allows you to create one private chart and unlimited public charts, or you can upgrade to the enterprise packages to make unlimited private and public charts as
well as giving you the option for Vector exports and saving of custom themes.
Datawrapper
Datawrapper is an open source tool that creates embeddable charts in minutes. Because it's open source, it will be constantly evolving, as anyone can contribute to it. They have an awesome chart
gallery where you can check out the kind of stuff people are doing with Datawrapper.
1.7.6 Data Integration Tool
Blockspring
Blockspring is a unique program in the way that they harness all of the power of services such as
IFTTT and Zapier in familiar platforms such as Excel and Google Sheets. You can connect to a
whole host of 3rd party programs by simply writing a Google Sheet formula. You can post Tweets
from a spreadsheet, look to see who your followers are following as well as connecting to AWS,
Import.io and Tableau to name a few. Blockspring is free to use, but they also have an organization
package that allows you to create and share private functions, add custom tags for easy search and
discovery and set API tokens for your whole organization at once.
Pentaho
Pentaho offers big data integration with zero coding required. Using a simple drag and drop UI you
can integrate a number of tools with minimal coding. They also offer embedded analytics and
business analytics services too. Pentaho is an enterprise solution. You can request a free trial of
the data integration product, after which, a payment will be required.
R Language
R is a language for statistical computing and graphics. If the data mining and statistical software
listed above doesn't quite do what you want it to, learning R is the way forward. In fact, if you're
planning on being a data scientist, knowing R is a requirement.
Python
Another language that is gaining popularity in the data community is Python. Created in the 1980s
and named after Monty Python's Flying Circus, it has consistently ranked in the top ten most
popular programming languages in the world. Many journalists use Python to write custom scrapers
if data collection tools fail to get the data that they need.
RegEx
RegEx, or Regular Expressions, are sequences of characters that can be used to match, manipulate and change data. They are used mainly for pattern matching with strings, or string matching. At Import.io, you can use RegEx while extracting data to delete parts of a string or keep particular parts of a string. It is an incredibly useful tool when doing data extraction, as you can get exactly what you want when you extract data, meaning you don't need to rely on those data manipulation companies mentioned above!
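As a small illustration in R (the language used later in these notes), the base functions regexpr(), regmatches() and sub() apply regular expressions to keep or delete parts of a string; the sample strings below are invented purely for illustration.
prices <- c("Price: $23.99", "Price: $7.50")
# Keep only the numeric part of each string.
numbers <- regmatches(prices, regexpr("[0-9]+\\.[0-9]+", prices))
print(as.numeric(numbers))             # 23.99  7.50
# Delete the label part of each string instead.
print(sub("Price: \\$", "", prices))   # "23.99" "7.50"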
XPath
XPath is a query language used for selecting certain nodes from an XML document. Whereas RegEx
manipulates and changes the data makeup, XPath will extract the raw data ready for RegEx. XPath is
most commonly used in data extraction. Import.io actually creates XPaths automatically every time you click on a piece of data – you just don't see them! It is also possible to insert your own XPath to
get data from drop down menus and data that is in tabs on a webpage. Put simply, an XPath is a path,
a set of directions to a certain part of the HTML of a webpage.
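As a rough sketch in R, the "XML" package (used again in Unit 2) can apply an XPath expression to pull specific nodes out of a parsed document; the file name and node names here are assumptions for illustration.
library("XML")
# Parse the document and select the NAME node of every EMPLOYEE record.
doc <- xmlParse("employees.xml")
employee.names <- xpathSApply(doc, "//EMPLOYEE/NAME", xmlValue)
print(employee.names)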
UNIT 2
Basic Big Data Analytic Methods and Modeling: Introduction to “R”, analyzing and exploring
data with “R”-Modeling: Architecture - Hybrid Data Modeling – Data Computing Modeling.
Course Objectives:
To understand the concepts of Big Data and R programming
To understand the importance of modeling architecture
Course Outcomes:
The students can use the R programming tool for Big Data
The students will be able to analyze and visualize large data sets using R
Big data analytics is the process of examining large data sets containing a variety of data
types -- i.e., big data -- to uncover hidden patterns, unknown correlations, market trends,
customer preferences and other useful business information.
The primary goal of big data analytics is to help companies make more informed business
decisions by enabling data scientists, predictive modelers and other analytics
professionals to analyze large volumes of transaction data, as well as other forms of data
that may be untapped by conventional business intelligence (BI) programs.
That could include Web server logs and Internet clickstream data, social media content
and social network activity reports, text from customer emails and survey responses,
mobile-phone call detail records and machine data captured by sensors connected to
the Internet of Things.
Big data can be analyzed with the software tools commonly used as part of advanced
analytics disciplines such as predictive analytics, data mining, text analytics and
statistical analysis.
Many organizations looking to collect, process and analyze big data have turned to a
newer class of technologies that includes Hadoop and related tools such
as YARN, MapReduce, Spark, Hive and Pig, as well as NoSQL databases.
Those technologies form the core of an open source software framework that supports the
processing of large and diverse data sets across clustered systems.
2.2.1 The R environment
R is an integrated suite of software facilities for data manipulation, calculation and graphical
display. Among other things it has
• Graphical facilities for data analysis and display either directly at the computer or on hardcopy
• A well developed, simple and effective programming language (called 'S') which includes
conditionals, loops, user defined recursive functions and input and output facilities. (Indeed most
of the system supplied functions are themselves written in the S language.)
The term “environment” is intended to characterize it as a fully planned and coherent system,
rather than an incremental accretion of very specific and inflexible tools, as is frequently the case
with other data analysis software.
R is very much a vehicle for newly developing methods of interactive data analysis. It has
developed rapidly, and has been extended by a large collection of packages. However, most
programs written in R are essentially ephemeral, written for a single piece of data analysis.
The evolution of the S language is characterized by four books by John Chambers and coauthors.
For R, the basic reference is The New S Language: A Programming Environment for Data
Analysis and Graphics by Richard A. Becker, John M. Chambers and Allan R. Wilks. The new
features of the 1991 release of S are covered in Statistical Models in S edited by John M.
Chambers and Trevor J. Hastie. The formal methods and classes of the methods package are
based on those described in Programming with Data by John M. Chambers. See Appendix F
[References], page 99, for precise references.
There are now a number of books which describe how to use R for data analysis and statistics,
and documentation for S/S-Plus can typically be used with R, keeping the differences between
the S implementations in mind.
Our introduction to the R environment did not mention statistics, yet many people use R as a statistics system. We prefer to think of it as an environment within which many classical and
modern statistical techniques have been implemented. A few of these are built into the base R
environment, but many are supplied as packages. There are about 25 packages supplied with R
(called “standard” and “recommended” packages) and many more are available through the
CRAN family of Internet sites (via https://CRAN.R-project.org) and elsewhere.
There is an important difference in philosophy between S (and hence R) and the other main
statistical systems. In S a statistical analysis is normally done as a series of steps, with
intermediate results being stored in objects. Thus whereas SAS and SPSS will give copious
output from a regression or discriminant analysis, R will give minimal output and store the
results in a fit object for subsequent interrogation by further R functions.
The most convenient way to use R is at a graphics workstation running a windowing system.
This guide is aimed at users who have this facility. In particular we will occasionally refer to the
use of R on an X window system although the vast bulk of what is said applies generally to any
implementation of the R environment.
Most users will find it necessary to interact directly with the operating system on their computer
from time to time. In this guide, we mainly discuss interaction with the operating system on
UNIX machines. If you are running R under Windows or OS X you will need to make some
small adjustments
2.2.4 Features of R
As stated earlier, R is a programming language and software environment for statistical
analysis, graphics representation and reporting.
2.2.5 Example
print("Hello World")
print(23.9 + 11.6)
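Running these two statements in an R console should produce output along these lines:
[1] "Hello World"
[1] 35.5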
2.3 Environment Setup
2.3.1 Local Environment Setup
If you are still willing to set up your environment for R, you can follow the steps given below.
You can download the Windows installer version of R from R-3.2.2 for Windows (32/64
bit) and save it in a local directory.
After installation you can locate the icon to run the Program in a directory structure
"R\R3.2.2\bin\i386\Rgui.exe" under the Windows Program Files. Clicking this icon
brings up the R-GUI which is the R console to do R Programming.
$ R
This will launch the R interpreter and you will get a prompt > where you can start typing your
program as follows −
Here the first statement defines a string variable myString, to which we assign the string "Hello, World!", and then the next statement uses print() to print the value stored in the variable myString.
myString <- "Hello, World!"
print ( myString)
Save the above code in a file test.R and execute it at the Linux command prompt as given below. Even if you are using Windows or another system, the syntax will remain the same.
$ Rscript test.R
2.4.1.2 Comments
Comments are like helping text in your R program and they are ignored by the interpreter while
executing your actual program. A single comment is written using # at the beginning of the statement as follows −
# My first program in R Programming
R does not support multi-line comments, but you can perform a trick which is something as follows −
if(FALSE) {
   "This is a demo for multi-line comments and it should be put inside either a single
    or double quote"
}

myString <- "Hello, World!"
print ( myString)
Generally, while doing programming in any programming language, you need to use various
variables to store various information. Variables are nothing but reserved memory locations to
store values. This means that, when you create a variable you reserve some space in memory.
In contrast to other programming languages like C and Java, in R the variables are not declared
as some data type. The variables are assigned with R-Objects and the data type of the R-object
becomes the data type of the variable. There are many types of R-objects.
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
The simplest of these objects is the vector object and there are six data types of these atomic
vectors, also termed as six classes of vectors. The other R-Objects are built upon the atomic
vectors.
v <- TRUE
print(class(v))
Logical TRUE, FALSE it produces the following result −
[1] "logical"
v <- 23.5
print(class(v))
Numeric 12.3, 5, 999 it produces the following result −
[1] "numeric"
v <- 2L
print(class(v))
Integer 2L, 34L, 0L it produces the following result −
[1] "integer"
v <- 2+5i
print(class(v))
Complex 3 + 2i it produces the following result −
[1] "complex"
v <- "TRUE"
print(class(v))
Character 'a', "good", "TRUE", '23.4' it produces the following result −
[1] "character"
v <- charToRaw("Hello")
print(class(v))
Raw "Hello" is stored as 48 65 6c 6c 6f it produces the following result −
[1] "raw"
In R programming, the very basic data types are the R-objects called vectors which hold
elements of different classes as shown above. Please note in R the number of classes is not
confined to only the above six types. For example, we can use many atomic vectors and create an
array whose class will become array.
2.4.2.1 Vectors
When you want to create a vector with more than one element, you should use the c() function, which means to combine the elements into a vector.
# Create a vector.
apple<- c('red','green',"yellow")
print(apple)
print(class(apple))
[1] "character"
2.4.2.2Lists
A list is an R-object which can contain many different types of elements inside it like vectors,
functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)

# Print the list.
print(list1)
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x) .Primitive("sin")
2.4.2.3Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the
matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'),nrow=2,ncol=3,byrow= TRUE)
print(M)
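When executed, the print(M) call above should display the 2x3 matrix filled row by row:
     [,1] [,2] [,3]
[1,] "a"  "a"  "b"
[2,] "c"  "b"  "a"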
2.4.2.4 Arrays
While matrices are confined to two dimensions, arrays can be of any number of dimensions.
The array function takes a dim attribute which creates the required number of dimensions. In the
below example we create an array with two elements which are 3x3 matrices each.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
,,1
[,1] [,2] [,3]
[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"
,,2
2.4.2.5Factors
Factors are the r-objects which are created using a vector. It stores the vector along with the
distinct values of the elements in the vector as labels. The labels are always character
irrespective of whether it is numeric or character or Boolean etc. in the input vector. They are
useful in statistical modeling.
Factors are created using the factor() function. The nlevels() function gives the count of levels.
# Create a vector.
apple_colors<- c('green','green','yellow','red','red','red','green')

# Create a factor object.
factor_apple<- factor(apple_colors)

# Print the factor and the number of levels.
print(factor_apple)
print(nlevels(factor_apple))
2.4.2.6 Data Frames
Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain different modes of data. The first column can be numeric while the second column can be character and the third column can be logical. It is a list of vectors of equal length.
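For instance, a data frame like the one whose rows are shown below could be built as follows (the object name BMI and the column names are assumed for illustration):
# Create a data frame; each column holds a different mode of data.
BMI <- data.frame(
   gender = c("Male", "Male", "Female"),
   height = c(152, 171.5, 165),
   weight = c(81, 93, 78),
   age    = c(42, 38, 26)
)
print(BMI)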
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
2.4.3 R – Variables
A variable provides us with named storage that our programs can manipulate. A variable in R
can store an atomic vector, group of atomic vectors or a combination of many Robjects. A valid
variable name consists of letters, numbers and the dot or underline characters. The variable
name starts with a letter or the dot not followed by a number.
var_name%              Invalid   Has the character '%'. Only dot(.) and underscore are allowed.
.var_name, var.name    Valid     Can start with a dot(.) but the dot(.) should not be followed by a number.
The variables can be assigned values using the leftward, rightward and equal-to operators. The values of the variables can be printed using the print() or cat() function. The cat() function combines multiple items into a continuous print output.
var.1 <- c(0,1,2,3)    # assign a vector using the leftward operator
print(var.1)
cat ("var.1 is ", var.1 ,"\n")
In R, a variable itself is not declared as any data type; rather it gets the data type of the R-object assigned to it. So R is called a dynamically typed language, which means that we can change the data type of the same variable again and again when using it in a program.
var_x<- "Hello"
cat("The class of var_x is ",class(var_x),"\n")
var_x<- 34.5
cat(" Now the class of var_x is ",class(var_x),"\n")
var_x<- 27L
cat(" Next the class of var_x becomes ",class(var_x),"\n")
2.4.4 R – Operators
An operator is a symbol that tells the compiler to perform specific mathematical or logical
manipulations. The R language is rich in built-in operators and provides the following types of
operators.
Arithmetic Operators
Relational Operators
Logical Operators
Assignment Operators
Miscellaneous Operators
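A brief sketch of how a few of these operator types behave on vectors (the values are chosen only for illustration):
v <- c(2, 4, 6)
t <- c(1, 4, 3)
print(v + t)              # arithmetic: 3 8 9
print(v > t)              # relational: TRUE FALSE TRUE
print((v > 2) & (t < 4))  # logical, element-wise AND: FALSE FALSE TRUE
x <- 5                    # assignment using the leftward operator
print(2 %in% v)           # miscellaneous: %in% checks membership, here TRUE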
2.4.4.2 R Functions
A function is a set of statements organized together to perform a specific task. R has a large
number of in-built functions and the user can create their own functions.
In R, a function is an object so the R interpreter is able to pass control to the function, along
with arguments that may be necessary for the function to accomplish the actions.
The different parts of a function are the function name, the arguments, the function body and the return value.
2.4.4.3Calling a Function
# Create a function to print squares of numbers in sequence.
new.function<- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}

# Call the function supplying 6 as an argument.
new.function(6)
2.4.5 Strings
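Strings in R are character values written inside single or double quotes. A minimal sketch of some common base R string functions (paste(), nchar(), toupper() and substring()):
greeting <- paste("Hello", "World", sep = " ")
print(greeting)                    # "Hello World"
print(nchar(greeting))             # number of characters: 11
print(toupper(greeting))           # "HELLO WORLD"
print(substring(greeting, 1, 5))   # "Hello"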
Lists are the R objects which contain elements of different types like numbers, strings, vectors and another list inside them. A list can also contain a matrix or a function as its elements. A list is created using the list() function.
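The output shown below is consistent with a list built along these lines (the variable name list_data is assumed; the element values are taken from the printed result):
# Create a list containing strings, a numeric vector, a logical value and numbers.
list_data <- list("Red", "Green", c(21, 32, 11), TRUE, 51.23, 119.1)
print(list_data)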
[[1]]
[1] "Red"
[[2]]
[1] "Green"
[[3]]
[1] 21 32 11
[[4]]
[1] TRUE
[[5]]
[1] 51.23
[[6]]
[1] 119.1
2.5 R Packages
R packages are a collection of R functions, compiled code and sample data. They are stored
under a directory called "library" in the R environment. By default, R installs a set of packages
during installation.
More packages are added later, when they are needed for some specific purpose. When we start
the R console, only the default packages are available by default. Other packages which are
already installed have to be loaded explicitly to be used by the R program that is going to use
them.
Get the library locations containing R packages - .libPaths(); check the list of all installed packages - library()
There are two ways to add new R packages. One is installing directly from the CRAN directory
and another is downloading the package to your local system and installing it manually.
install.packages("Package Name")
Data Reshaping in R is about changing the way data is organized into rows and columns. Most of
the time data processing in R is done by taking the input data as a data frame. It is easy to extract
data from the rows and columns of a data frame but there are situations when we need the data
frame in a format that is different from format in which we received it. R has many functions to
split, merge and change the rows to columns and vice-versa in a data frame.
# Create the first data frame (the values here are only an example).
address <- data.frame(city = "Denver", state = "CO", zipcode = "80230", stringsAsFactors = FALSE)
# Print a header.
cat("# # # # The First data frame\n")
print(address)

# Create the second data frame (example values).
new.address <- data.frame(city = "Charlotte", state = "FL", zipcode = "33949", stringsAsFactors = FALSE)
# Print a header.
cat("# # # The Second data frame\n")
print(new.address)

# Combine the rows of both data frames with rbind().
# Print a header.
cat("# # # The combined data frame\n")
print(rbind(address, new.address))
6 Charlotte FL 33949
data<- read.csv("input.csv")
print(data)
data<- read.csv("input.csv")
print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
print(data)
# Create a connection object to read the file in binary mode using "rb".
read.filename<- file("/web/com/binmtcars.dat", "rb")
# Next read the column values. n = 18 as we have 3 column names and 15 values.
read.filename<- file("/web/com/binmtcars.dat", "rb")
bindata<- readBin(read.filename, integer(), n = 18)
# Read the 4th to 8th values, which represent "cyl".
cyldata = bindata[4:8]
print(cyldata)
# Read the 9th to 13th values, which represent "am".
amdata = bindata[9:13]
print(amdata)

# Read the 14th to 18th values, which represent "gear".
geardata = bindata[14:18]
print(geardata)
When we execute the above code, it produces the following result −
[1] 7108963 1728081249 7496037 6 6 4
[7] 6 8 1 1 1 0
[13] 0 4 4 4 3 3
[1] 6 6 4 6 8
[1] 1 1 1 0 0
[1] 4 4 4 3 3
cyl am gear
[1,] 6 1 4
[2,] 6 1 4
[3,] 4 1 4
[4,] 6 0 3
[5,] 8 0 3
XML is a file format which shares both the file format and the data on the World Wide Web,
intranets, and elsewhere using standard ASCII text. It stands for Extensible Markup Language
(XML). Similar to HTML it contains markup tags. But unlike HTML where the markup tag
describes structure of the page, in XML the markup tags describe the meaning of the data contained in the file. You can read an XML file in R using the "XML" package. This package can be installed using the following command.
install.packages("XML")
<RECORDS>
<EMPLOYEE>
<ID>1</ID>
<NAME>Rick</NAME>
<SALARY>623.3</SALARY>
<STARTDATE>1/1/2012</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>2</ID>
<NAME>Dan</NAME>
<SALARY>515.2</SALARY>
<STARTDATE>9/23/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>3</ID>
<NAME>Michelle</NAME>
<SALARY>611</SALARY>
<STARTDATE>11/15/2014</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>4</ID>
<NAME>Ryan</NAME>
<SALARY>729</SALARY>
<STARTDATE>5/11/2014</STARTDATE>
<DEPT>HR</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>5</ID>
<NAME>Gary</NAME>
<SALARY>843.25</SALARY>
<STARTDATE>3/27/2015</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>6</ID>
<NAME>Nina</NAME>
<SALARY>578</SALARY>
<STARTDATE>5/21/2013</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>7</ID>
<NAME>Simon</NAME>
<SALARY>632.8</SALARY>
<STARTDATE>7/30/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>8</ID>
<NAME>Guru</NAME>
<SALARY>722.5</SALARY>
<STARTDATE>6/17/2014</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
</RECORDS>
2.6.5.2 Reading XML File
The XML file is read by R using the function xmlParse(). The result is stored as a list in R.
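A minimal sketch of the parsing step, assuming the records above are saved as "input.xml" in the working directory:
# Load the packages required to read XML files.
library("XML")
library("methods")
# Give the input file name to the xmlParse() function.
result <- xmlParse(file = "input.xml")
# Print the parsed result.
print(result)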
1
Rick
623.3
1/1/2012
IT
2
Dan
515.2
9/23/2013
Operations
3
Michelle
611
11/15/2014
IT
4
Ryan
729
5/11/2014
HR
5
Gary
843.25
3/27/2015
Finance
6
Nina
578
5/21/2013
IT
7
Simon
632.8
7/30/2013
Operations
8
Guru
722.5
6/17/2014
Finance
df.car_spec_data$year<- as.character(df.car_spec_data$year)
For our purposes here, data exploration is the application of data visualization and data
manipulation techniques to understand the properties of our dataset.
###########################################
# PLOT DATA (Preliminary Data Inspection) #
###########################################
#-------------------------
# Horsepower vs. Top Speed
#-------------------------
ggplot(data=df.car_spec_data, aes(x=horsepower_bhp, y=top_speed_mph)) +
geom_point(alpha=.4, size=4, color="#880011") +
ggtitle("Horsepower vs. Top Speed") +
labs(x="Horsepower, bhp", y="Top Speed,\n mph") +theme.car_chart_SCATTER
2.8 Modeling: Architecture
Data modeling is a useful technique to manage a workflow for various entities and for
making a sequential workflow in order to have a successful completion of a task.
For Hadoop and its Big Data model, we need to carry out a comprehensive study before implementing any execution task and setting up any processing environment.
Hadoop is mainly a collection of tools and techniques rather than a single technology, so at each point of time we need a task execution environment and some projection plans as well.
The data modeling and logical workflow consist of an abstract layer that is used in the management of data storage when the data is stored on physical drives in the Hadoop distributed file system.
Because of the huge expansion of data in terms of big data, we need to have a multi-distributed and logically managed system. The data modeling also helps us in managing various data resources and creates a basic layered data architecture in order to optimize data reuse and reduce execution failures.
Apache Oozie is a built-in workflow tool for managing MapReduce jobs and making sure they are synchronized, in order to maintain equilibrium amongst the tasks that are assigned by JobTrackers to TaskTrackers.
However, we still need a modeling scheme to manage and maintain a workflow of
hadoop framework and for this we need a hybrid model for more flexibility.
In spite of many NoSQL databases that are used to resolve the problem of data
management for schema on read and schema on write, we still need a hybrid model in
order to improve the overall performance for SQL and NoSQL databases.
As big data is changing a lot in terms of its execution approach, we need to have a new
data and storage model (two separate models). We can fix these problems by using the
data migration technique in order to migrate big data (raw and unstructured data) into
NoSQL data.
On top of the physical data model, we normally need to create data flow and compute tasks for business requirements. With physical data modeling, it is possible to create a computing model, which can present the logic path of computing data. This will help
computing tasks be well designed and enable more efficient data reuse.
Hadoop provides a new distributed data processing model, and its HBase database provides an impressive solution for data replication, backup, scalability, and so on.
Hadoop also provides the Map/Reduce computing framework to retrieve value from data stored in a distributed system. Map/Reduce is a framework for parallel processing, using mappers to divide a problem into smaller sub-problems that feed reducers, which process the sub-problems and produce the final answer.
UNIT 4
Technology and Tools: MapReduce/Hadoop – NoSQL: Cassandra, HBase – Apache Mahout –
Tools.
Course Objectives:
To understand the technology of MapReduce concepts
To understand the importance of modeling architecture
Course Outcomes:
The students can use the technology of MapReduce in big data
The students will be able to develop and manage applications using NoSQL and
Cassandra
They will learn how to develop various applications using these technologies
3.1 HADOOP
Doug Cutting, Mike Cafarella and team took the solution provided by Google and started
an Open Source Project called HADOOP in 2005 and Doug named it after his son's toy elephant.
Now Apache Hadoop is a registered trademark of the Apache Software Foundation.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in
parallel on different CPU nodes. In short, the Hadoop framework is capable of developing applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data.
The Hadoop framework includes four components: Hadoop Common, the Hadoop Distributed File System (HDFS), YARN, and MapReduce.
3.1.2.2 Stage 2
The Hadoop job client then submits the job (jar/executable etc) and configuration to the
JobTracker which then assumes the responsibility of distributing the software/configuration to
the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to
the job-client.
3.1.2.3 Stage 3
The TaskTrackers on different nodes execute the task as per MapReduce implementation
and output of the reduce function is stored into the output files on the file system.
3.1.3Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability
(FTHA), rather Hadoop library itself has been designed to detect and handle failures at
the application layer.
Servers can be added or removed from the cluster dynamically and Hadoop continues to
operate without interruption.
Another big advantage of Hadoop is that apart from being open source, it is compatible with all platforms since it is Java based.
The term MapReduce actually refers to the following two different tasks that Hadoop
programs perform:
The Map Task: This is the first task, which takes input data and converts it into a set of
data, where individual elements are broken down into tuples (key/value pairs).
The Reduce Task: This task takes the output from a map task as input and combines those
data tuples into a smaller set of tuples. The reduce task is always performed after the map
task.
Typically both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for resource management, tracking resource consumption/availability, and scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slave TaskTrackers execute the tasks as directed by the master and provide task-status information to the master periodically.
The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if the JobTracker goes down, all running jobs are halted.
3.2.2 What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing primitives are
called mappers and reducers. Decomposing a data processing application into mappers and
reducers is sometimes nontrivial. But, once we write an application in the MapReduce form,
scaling the application to run over hundreds, thousands, or even tens of thousands of machines in
a cluster is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
Google solved the bottleneck of processing huge datasets on a single machine with an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected at one place and integrated to form the result dataset.
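To make the map and reduce tasks concrete, the following is a minimal word-count sketch written against the standard Hadoop MapReduce Java API. It is illustrative only: the class names are not taken from these notes, and the job still needs a driver program (a sketch of one is given later in this unit) before it can run.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: break each input line into words and emit (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }

    // Reduce task: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));   // (word, total count)
        }
    }
}

The phases that a MapReduce job such as this passes through are listed below.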
Input Phase − Here we have a Record Reader that translates each record in an input file
and sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the
map phase into identifiable sets. It takes the intermediate keys from the mapper as input
and applies a user-defined code to aggregate the values in a small scope of one mapper.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads
the grouped key-value pairs onto the local machine, where the Reducer is running. The
individual key-value pairs are sorted by key into a larger data list. The data list groups
the equivalent keys together so that their values can be iterated easily in the Reducer
task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered, and
combined in a number of ways, and it requires a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output formatter that translates the final
key-value pairs from the Reducer function and writes them onto a file using a record
writer.
(The small diagram illustrating the two tasks, Map and Reduce, is not reproduced here.)
3.2.2.3 MapReduce-Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets per day, which is nearly 3,000 tweets per second. The following steps show how Twitter manages its tweets with the help of MapReduce. The MapReduce algorithm performs the following actions −
Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value
pairs.
Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as
key-value pairs.
Count − Generates a token counter per word.
Aggregate Counters − Prepares an aggregate of similar counter values into small
manageable units.
3.2.3 The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
o Map stage: The map or mapper's job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
o Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers
in the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks, which reduces the network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
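The stages described above are wired together in a driver program that configures and submits the job. The sketch below uses the standard Hadoop Job API; the class names and the input/output paths (args[0] and args[1]) are illustrative assumptions, and the mapper and reducer refer to the word-count sketch given earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map stage, optional local combine, and reduce stage.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input data is read from HDFS and the results are written back to HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}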
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The map task is done by means of Mapper Class
The reduce task is done by means of Reducer Class.
Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class
is used as input by Reducer class, which in turn searches matching pairs and reduces them.
3.2.3.1 Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce
implements sorting algorithm to automatically sort the output key-value pairs from the mapper
by their keys.
Sorting methods are implemented in the mapper class itself.
In the Shuffle and Sort phase, after tokenizing the values in the mapper class,
the Context class (user-defined class) collects the matching valued keys as a
collection.
To collect similar key-value pairs (intermediate keys), the Mapper class takes the help of the RawComparator class to sort the key-value pairs (a comparator sketch is given after this list).
The set of intermediate key-value pairs for a given Reducer is automatically
sorted by Hadoop to form key-values (K2, {V2, V2, …}) before they are
presented to the Reducer.
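A hedged sketch of plugging a custom ordering into the Shuffle and Sort phase through Hadoop's WritableComparator is shown below; the descending order and the class name are illustrative choices, not something prescribed by these notes.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DescendingTextComparator extends WritableComparator {

    public DescendingTextComparator() {
        super(Text.class, true);   // instantiate keys so they can be compared
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        // Reverse the natural ordering so keys reach the Reducer in descending order.
        return -super.compare(a, b);
    }
}

The comparator would be registered on the job in the driver with job.setSortComparatorClass(DescendingTextComparator.class).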
3.2.3.2 Searching
Searching plays an important role in the MapReduce algorithm. It helps in the combiner phase (optional) and in the Reducer phase. Let us try to understand how searching works with the help of an example.
Example
The following example shows how MapReduce employs the searching algorithm to find the details of the employee who draws the highest salary in a given employee dataset.
Let us assume we have employee data in four different files: A, B, C, and D. Let us also assume there are duplicate employee records in all four files because the employee data was imported repeatedly from all the database tables.
The Map phase processes each input file and provides the employee data in key-value pairs (<k, v> : <employee name, salary>).
The combiner phase (searching technique) will accept the input from the Map phase as a key-value pair with employee name and salary. Using the searching technique, the combiner will check all the employee salaries to find the highest-salaried employee in each file, as in the following snippet.
<k: employee name, v: salary>
max = salary of the first employee   (treat the first salary as the current maximum)
if (v(employee).salary > max) {
    max = v(employee).salary;
} else {
    continue checking;
}
The expected result is the highest-salaried employee from each file.
Reducer phase − From each file, you will find the highest-salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used between the four <k, v> pairs coming from the four input files.
The final output should be as follows −<gopal, 50000>
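A hedged sketch of how this searching step could be written as Hadoop Java code is given below. The reducer keeps a running maximum and emits a single (name, salary) pair in cleanup(); because its input and output types match the mapper's output types, the same class could also be registered as the combiner. The class name and the value type are illustrative assumptions, not taken from these notes.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxSalaryReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    private String maxName = null;
    private long maxSalary = Long.MIN_VALUE;

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Keep track of the highest salary seen across all (name, salary) pairs.
        for (LongWritable salary : values) {
            if (salary.get() > maxSalary) {
                maxSalary = salary.get();
                maxName = key.toString();
            }
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit a single pair for the highest-paid employee seen by this reducer.
        if (maxName != null) {
            context.write(new Text(maxName), new LongWritable(maxSalary));
        }
    }
}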
3.2.3.3 Indexing
Normally, indexing is used to point to particular data and its address. MapReduce performs batch indexing on the input files for a particular Mapper.
The indexing technique that is normally used in MapReduce is known as an inverted index. Search engines like Google and Bing use the inverted indexing technique. Let us try to understand how indexing works with the help of a simple example.
Example
The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names and their contents are given in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
After applying the Indexing algorithm, we get the following output −
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2}
implies the term "is" appears in the files T[0], T[1], and T[2].
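As a rough sketch of how the inverted index above could be produced with MapReduce: the mapper emits (term, file name) pairs and the reducer collects the distinct file names for each term. The use of FileSplit to recover the current file name, and the class names, are assumptions made for illustration.

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The document name becomes the value for every term in the line.
            String doc = ((FileSplit) context.getInputSplit()).getPath().getName();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                context.write(new Text(itr.nextToken()), new Text(doc));
            }
        }
    }

    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text term, Iterable<Text> docs, Context context)
                throws IOException, InterruptedException {
            Set<String> postings = new LinkedHashSet<>();   // de-duplicate documents
            for (Text doc : docs) {
                postings.add(doc.toString());
            }
            context.write(term, new Text(postings.toString()));
        }
    }
}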
3.2.4 TF-IDF
TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse
Document Frequency. It is one of the common web analysis algorithms. Here, the term
'frequency' refers to the number of times a term appears in a document.
3.2.4.1 Term Frequency (TF)
It measures how frequently a particular term occurs in a document. It is calculated by the
number of times a word appears in a document divided by the total number of words in that
document.
TF(the) = (Number of times the term 'the' appears in a document) / (Total number of terms in the document)
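As a worked example with illustrative numbers: if a document contains 100 words and the term 'data' appears 5 times, then TF(data) = 5 / 100 = 0.05.
The second half of the algorithm, Inverse Document Frequency (IDF), measures how rare a term is across the whole collection of documents:
IDF(t) = log(Total number of documents / Number of documents containing the term t)
The TF-IDF weight of a term in a document is simply TF(t) × IDF(t), so very common words such as 'the' receive a low weight even though their raw frequency within a single document is high.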
3.2.5 Terminology
PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
Mapper - Mapper maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode - Node that manages the Hadoop Distributed File System (HDFS).
DataNode - Node where data is presented in advance before any processing takes place.
MasterNode - Node where the JobTracker runs and which accepts job requests from clients.
SlaveNode - Node where the Map and Reduce programs run.
JobTracker - Schedules jobs and tracks the assigned jobs to the Task Tracker.
Task Tracker - Tracks the tasks and reports status to the JobTracker.
Job - An execution of a Mapper and a Reducer across a dataset.
Task - An execution of a Mapper or a Reducer on a slice of data.
Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.
Example Scenario
Given below is the data regarding the electrical consumption of an organization. It
contains the monthly electrical consumption and the annual average for various years.
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
If the above data is given as input, we have to write applications to process it and produce results such as finding the year of maximum usage, the year of minimum usage, and so on. This is a walkover for programmers with a finite number of records: they will simply write the logic to produce the required output and pass the data to the application. But think of the data representing the electrical consumption of all the large-scale industries of a particular state since its formation. When we write applications to process such bulk data,
they will take a lot of time to execute, and
there will be heavy network traffic when we move data from the source to the network server, and so on.
To solve these problems, we have the MapReduce framework.
3.3 NoSQL
3.3.1 NoSQL Database
A NoSQL database (sometimes referred to as "Not Only SQL") is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are schema-free, support easy replication, have simple APIs, are eventually consistent, and can handle huge amounts of data.
Besides Cassandra, we have the following NoSQL databases that are quite popular:
Apache HBase - HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as a part of the Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for Hadoop.
MongoDB - MongoDB is a cross-platform, document-oriented database system that avoids the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas, making the integration of data in certain types of applications easier and faster.
3.4 Cassandra
3.4.1 What is Apache Cassandra?
Apache Cassandra is an open source, distributed, decentralized storage system (database) for managing very large amounts of structured data spread out across the world. It provides a highly available service with no single point of failure.
Listed below are some of the notable points of Apache Cassandra:
It is scalable, fault-tolerant, and consistent.
It is a column-oriented database.
Its distribution design is based on Amazon's Dynamo and its data model on Google's Bigtable.
Created at Facebook, it differs sharply from relational database management systems.
Cassandra implements a Dynamo-style replication model with no single point of failure,
but adds a more powerful “column family” data model.
Cassandra is being used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.
The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure. Cassandra has a peer-to-peer distributed system across its nodes, and data is distributed among all the nodes in a cluster.
All the nodes in a cluster play the same role. Each node is independent and at the same
time interconnected to other nodes.
Each node in a cluster can accept read and write requests, regardless of where the data is
actually located in the cluster.
When a node goes down, read/write requests can be served from other nodes in the
network.
3.4.4 Data Replication in Cassandra
In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of
data. If it is detected that some of the nodes responded with an out-of-date value, Cassandra will
return the most recent value to the client. After returning the most recent value, Cassandra
performs a read repair in the background to update the stale values.
In this way, Cassandra uses data replication among the nodes in a cluster to ensure that there is no single point of failure.
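The replication factor is set per keyspace when the keyspace is created. The sketch below uses the DataStax Java driver (3.x-style API) together with CQL; the contact point, the keyspace name, and the choice of SimpleStrategy with a replication factor of 3 are illustrative assumptions, not something prescribed by these notes.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraReplicationDemo {
    public static void main(String[] args) {
        // Any node in the cluster can be used as the contact point.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Every row written to this keyspace is stored on 3 replica nodes, so
        // reads and writes can be served even when individual nodes go down.
        session.execute(
            "CREATE KEYSPACE IF NOT EXISTS demo "
            + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};");

        session.close();
        cluster.close();
    }
}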
Note − Cassandra uses the Gossip Protocol in the background to allow the nodes to
communicate with each other and detect any faulty nodes in the cluster.
Components of Cassandra
The key components of Cassandra are as follows −
Node − It is the place where data is stored.
Data center − It is a collection of related nodes.
Cluster − A cluster is a component that contains one or more data centers.
Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write
operation is written to the commit log.
Mem-table − A mem-table is a memory-resident data structure. After commit log, the
data will be written to the mem-table. Sometimes, for a single-column family, there will
be multiple mem-tables.
SSTable − It is a disk file to which the data is flushed from the mem-table when its
contents reach a threshold value.
Bloom filter − These are nothing but quick, nondeterministic, algorithms for testing
whether an element is a member of a set. It is a special kind of cache. Bloom filters are
accessed after every query.
3.5 HBASE
3.5.1 What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system.
It is an open-source project and is horizontally scalable.
HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to
data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File
System and provides read and write access.
HDFS provides only sequential access to data. HBase, by contrast, internally uses hash tables to provide random access, and it stores the data in indexed HDFS files for faster lookups.
In an HBase table, column values are stored contiguously on the disk, and each cell value of the table has a timestamp. In short, in an HBase:
Table is a collection of rows.
Row is a collection of column families.
Column family is a collection of columns.
Column is a collection of key value pairs.
Given below is an example schema of a table in HBase.
Rowid | Column Family    | Column Family    | Column Family    | Column Family
      | col1 col2 col3   | col1 col2 col3   | col1 col2 col3   | col1 col2 col3
1     |                  |                  |                  |
2     |                  |                  |                  |
3     |                  |                  |                  |
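A hedged sketch of the random read/write access described above, using the HBase Java client API (1.x-style), is shown below. The table name ('employee'), the 'personal' column family, the row key, and the values are illustrative and assume the table has already been created.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // Random write: one cell in the "personal" column family of row "1".
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"),
                          Bytes.toBytes("gopal"));
            table.put(put);

            // Random read: fetch the same row directly by its row key.
            Result result = table.get(new Get(Bytes.toBytes("1")));
            byte[] name = result.getValue(Bytes.toBytes("personal"),
                                          Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}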
3.5.10 Architecture
In HBase, tables are split into regions and are served by the region servers. Regions are
vertically divided by column families into "Stores". Stores are saved as files in HDFS.
Note: The term 'store' is used for regions to explain the storage structure.
HBase has three major components: the client library, a master server, and region
servers. Region servers can be added or removed as per requirement.
3.5.10.1 MasterServer
The master server -
Assigns regions to the region servers and takes the help of Apache ZooKeeper for this
task.
Handles load balancing of the regions across region servers. It unloads the busy servers
and shifts the regions to less occupied servers.
Maintains the state of the cluster by negotiating the load balancing.
Is responsible for schema changes and other metadata operations such as creation of
tables and column families.
3.5.10.2 Regions
Regions are nothing but tables that are split up and spread across the region servers.
3.5.10.3 Region server
The region servers have regions that -
Communicate with the client and handle data-related operations.
Handle read and write requests for all the regions under it.
Decide the size of the region by following the region size thresholds.
When we take a deeper look into a region server, it contains regions and stores, as described below.
The store contains the memory store (MemStore) and HFiles. The MemStore is just like a cache memory: anything that is entered into HBase is stored here initially. Later, the data is transferred and saved in HFiles as blocks, and the MemStore is flushed.
3.5.10.4 Zookeeper
Zookeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization, etc.
Zookeeper has ephemeral nodes representing different region servers. Master servers use
these nodes to discover available servers.
In addition to availability, the nodes are also used to track server failures or network
partitions.
Clients communicate with region servers via zookeeper.
In pseudo and standalone modes, HBase itself will take care of zookeeper.
3.6 APACHE MAHOUT
Mahout offers the coder a ready-to-use framework for doing data mining tasks on large volumes of data.
Mahout lets applications analyze large sets of data effectively and quickly.
Includes several MapReduce enabled clustering implementations such as k-means, fuzzy
k-means, Canopy, Dirichlet, and Mean-Shift.
Supports Distributed Naive Bayes and Complementary Naive Bayes classification
implementations.
Comes with distributed fitness function capabilities for evolutionary programming.
Includes matrix and vector libraries.
3.6.3 Applications of Mahout
Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use
Mahout internally.
Foursquare helps you in finding out places, food, and entertainment available in a
particular area. It uses the recommender engine of Mahout.
Twitter uses Mahout for user interest modelling.
Yahoo! uses Mahout for pattern mining.
Apache Mahout is a highly scalable machine learning library that enables developers to use
optimized algorithms. Mahout implements popular machine learning techniques such as
recommendation, classification, and clustering. Therefore, it is prudent to have a brief section
on machine learning before we move further.
3.6.4 Machine Learning
Some of the application areas of machine learning are listed below:
Vision processing
Language processing
Forecasting (e.g., stock market trends)
Pattern recognition
Games
Data mining
Expert systems
Robotics
There are several ways to implement machine learning techniques; however, the most commonly used ones are supervised and unsupervised learning.
3.6.4.1 Supervised Learning
Supervised learning deals with learning a function from available training data. A supervised
learning algorithm analyzes the training data and produces an inferred function, which can be
used for mapping new examples. Common examples of supervised learning include:
classifying e-mails as spam,
labeling webpages based on their content, and
voice recognition.
There are many supervised learning algorithms such as neural networks, Support Vector
Machines (SVMs), and Naive Bayes classifiers. Mahout implements Naive Bayes classifier.
3.6.4.2 Unsupervised Learning
Unsupervised learning makes sense of unlabeled data without having any predefined dataset for its training. Unsupervised learning is an extremely powerful tool for analyzing available data and looking for patterns and trends. It is most commonly used for clustering similar input into logical groups. Common approaches to unsupervised learning include:
k-means
self-organizing maps, and
hierarchical clustering
3.6.4.3 Recommendation
Recommendation is a popular technique that provides close recommendations based on user
information such as previous purchases, clicks, and ratings.
Amazon uses this technique to display a list of recommended items that you might be
interested in, drawing information from your past actions. There are recommender
engines that work behind Amazon to capture user behavior and recommend selected
items based on your earlier actions.
Facebook uses the recommender technique to identify and recommend the “people you
may know list”.
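A minimal sketch of a user-based recommender built with Mahout's Taste API is given below. The ratings file name, the neighbourhood size of 10, and the Pearson correlation similarity are illustrative choices; each line of the input file is assumed to be of the form userID,itemID,preference.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1, based on what similar users liked.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}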
3.6.4.4 Classification
Classification, also known as categorization, is a machine learning technique that uses
known data to determine how the new data should be classified into a set of existing categories.
Classification is a form of supervised learning.
Mail service providers such as Yahoo! and Gmail use this technique to decide whether a new mail should be classified as spam. The categorization algorithm trains itself by analyzing user habits of marking certain mails as spam. Based on that, the classifier decides whether a future mail should be deposited in your inbox or in the spam folder.
iTunes application uses classification to prepare playlists.
3.6.4.5 Clustering
Clustering is used to form groups or clusters of similar data based on common
characteristics. Clustering is a form of unsupervised learning.
Search engines such as Google and Yahoo! use clustering techniques to group data with
similar characteristics.
Newsgroups use clustering techniques to group various articles based on related topics.
The clustering engine goes through the input data completely and, based on the characteristics of the data, decides under which cluster each item should be grouped. A minimal sketch of k-means clustering is given below.
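The sketch below is a deliberately tiny, plain-Java illustration of the k-means idea on one-dimensional data with k = 2; production engines such as Mahout's k-means follow the same assign-and-update loop, but over many dimensions and distributed data. The sample points and the fixed number of iterations are arbitrary.

import java.util.Arrays;

public class KMeansSketch {
    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 8.0, 8.5, 9.0};
        double[] centroids = {points[0], points[points.length - 1]}; // initial guesses
        int[] assignment = new int[points.length];

        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                assignment[i] =
                    Math.abs(points[i] - centroids[0]) <= Math.abs(points[i] - centroids[1])
                        ? 0 : 1;
            }
            // Update step: move each centroid to the mean of its cluster.
            double[] sum = new double[2];
            int[] count = new int[2];
            for (int i = 0; i < points.length; i++) {
                sum[assignment[i]] += points[i];
                count[assignment[i]]++;
            }
            for (int c = 0; c < 2; c++) {
                if (count[c] > 0) {
                    centroids[c] = sum[c] / count[c];
                }
            }
        }
        System.out.println("centroids = " + Arrays.toString(centroids));
        System.out.println("assignment = " + Arrays.toString(assignment));
    }
}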
The Goals of Clustering
So, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But
how to decide what constitutes a good clustering? It can be shown that there is no absolute “best”
criterion that would be independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs.
For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection).
Possible Applications
Clustering algorithms can be applied in many fields, for instance:
Marketing: finding groups of customers with similar behavior given a large database of
customer data containing their properties and past buying records;
Insurance: identifying groups of motor insurance policy holders with a high average
claim cost; identifying frauds;
City-planning: identifying groups of houses according to their house type, value and
geographical location;
UNIT 3
Big Data Security: Big Data Security, Compliance, Auditing and Protection: Pragmatic Steps to
Securing Big Data, Classifying Data, Protecting Big Data Analytics, Big Data and Compliance,
The Intellectual Property Challenge –Big Data in Cyber defense.
Course Objectives:
To understand the concepts of Big data security and classification
To understand the importance of intellectual property and cyber defense in Big data
Course Outcomes:
The students can use the tools of Big Data
The students can be able to provide security to Big Data
They will learn how to handle the security issues in Big data
The sheer size of a Big Data repository brings with it a major security challenge, generating the
age-old question presented to IT: How can the data be protected? However, that is a trick
question—the answer has many caveats, which dictate how security must be imagined as well as
deployed. Proper security entails more than just keeping the bad guys out; it also means backing
up data and protecting data from corruption.
The first caveat is access. Data can be easily protected, but only if you eliminate access to the
data. That’s not a pragmatic solution, to say the least. The key is to control access, but even then,
knowing the who, what, when, and where of data access is only a start.
The second caveat is availability: controlling where the data are stored and how the data are
distributed. The more control you have, the better you are positioned to protect the data.
The third caveat is performance. Higher levels of encryption, complex security methodologies,
and additional security layers can all improve security. However, these security techniques all
carry a processing burden that can severely affect performance.
The fourth caveat is liability. Accessible data carry with them liability, such as the sensitivity of
the data, the legal requirements connected to the data, privacy issues, and intellectual property
concerns.
Adequate security in the Big Data realm becomes a strategic balancing act among these caveats
along with any additional issues the caveats create. Nonetheless, effective security is an
obtainable, if not perfect, goal. With planning, logic, and observation, security becomes
manageable and omnipresent, effectively protecting data while still offering access to authorized
users and systems.
Securing the massive amounts of data that are inundating organizations can be addressed in
several ways. A starting point is to basically get rid of data that are no longer needed. If you do
not need certain information, it should be destroyed, because it represents a risk to the
organization. That risk grows every day for as long as the information is kept. Of course, there
are situations in which information cannot legally be destroyed; in that case, the information
should be securely archived by an offline method.
The real challenge may be determining whether the data are needed—a difficult task in the world
of Big Data, where value can be found in unexpected places. For example, getting rid of activity
logs may be a smart move from a security standpoint. After all, those seeking to compromise
networks may start by analyzing activity so they can come up with a way to monitor and
intercept traffic to break into a network. In a sense, those logs present a serious risk to an
organization, and to prevent the logs from being exposed, the best method may be to delete them
after their usefulness ends.
However, those logs could be used to determine scale, use, and efficiency of large data systems,
an analytical process that falls right under the umbrella of Big Data analytics. Here a catch-22 is
created: Logs are a risk, but analyzing those logs properly can mitigate risks as well. Should you
keep or dispose of the data in these cases?
There is no easy answer to that dilemma, and it becomes a case of choosing the lesser of two
evils. If the data have intrinsic value for analytics, they must be kept, but that does not mean they
need to be kept on a system that is connected to the Internet or other systems. The data can be
archived, retrieved for processing, and then returned to the archive.
Protecting data becomes much easier if the data are classified—that is, the data should be divided
into appropriate groupings for management purposes. A classification system does not have to be
very sophisticated or complicated to enable the security process, and it can be limited to a few
different groups or categories to keep things simple for processing and monitoring.
With data classification in mind, it is essential to realize that all data are not created equal. For
example, internal e-mails between two colleagues should not be secured or treated the same way as financial reports, human resources (HR) information, or customer data.
Understanding the classifications and the value of the data sets is not a one-task job; the life-
cycle management of data may need to be shared by several departments or teams in an
enterprise. For example, you may want to divide the responsibilities among technical, security,
and business organizations. Although it may sound complex, it really isn’t all that hard to
educate the various corporate shareholders to understand the value of data and where their
responsibilities lie.
Classification can become a powerful tool for determining the sensitivity of data. A simple
approach may just include classifications such as financial, HR, sales, inventory, and
communications, each of which is self-explanatory and offers insight into the sensitivity of the
data.
Once organizations better understand their data, they can take important steps to segregate the
information, which will make the deployment of security measures like encryption and
monitoring more manageable. The more data are placed into silos at higher levels, the easier it
becomes to protect and control them. Smaller sample sizes are easier to protect and can be
monitored separately for specific necessary controls.
It is sad to report that protecting data is an often forgotten inclination in the data center, an
afterthought that falls behind current needs. The launch of Big Data initiatives is no exception in
the data center, and protection is too often an afterthought. Big Data offers more of a challenge
than most other data center technologies, making it the perfect storm for a data protection
disaster.
The real cause of concern is the fact that Big Data contains all of the things you don’t want to see
when you are trying to protect data. Big Data can contain very unique sample sets—for example,
data from devices that monitor physical elements (e.g., traffic, movement, soil pH, rain, wind) on
a frequent schedule, surveillance cameras, or any other type of data that are accumulated
frequently and in real time. All of the data are unique to the moment, and if they are lost, they are
impossible to recreate.
That uniqueness also means you cannot leverage time-saving backup preparation and security
technologies, such as deduplication; this greatly increases the capacity requirements for backup
subsystems, slows down security scanning, makes it harder to detect data corruption, and
complicates archiving.
There is also the issue of the large size and number of files often found in Big Data analytic
environments. In order for a backup application and associated appliances or hardware to churn
through a large number of files, bandwidth to the backup systems and/or the backup appliance
must be large, and the receiving devices must be able to ingest data at the rate that the data can
be delivered, which means that significant CPU processing power is necessary to churn through
billions of files.
There is more to backup than just processing files. Big Data normally includes a database
component, which cannot be overlooked. Analytic information is often processed into an Oracle,
NoSQL, or Hadoop environment of some type, so real-time (or live) protection of that
environment may be required. A database component shifts the backup ideology from a massive
number of small files to be backed up to a small number of massive files to be backed up. That
changes the dynamics of how backups need to be processed.
Big Data often presents the worst-case scenario for most backup appliances, in which the
workload mix consists of billions of small files and a small number of large files. Finding a
backup solution that can ingest this mixed workload of data at full speed and that can scale to
massive capacities may be the biggest challenge in the Big Data backup market.
Compliance issues are becoming a big concern in the data center, and these issues have a major
effect on how Big Data is protected, stored, accessed, and archived. Whether Big Data is going
to reside in the data warehouse or in some other more scalable data store remains unresolved for
most of the industry; it is an evolving paradigm. However, one thing is certain: Big Data is not
easily handled by the relational databases that the typical database administrator is used to
working with in the traditional enterprise database server environment. This means it is harder to
understand how compliance affects the data.
Big Data is transforming the storage and access paradigms to an emerging new world of
horizontally scaling, unstructured databases, which are better at solving some old business
problems through analytics. More important, this new world of file types and data is prompting
analysis professionals to think of new problems to solve, some of which have never been
attempted before. With that in mind, it becomes easy to see that a rebalancing of the database
landscape is about to commence, and data architects will finally embrace the fact that relational
databases are no longer the only tool in the tool kit.
This has everything to do with compliance. New data types and methodologies are still expected
to meet the legislative requirements placed on businesses by compliance laws. There will be no
excuses accepted and no passes given if a new data methodology breaks the law.
Preventing compliance from becoming the next Big Data nightmare is going to be the job of
security professionals. They will have to ask themselves some important questions and take into
account the growing mass of data, which are becoming increasingly unstructured and are
accessed from a distributed cloud of users and applications looking to slice and dice them in a
million and one ways. How will security professionals be sure they are keeping tabs on the
regulated information in all that mix?
Many organizations still have to grasp the importance of such areas as payment card industry and
personal health information compliance and are failing to take the necessary steps because the
Big Data elements are moving through the enterprise with other basic data. The trend seems to
be that as businesses jump into Big Data, they forget to worry about very specific pieces of
information that may be mixed into their large data stores, exposing them to compliance issues.
Health care probably provides the best example for those charged with compliance as they
examine how Big Data creation, storage, and flow work in their organizations. The move to
electronic health record systems, driven by the Health Insurance Portability and Accountability
Act (HIPAA) and other legislation, is causing a dramatic increase in the accumulation, access,
and inter-enterprise exchange of personal identifying information. That has already created a Big
Data problem for the largest health care providers and payers, and it must be solved to maintain
compliance.
The concepts of Big Data are as applicable to health care as they are to other businesses. The
types of data are as varied and vast as the devices collecting the data, and while the concept of
collecting and analyzing the unstructured data is not new, recently developed technologies make
it quicker and easier than ever to store, analyze, and manipulate these massive data sets.
Health care deals with these massive data sets using Big Data stores, which can span tens of
thousands of computers to enable enterprises, researchers, and governments to develop
innovative products, make important discoveries, and generate new revenue streams. The rapid
evolution of Big Data has forced vendors and architects to focus primarily on the storage,
performance, and availability elements, while security—which is often thought to diminish
performance—has largely been an afterthought.
In the medical industry, the primary problem is that unsecured Big Data stores are filled with
content that is collected and analyzed in real time and is often extraordinarily sensitive:
intellectual property, personal identifying information, and other confidential information. The
disclosure of this type of data, by either attack or human error, can be devastating to a company
and its reputation.
However, because this unstructured Big Data doesn’t fit into traditional, structured, SQL-based
relational databases, NoSQL, a new type of data management approach, has evolved. These
nonrelational data stores can store, manage, and manipulate terabytes, petabytes, and even
exabytes of data in real time.
No longer scattered in multiple federated databases throughout the enterprise, Big Data
consolidates information in a single massive database stored in distributed clusters and can be
easily deployed in the cloud to save costs and ease management. Companies may also move Big
Data to the cloud for disaster recovery, replication, load balancing, storage, and other purposes.
Unfortunately, most of the data stores in use today—including Hadoop, Cassandra, and
MongoDB—do not incorporate sufficient data security tools to provide enterprises with the
peace of mind that confidential data will remain safe and secure at all times. The need for
security and privacy of enterprise data is not a new concept. However, the development of Big
Data changes the situation in many ways. To date, those charged with network security have
spent a great deal of time and money on perimeter-based security mechanisms such as firewalls,
but perimeter enforcement cannot prevent unauthorized access to data once a criminal or a
hacker has entered the network.
Add to this the fact that most Big Data platforms provide little to no data-level security along
with the alarming truth that Big Data centralizes most critical, sensitive, and proprietary data in a
single logical data store, and it’s clear that Big Data requires big security.
The lessons learned by the health care industry show that there is a way to keep Big Data secure
and in compliance. A combination of technologies has been assembled to meet four important
goals:
1. Control access by process, not job function. Server and network administrators, cloud
administrators, and other employees often have access to more information than their jobs
require because the systems simply lack the appropriate access controls. Just because a user has
operating system–level access to a specific server does not mean that he or she needs, or should
have, access to the Big Data stored on that server.
2. Secure the data at rest. Most consumers today would not conduct an online transaction
without seeing the familiar padlock symbol or at least a certification notice designating that
particular transaction as encrypted and secure. So why wouldn’t you require the same data to be
protected at rest in a Big Data store? All Big Data, especially sensitive information, should
remain encrypted, whether it is stored on a disk, on a server, or in the cloud and regardless of
whether the cloud is inside or outside the walls of your organization.
3. Protect the cryptographic keys and store them separately from the data.Cryptographic
keys are the gateway to the encrypted data. If the keys are left unprotected, the data are easily
compromised. Organizations—often those that have cobbled together their own encryption and
key management solution—will sometimes leave the key exposed within the configuration file or
on the very server that stores the encrypted data. This leads to the frightening reality that any
user with access to the server, authorized or not, can access the key and the data. In addition, that
key may be used for any number of other servers. Storing the cryptographic keys on a separate,
hardened server, either on the premises or in the cloud, is the best practice for keeping data safe
and an important step in regulatory compliance. The bottom line is to treat key security with as
much, if not greater, rigor than the data set itself.
4. Create trusted applications and stacks to protect data from rogue users. You may encrypt
your data to control access, but what about the user who has access to the configuration files that
define the access controls to those data? Encrypting more than just the data and hardening the
security of your overall environment—including applications, services, and configurations—
gives you peace of mind that your sensitive information is protected from malicious users and
rogue employees.
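The following is a minimal, hedged illustration of point 2: encrypting a value with a symmetric key before it is written to storage, using the standard Java cryptography API (AES-GCM). The sample plaintext is made up, and, as point 3 stresses, the key itself would be kept on a separate, hardened key management server rather than next to the data.

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class EncryptAtRestDemo {
    public static void main(String[] args) throws Exception {
        // Generate an AES key (in practice the key comes from a key management
        // server, not from code that sits next to the data).
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        // A fresh random IV (nonce) per record, stored alongside the ciphertext.
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext =
            cipher.doFinal("patient-id:12345".getBytes(StandardCharsets.UTF_8));

        // Only the ciphertext (and the IV) would be written to the Big Data store.
        System.out.println("ciphertext length = " + ciphertext.length + " bytes");
    }
}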
There is still time to create and deploy appropriate security rules and compliance objectives. The
health care industry has helped to lay some of the groundwork. However, the slow development
of laws and regulations works in favor of those trying to get ahead on Big Data. Currently, many
of the laws and regulations have not addressed the unique challenges of data warehousing. Many
of the regulations do not address the rules for protecting data from different customers at
different levels.
For example, if a database has credit card data and health care data, do the PCI Security
Standards Council and HIPAA apply to the entire data store or only to the parts of the data store
that have their types of data? The answer is highly dependent on your interpretation of the
requirements and the way you have implemented the technology.
Similarly, social media applications that are collecting tons of unregulated yet potentially
sensitive data may not yet be a compliance concern. But they are still a security problem that if
not properly addressed now may be regulated in the future. Social networks are accumulating
massive amounts of unstructured data, a primary fuel for Big Data, but they are not yet regulated, so this is not yet a compliance concern; it remains a security concern.
Security professionals concerned about how things like Hadoop and NoSQL deployments are
going to affect their compliance efforts should take a deep breath and remember that the general
principles of data security still apply. The first principle is knowing where the data reside. With
the newer database solutions, there are automated ways of detecting data and triaging systems
that appear to have data they shouldn’t.
Once you begin to map and understand the data, opportunities should become evident that will
lead to automating and monitoring compliance and security through data warehouse
technologies. Automation offers the ability to decrease compliance and security costs and still
provide the higher levels of assurance, which validates where the data are and where they are
going.
Of course, automation does not solve every problem for security, compliance, and backup. There
are still some very basic rules that should be used to enable security while not derailing the value
of Big Data:
Ensure that security does not impede performance or availability. Big Data is all
about handling volume while providing results, being able to deal with the velocity and
variety of data, and allowing organizations to capture, analyze, store, or move data in real
time. Security controls that limit any of these processes are a nonstarter for organizations
serious about Big Data.
Pick the right encryption scheme. Some data security solutions encrypt at the file level
or lower, such as including specific data values, documents, or rows and columns. Those
methodologies can be cumbersome, especially for key management. File level or internal
file encryption can also render data unusable because many applications cannot analyze
encrypted data. Likewise, encryption at the operating system level, but without advanced
key management and process-based access controls, can leave Big Data woefully
insecure. To maintain the high levels of performance required to analyze Big Data,
consider a transparent data encryption solution optimized for Big Data.
Ensure that the security solution can evolve with your changing
requirements. Vendor lock-in is becoming a major concern for many enterprises.
Organizations do not want to be held captive to a sole source for security, whether it is a
single-server vendor, a network vendor, a cloud provider, or a platform. The flexibility to
migrate between cloud providers and models based on changing business needs is a
requirement, and this is no different with Big Data technologies. When evaluating
security, you should consider a solution that is platform-agnostic and can work with any
Big Data file system or database, including Hadoop, Cassandra, and MongoDB.
One of the biggest issues around Big Data is the concept of intellectual property (IP). First we
must understand what IP is, in its most basic form. There are many definitions available, but
basically, intellectual property refers to creations of the human mind, such as inventions, literary
and artistic works, and symbols, names, images, and designs used in commerce. Although this is
a rather broad description, it conveys the essence of IP.
With Big Data consolidating all sorts of private, public, corporate, and government data into a
large data store, there are bound to be pieces of IP in the mix: simple elements, such as
photographs, to more complex elements, such as patent applications or engineering diagrams.
That information has to be properly protected, which may prove to be difficult, since Big Data
analytics is designed to find nuggets of information and report on them.
Here is a little background: Between 1985 and 2010, the number of patents granted worldwide
rose from slightly less than 400,000 to more than 900,000. That’s an increase of more than 125
percent over one generation (25 years). Patents are filed and backed with IP rights (IPRs).
Technology is obviously pushing this growth forward, so it only makes sense that Big Data will
be used to look at IP and IP rights to determine opportunity. This should create a major concern
for companies looking to protect IP and should also be a catalyst to take action. Fortunately,
protecting IP in the realm of Big Data follows many of the same rules that organizations have
already come to embrace, so IP protection should already be part of the culture in any enterprise.
The same concepts just have to be expanded into the realm of Big Data. Some basic rules are as
follows:
Understand what IP is and know what you have to protect. If all employees
understand what needs to be protected, they can better understand how to protect it and
whom to protect it from. Doing that requires that those charged with IP security in IT
(usually a computer security officer, or CSO) must communicate on an ongoing basis
with the executives who oversee intellectual capital. This may require meeting at least
quarterly with the chief executive, operating, and information officers and representatives
from HR, marketing, sales, legal services, production, and research and development
(R&D). Corporate leaders will be the foundation for protecting IP.
Prioritize protection. CSOs with extensive experience normally recommend doing a risk
and cost-benefit analysis. This may require you to create a map of your company’s assets
and determine what information, if lost, would hurt your company the most. Then
consider which of those assets are most at risk of being stolen. Putting these two factors
together should help you figure out where to best allocate your protective efforts.
Label. Confidential information should be labeled appropriately. If company data are
proprietary, note that on every log-in screen. This may sound trivial, but in court you may
have to prove that someone who was not authorized to take information had been
informed repeatedly. Your argument won’t stand up if you can’t demonstrate that you
made this clear.
Lock it up. Physical as well as digital protection schemes are a must. Rooms that store
sensitive data should be locked. This applies to everything from the server farm to the file
room. Keep track of who has the keys, always use complex passwords, and limit
employee access to important databases.
Educate employees. Awareness training can be effective for plugging and preventing IP
leaks, but it must be targeted to the information that a specific group of employees needs
to guard. Talk in specific terms about something that engineers or scientists have invested
a lot of time in, and they will pay attention. Humans are often the weakest link in the
defense chain. This is why an IP protection effort that counts on firewalls and copyrights
but ignores employee awareness and training is doomed to fail.
Know your tools. A growing variety of software tools are available for tracking
documents and other IP stores. The category of data loss protection (or data leakage
prevention) grew quickly in the middle of the first decade of this century and now shows
signs of consolidation into other security tool sets. Those tools can locate sensitive
documents and keep track of how they are being used and by whom.
Use a holistic approach. You must take a panoramic view of security. If someone is
scanning the internal network, your internal intrusion detection system goes off, and
someone from IT calls the employee who is doing the scanning and says, ―Stop doing
that.‖ The employee offers a plausible explanation, and that’s the end of it. Later the
night watchman sees an employee carrying out protected documents, whose explanation,
when stopped, is ―Oops, I didn’t realize that got into my briefcase.‖ Over time, the HR
group, the audit group, the individual’s colleagues, and others all notice isolated
incidents, but no one puts them together and realizes that all these breaches were
perpetrated by the same person. This is why communication gaps between infosecurity
and corporate security groups can be so harmful. IP protection requires connections and
communication among all the corporate functions. The legal department has to play a role
in IP protection, and so does HR, IT, R&D, engineering, and graphic design. Think
holistically, both to protect and to detect.
Use a counterintelligence mind-set. If you were spying on your own company, how
would you do it? Thinking through such tactics will lead you to consider protecting
phone lists, shredding the papers in the recycling bins, convening an internal council to
approve your R&D scientists’ publications, and coming up with other worthwhile ideas
for your particular business.
These guidelines can be applied to almost any information security paradigm that is
geared toward protecting IP. The same guidelines can be used when designing IP
protection for a Big Data platform.
Cyber Defense
Types of attacks include advanced malware, zero-day attacks, and advanced persistent threats.
Advance warning about attackers and intelligence about the threat landscape is
considered by many security leaders to be essential features in security technologies.
The purpose of the Big Data Analytics in Cyber Defense study sponsored by Teradata
and conducted by Ponemon Institute is to learn about organizations’ cyber security
defenses and the use of big data analytics to become more efficient in recognizing the
patterns that represent network threats.
Big data analytics in security involves the ability to gather massive amounts of digital
information to analyze, visualize and draw insights that can make it possible to predict
and stop cyber attacks.
The study looks at the awareness among IT and IT security practitioners about the new
data management and analytic technologies now available to help organizations become
more proactive and intelligent about detecting and stopping threats.
All respondents are familiar with their organization’s defense against cyber security
attacks and have some level of responsibility for managing the cyber security activities
within their organization.
While the theft's full damage is still unknown, the multipronged heist is another indicator that
cyberattacks are wreaking increasingly greater damage. In Ponemon Institute's upcoming 2013
Cost of Cyber Crime Study, the firm reports this year's average annualized cost of cybercrime
was $7.2 million per company polled in its study — a 30 percent increase in mean value over last
year. The report also says successful cyberattacks increased 20 percent over last year, with each
company surveyed experiencing 1.4 successful attacks per week.
"We used to make statements, such as 'I have a firewall; I'm protected,' or 'I have antivirus
software; I'm protected,'" says Todd Pedersen, a cybersecurity lead for CSC. "Now,
the conversation is less about preventing an attack, threat or exposure, and more about how
quickly you can detect that an attack is happening."
There's a growing demand for security information and event management (SIEM) technologies
and services, which gather and analyze security event big data that is used to manage
threats. Increasing numbers of regulations and mandates generated throughout the globe also are
pushing the adoption of SIEM technologies and services.
"Both governments and industries are introducing more and more regulations and mandates that
require the use of better data protection and security controls to help guard systems, information
and individuals," says Matthew O'Brien, a global cybersecurity expert for CSC.
In the United States, the Federal Information Security Management Act, Health Insurance
Portability and Accountability Act, Sarbanes-Oxley Act, and the Department of Homeland
Security's Critical Infrastructure Protection guidelines, to name a few, all have requirements tied
to collecting and logging information, events and activities that occur within an organization's
environment — requirements that SIEM-related technologies and services help organizations
meet.
For example, every second, more than 300,000 events generated by CSC and its customers run
through CSC's Global Security Operations Centers.
"SIEM gives us the ability to take this massive amount of data and bring it all back to a central
place, where it's combined with the other information we get from numerous security
technologies," says Pedersen. "That gives us the ability to detect things that no individual
technology in and of itself would have picked up, and create a picture to analyze, investigate and
find security-related issues."
New levels of awareness
This SIEM capability also has become critical as organized crime, along with some nations'
armed forces and intelligence services, moves center stage in the cyber arena,
launching weapons-grade cyberattacks and advanced persistent threats.
At times these threats are global; at other times, attackers aim for specific industries. Ponemon's
report says, "The average annualized cost of cybercrime appears to vary by industry segment,
where organizations in defense, financial services, and energy and utilities experience
substantially higher cybercrime costs than organizations in retail, media and consumer products."
"SIEM helps us create an environment that allows us to use a broad range of tools, some of
which we select for a specific customer environment, and yet accrue data in a common
environment and use that common environment for correlation and analysis," says Pedersen.
Increasing enterprise system complexity also creates a driver for SIEM. Today's organizations
are adding greater numbers of connections, also known as endpoints, to their systems, either due
to incorporating mobile devices, the bring-your-own-device trend, expanding supply chains, or a
desire to link their IT systems with their industrial control systems.
"The number of integration points with other technologies and the processes that support them
today can be overwhelming," says O'Brien. "As we ask our systems to do more, they
also become more vulnerable, which means we need a level of awareness that wasn't required
before."
UNIT 5
Case Studies: MapReduce: Simplified Data Processing on Large Clusters- RDBMS to NoSQL:
Reviewing Some Next-Generation Non-Relational Database's - Analytics: The real-world use of big data
- New Analysis Practices for Big Data.
Course Objectives:
The students are able to understand the concepts of moving from RDBMS to NoSQL
To understand real-time applications of Big Data
Course Outcomes:
The students can use the tools of Big Data
The students will learn how to process data on large clusters
The students will be able to turn Big Data into big money
MapReduce:
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing primitives are
called mappers and reducers. Decomposing a data processing application into mappers and
reducers is sometimes nontrivial. But, once we write an application in the MapReduce form,
scaling the application to run over hundreds, thousands, or even tens of thousands of
machines in a cluster is merely a configuration change. This simple scalability is what has
attracted many programmers to use the MapReduce model.
PayLoad - Applications implement the Map and the Reduce functions, and form the
core of the job.
Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode - Node that manages the Hadoop Distributed File System (HDFS).
DataNode - Node where the data resides before any processing takes place.
MasterNode - Node where the JobTracker runs and which accepts job requests from
clients.
JobTracker - Schedules jobs and tracks the assigned jobs to the TaskTracker.
Generally, the MapReduce paradigm is based on sending the computation to where the
data resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle
stage, and the reduce stage.
o Map stage: The map or mapper's job is to process the input data. Generally
the input data is in the form of a file or directory and is stored in the Hadoop
file system (HDFS). The input file is passed to the mapper function line by
line. The mapper processes the data and creates several small chunks of data.
o Shuffle stage: The framework sorts the intermediate key/value pairs produced
by the mappers and groups together all values that share the same key before
handing them to the reducers.
o Reduce stage: The reducer's job is to process the data that comes from the
mapper. After processing, it produces a new, smaller set of output, which is
stored in HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks, which reduces
the network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form
an appropriate result, and sends it back to the Hadoop server.
Example Scenario
Given below is the data regarding the electrical consumption of an organization. It contains
the monthly electrical consumption and the annual average for various years.
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
If the above data is given as input, we have to write applications to process it and produce
results such as finding the year of maximum usage, the year of minimum usage, and so on.
This is a walkover for programmers working with a finite number of records: they will
simply write the logic to produce the required output and pass the data to the application
they have written.
But think of the data representing the electrical consumption of all the large-scale
industries of a particular state since its formation.
Input Data
The above data is saved as sample.txt and given as input. The input file looks as shown
below.
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
Example Program
Given below is the program to process the sample data using the MapReduce framework.
package hadoop;

import java.util.*;
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class ProcessUnits {
   //Mapper class
   public static class E_EMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

      //Map function
      public void map(LongWritable key, Text value,
         OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
         String line = value.toString();
         String lasttoken = null;
         StringTokenizer s = new StringTokenizer(line);   // split the record on whitespace
         String year = s.nextToken();
         while (s.hasMoreTokens()) {
            lasttoken = s.nextToken();                    // keep the last column (the yearly average)
         }
         int avgprice = Integer.parseInt(lasttoken);
         output.collect(new Text(year), new IntWritable(avgprice));
      }
   }

   //Reducer class
   public static class E_EReduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

      //Reduce function
      public void reduce(Text key, Iterator<IntWritable> values,
         OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
         int maxavg = 30;                                 // report only years whose value exceeds 30
         int val = Integer.MIN_VALUE;
         while (values.hasNext()) {
            if ((val = values.next().get()) > maxavg) {
               output.collect(key, new IntWritable(val));
            }
         }
      }
   }

   //Main function
   public static void main(String args[]) throws Exception {
      JobConf conf = new JobConf(ProcessUnits.class);
      conf.setJobName("max_eletricityunits");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(E_EMapper.class);
      conf.setCombinerClass(E_EReduce.class);
      conf.setReducerClass(E_EReduce.class);
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
   }
}
Save the above program as ProcessUnits.java. The compilation and execution of the
program is explained below.
Follow the steps given below to compile and execute the above program.
Step 1
The following command is to create a directory to store the compiled java classes.
$ mkdir units
Step 2
Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce
program. Visit the following link
http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/1.2.1 to download the jar.
Let us assume the downloaded folder is /home/hadoop/.
Step 3
The following commands are used for compiling the ProcessUnits.java program and
creating a jar for the program.
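Assuming the jar was downloaded to /home/hadoop/ as noted above, and that the compiled
classes go into the units directory created in Step 1, the commands might look like the following:
$ javac -classpath /home/hadoop/hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .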
Step 4
The following command is used to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5
The following command is used to copy the input file named sample.txt to the input
directory of HDFS.
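For example, assuming sample.txt was saved locally under /home/hadoop/, the copy could be
done with:
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir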
Step 6
The following command is used to verify the files in the input directory.
$HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7
The following command is used to run the Eleunit_max application by taking the input files
from the input directory.
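Assuming the jar built in Step 3 is named units.jar and the driver class is hadoop.ProcessUnits
(as in the listing above), the job can be launched along these lines:
$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir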
Wait for a while until the job completes. After execution, as shown below, the output will
contain the number of input splits, the number of Map tasks, the number of Reducer tasks,
etc.
FileSystemCounters
Map-Reduce Framework
File Output Format Counters
Bytes Written=40
Step 8
The following command is used to verify the resultant files in the output folder.
$HADOOP_HOME/bin/hadoop fs -ls output_dir/
Step 9
The following command is used to see the output in the Part-00000 file. This file is generated
by HDFS.
$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000
1981 34
1984 40
1985 45
Step 10
The following command is used to copy the output folder from HDFS to the local file
system for analysis.
$HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop
Important Commands
All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command.
Running the Hadoop script without any arguments prints the description for all commands.
The following table lists the options available and their descriptions.
Options and Description
classpath : Prints the class path needed to get the Hadoop jar and the required libraries.

GENERIC_OPTIONS and Description
-status <job-id> : Prints the map and reduce completion percentage and all job counters.
-events <job-id> <from-event-#> <#-of-events> : Prints the events' details received by the
jobtracker for the given range.
-history [all] <jobOutputDir> : Prints job details, and failed and killed tip details. More details
about the job, such as successful tasks and the task attempts made for each task, can be viewed
by specifying the [all] option.
-list [all] : Displays all jobs. -list displays only jobs which are yet to complete.
-kill-task <task-id> : Kills the task. Killed tasks are NOT counted against failed attempts.
-fail-task <task-id> : Fails the task. Failed tasks are counted against failed attempts.
-set-priority <job-id> <priority> : Changes the priority of the job. Allowed priority values are
VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.
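These GENERIC_OPTIONS are used with the hadoop job command. A few illustrative
invocations (shown with placeholder IDs rather than real ones) might look like:
$HADOOP_HOME/bin/hadoop job -status <job-id>
$HADOOP_HOME/bin/hadoop job -list all
$HADOOP_HOME/bin/hadoop job -kill-task <task-id>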
RDBMS to NoSQL:
What is RDBMS?
RDBMS stands for Relational Database Management System. RDBMS is the basis
for SQL, and for all modern database systems like MS SQL Server, IBM DB2, Oracle,
MySQL, and Microsoft Access.
Challenges of RDBMS
RDBMS assumes a well-defined structure of data and assumes that the data is largely
uniform.
It needs the schema of your application and its properties (columns, types, etc.) to be
defined up-front before building the application. This does not match well with the agile
development approaches for highly dynamic applications.
As the data starts to grow larger, you have to scale your database vertically, i.e. adding
more capacity to the existing servers.
As an example, consider that you have a blogging application that stores user blogs.
Now suppose that you have to incorporate some new features in your application such as
users liking these blog posts or commenting on them or liking these comments. With a
typical RDBMS implementation, this will need a complete overhaul of your existing
database design. However, if you use NoSQL in such scenarios, you can easily modify your
data structures to match these agile requirements. With NoSQL you can directly start inserting
this new data into your existing structure without creating any new pre-defined columns or
pre-defined structure, as illustrated below.
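As a rough sketch (the collection layout and field names here are invented for illustration), a
single blog post stored in a document database such as MongoDB could simply gain new
fields as the features are added, with no schema migration:
{
  "_id": "post-101",
  "title": "Our first post",
  "body": "Some text",
  "likes": 42,
  "comments": [
    { "user": "reader1", "text": "Nice post!", "likes": 3 }
  ]
}
Older documents that lack the likes or comments fields remain valid alongside the new ones,
which is what makes this kind of incremental change painless.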
Schema Less:
NoSQL databases, being schema-less, do not define any strict data structure.
Scales Horizontally:
In contrast to SQL databases which scale vertically, NoSQL scales horizontally by adding
more servers and using the concepts of sharding and replication. This behavior of NoSQL fits
well with cloud computing services such as Amazon Web Services (AWS), which allow you
to provision virtual servers and expand capacity horizontally on demand.
Better Performance:
All the NoSQL databases claim to deliver better and faster performance as compared to
traditional RDBMS implementations.
Talking about the limitations, since NoSQL is an entire set of databases (and not a single
database), the limitations differ from database to database. Some of these databases do not
support ACID transactions while some of them might be lacking in reliability. But each one
of them has their own strengths due to which they are well suited for specific requirements.
Key/Value Databases:
The key of a key/value pair is a unique value in the set and can be easily looked up to access
the data. Key/value stores are of varied types: some keep the data in memory and some
provide the capability to persist the data to disk. A simple, yet powerful, key/value store is
Oracle's Berkeley DB. A brief sketch of the key/value access pattern is given below.
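The following minimal sketch shows the put/get-by-key access pattern using the Berkeley DB
Java Edition API (com.sleepycat.je). It assumes the JE library is on the classpath; the
environment directory, database name and keys are invented for illustration.

import com.sleepycat.je.*;
import java.io.File;

public class KeyValueSketch {
   public static void main(String[] args) throws Exception {
      // Open (or create) a Berkeley DB environment in a local directory
      File dir = new File("/tmp/bdb-env");
      dir.mkdirs();
      EnvironmentConfig envConfig = new EnvironmentConfig();
      envConfig.setAllowCreate(true);
      Environment env = new Environment(dir, envConfig);

      // Open (or create) a database inside that environment
      DatabaseConfig dbConfig = new DatabaseConfig();
      dbConfig.setAllowCreate(true);
      Database db = env.openDatabase(null, "users", dbConfig);

      // Store a value under a unique key
      DatabaseEntry key = new DatabaseEntry("user:42".getBytes("UTF-8"));
      DatabaseEntry value = new DatabaseEntry("alice".getBytes("UTF-8"));
      db.put(null, key, value);

      // Look the value up again by its key
      DatabaseEntry found = new DatabaseEntry();
      if (db.get(null, key, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
         System.out.println(new String(found.getData(), "UTF-8"));   // prints "alice"
      }

      db.close();
      env.close();
   }
}

The point is that every read and write is addressed by the key alone; there is no query planner
or join machinery in between, which is what makes key/value stores so simple and fast.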
PART A - TWO MARKS
UNIT-I
Big data means really a big data; it is a collection of large datasets that cannot be processed using traditional
computing techniques. Big data is not merely a data; rather it has become a complete subject, which involves
various tools, techniques and frameworks.
Volume. Big Data comes in one size: large. Enterprises are awash with data, easily amassing
terabytes and even petabytes of information.
Variety. Big Data extends beyond structured data to include unstructured data of all varieties: text,
audio, video, click streams, log files, and more.
Veracity. The massive amounts of data collected for Big Data purposes can lead to statistical errors
and misinterpretation of the collected information. Purity of the information is critical for value.
Velocity. Often time sensitive, Big Data must be used as it is streaming into the enterprise in order to
maximize its value to the business, but it must also still be available from the archival sources as
well.
Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn,
leads to smarter business moves, more efficient operations, higher profits and happier customers.
Cost reduction
Faster, better decision making
Competitive Advantage
New business opportunities
New products and services
Capturing data
Storage
Curation
Searching
Sharing
Transfer
Analysis
Presentation.
Making tools easier to use. The Hadoop stack and NoSQL stores really do require programming knowledge to
unlock their power.
Getting quicker answers across large data sets. We can already get them in "acceptable" amounts of time; it's
about getting that 3-hour query down to 5 minutes or less. Apache Impala (incubating) is a good
example of work in this space.
Data volumes will continue to grow. There's absolutely no question that we will continue generating
larger and larger volumes of data, especially considering that the number of handheld devices and
Internet-connected devices is expected to grow exponentially.
Ways to analyse data will improve. While SQL is still the standard, Spark is emerging as a
complementary tool for analysis and will continue to grow, according to Ovum.
More tools for analysis (without the analyst) will emerge. Microsoft and Salesforce
both recently announced features to let non-coders create apps to view business data.
Prescriptive analytics will be built in to business analytics software.
The name Hadoop has become synonymous with big data. It's an open-source software framework for
distributed storage of very large datasets on computer clusters. That means you can scale your data up and
down without having to worry about hardware failures. Hadoop provides massive amounts of storage for any
kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
11. What is Cloudera?
Cloudera is essentially a brand name for Hadoop with some extra services stuck on. It can help your
business build an enterprise data hub, to allow people in your organization better access to the data you are
storing.
Tableau
Silk
Cartodb
Chartio
Data wrapper
MongoDB is the modern, start-up approach to databases. Think of it as an alternative to relational
databases. It's good for managing data that changes frequently or data that is unstructured or
semi-structured.
14. What is the R language?
R is a language for statistical computing and graphics. If the data mining and statistical software listed
above doesn't quite do what you want it to, learning R is the way forward. In fact, if you're planning
on being a data scientist, knowing R is a requirement.
UNIT2
Big data analytics is the process of examining large data sets containing a variety of data types -- i.e.,
big data -- to uncover hidden patterns, unknown correlations, market trends, customer preferences
and other useful business information.
• Graphical facilities for data analysis and display either directly at the computer or on hardcopy
• A well developed, simple and effective programming language (called 'S') which includes
conditionals, loops, user defined recursive functions and input and output facilities. (Indeed most of
the system supplied functions are themselves written in the S language.)
3. Define Environment.
The term “environment” is intended to characterize it as a fully planned and coherent system, rather
than an incremental accretion of very specific and inflexible tools, as is frequently the case with
other data analysis software.
• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames
6. Define List.
A list is an R-object which can contain many different types of elements inside it like vectors,
functions and even another list inside it.
Factors are the R-objects which are created using a vector. It stores the vector along with the distinct
values of the elements in the vector as labels. The labels are always characters irrespective of whether
the input vector is numeric, character or Boolean. They are useful in statistical modeling. Factors are
created using the factor() function.
Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain
different modes of data: the first column can be numeric while the second column can be character
and the third column can be logical. A data frame is a list of vectors of equal length. Data frames are
created using the data.frame() function. The nlevels() function gives the count of levels of a factor.
A function is a set of statements organized together to perform a specific task. R has a large number
of in-built functions and the user can create their own functions.
Data Reshaping in R is about changing the way data is organized into rows and columns. Most of the
time data processing in R is done by taking the input data as a data frame. It is easy to extract data
from the rows and columns of a data frame but there are situations when we need the data frame in a
format that is different from format in which we received it. R has many functions to split, merge
and change the rows to columns and vice-versa in a data frame.
XML stands for Extensible Markup Language. It is a file format which is used to share both the file
structure and the data on the World Wide Web, intranets, and elsewhere using standard ASCII text.
12. What is Data Modeling?
• Data modeling is a useful technique to manage a workflow for various entities and for
making a sequential workflow in order to have a successful completion of a task.
• For Hadoop and its big data model, we need to make a comprehensive study before
implementing any execution task and setting up any processing environment (XML).
Similar to HTML, XML contains markup tags, but the tags describe the data rather than
how it is displayed.
An operator is a symbol that tells the compiler to perform specific mathematical or logical
manipulations. R language is rich in built-in operators and provides following types of operators.
• Arithmetic Operators
• Relational Operators
• Logical Operators
• Assignment Operators
UNIT-III
1. What is Hadoop?
Hadoop runs applications using the MapReduce algorithm, where the data is processed in
parallel on different CPU nodes. In short, the Hadoop framework is capable enough to develop
applications that run on clusters of computers and perform complete statistical analysis of huge
amounts of data.
2. Describe the Hadoop architecture.
Servers can be added or removed from the cluster dynamically and Hadoop continues to
operate without interruption.
4. Define MapReduce in Hadoop.
The Map task takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples. The reduce task is always
performed after the map job.
UNIT-IV
Compliance issues are becoming a big concern in the data center, and these issues have a
major effect on how Big Data is protected, stored, accessed, and archived. Whether Big Data is
going to reside in the data warehouse or in some other more scalable data store remains unresolved
for most of the industry; it is an evolving paradigm.
The concepts of Big Data are as applicable to health care as they are to other businesses.
Health care deals with these massive data sets using Big Data stores, which can span tens of
thousands of computers to enable enterprises, researchers, and governments to develop
innovative products, make important discoveries, and generate new revenue streams.
In the medical industry, the primary problem is that unsecured Big Data stores are filled with
content that is collected and analyzed in real time and is often extraordinarily sensitive:
intellectual property, personal identifying information, and other confidential information.
The data stores used in big data are Hadoop, Cassandra, and MongoDB.
9. What is IP?
One of the biggest issues around Big Data is the concept of intellectual property
(IP).Intellectual property refers to creations of the human mind, such as inventions, literary and
artistic works, and symbols, names, images, and designs used in commerce. Although this is a rather
broad description, it conveys the essence of IP.
MapReduce is a framework using which we can write applications to process huge amounts of data,
in parallel, on large clusters of commodity hardware in a reliable manner.
•PayLoad
•NameNode
•DataNode
•MasterNode
•SlaveNode
•JobTracker
•Task Tracker
•Job
•Task
•task attempt.
3. What is RDBMS?
RDBMS stands for Relational Database Management System. RDBMS is the basis for SQL, and for
all modern database systems like MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft
Access.
4. Challenges of RDBMS
•RDBMS assumes a well-defined structure of data and assumes that the data is largely uniform.
•It needs the schema of your application and its properties (columns, types, etc.) to be defined up-
front before building the application. This does not match well with the agile development
approaches for highly dynamic applications.
•As the data starts to grow larger, you have to scale your database vertically, i.e. adding more
capacity to the existing servers.
Schema Less
Dynamic and Agile
Scales Horizontally
Better Performance
A NoSQL (originally referring to "non SQL" or "non relational") database provides a mechanism
for storage and retrieval of data that is modeled in means other than the tabular relations used in
relational databases (RDBMS).
PART B
3. What is Big Data? Summarize the evolution of big data. (MAR 21)(or) Describe the evolution of
Bigdata(Jan 23)
4. Explain characteristics of big data and discuss the importance of big data analytics in various
business domains. (MAR 21)
6. Interpret the challenges and issues involved in Big data. (SEP 21)(or) List and explain the
characteristics, challenges and issues in bigdata(Dec 2023)
7. Discuss the use of Big data analytics and its importance with suitable real world examples. (MAR 22)
9. Illustrate the need for bigdata and briefly discuss the applications of Bigdata(Dec 2023)(or) what is
bigdata analytics? Explain four V’s of bigdata. Briefly discuss applications of Bigdata. (Dec 2024).
10. What are the benefits of bigdata? Discuss challenges under bigdata. How big data analytics can be
useful in the development of smart cities(Dec 2024).
UNIT –2
1. Describe a Data Frame in R with its basic function.(NOV 19)
2. Review the hybrid data modeling approach in detail. (MAR 21)
3. Describe about how to handle Big data analytics with R programming. (SEP 21)
4. Explain in detail 'about data computing modelling(SEP 21)(or) Explain R modeling
architecture(Jan 23)
5. Explain the concept of analyzing and exploring data with R language. (MAR 22)(or) List
the steps to explore data in R(Jan 23)
UNIT –3
3. Describe core architecture of Hadoop with suitable block diagram. Discuss the role of
each component in detail. (MAR 21) (or) Illustrate the architecture of Hadoop with
suitable block diagram. Discuss the role of each component in detail (Dec 2023)
4. Give the features of column oriented database management system, explain the
storage management of HBASE with example (MAR 21)
5. Describe the major components of Cassandra Data Model.(or) Write a detailed note
on Apache Cassandra (Jan 23)
6. Specify the similarities and differences between Hadoop, HBASE, and Cassandra
7. Outline the features of Hadoop and explain the functionalities of Hadoop(SEP 21).
8. How Hadoop streaming is suited with text processing explain? (Dec 2024).
9. Describe HBASE and write down the advantages of storing data with HBASE. (SEP 21)
11. Highlight the features of Apache Mahout in detail,(MAR 22)(or) Describe the feature of Apache
Mahout(Jan 23)
12. Discuss the NoSQL data stores and their characteristics features(Dec 2023) (or) Explain in detail
about an open source NoSQL Distributed database. (Dec 2024).
UNIT –4
1. Elucidate compliance issues and its major effect on Big Data. (NOV 19)
(or) Describe bigdata compliance and list the basic rules that enable security in bigdata(Dec 2023)
3. Brief the role of data classification to determine the sensitivity of data. Create
appropriate security rules and compliance objectives for health care industry. (MAR
21) (or) Explain in detail about the bigdata security and its compliance (Dec 2024).
4. State the relation of big data analytics to cyber security. Give a detailed description of
how business can utilize big data analytics to address cyber security threats. (MAR 21)
5. Illustrate about classifying data and protecting Big data analytics. (SEP 21)
6. Discuss in detail about Big data in cyber defence. (SEP 21) (or) How Big data helps in cyber
defence (Jan 23)
7. Explain about the pragmatic steps to securing Big data in detail. (MAR 22)(or) Discuss
the pragmatic steps to secure Bigdata(Jan 23)(or) Why bigdata security is essential? Explain in detail
about bigdata security(Dec 2023)
8. Paraphrase about Intellectual Challenges in Big data. (MAR 22)(or) Write in detail about
the intellectual property challenge and the use of Bigdata in cyber defence. (Dec 2024).
UNIT –5
1. (a) Analyze the SimpleDB Data Model and Architecture.
(b) "Big data is dependent upon a scalable and extensible information foundation" (NOV 19)
2. Examine the new analysis practices for big data. (NOV 19) (or) Discuss the new analysis practices for
bigdata (Jan 23) (or) Outline the new analysis practices for Bigdata in detail with a case study. (Dec 2023)
3. Analyze the impact of using the MapReduce technique on 'Count of URL access frequency' in
large clusters. (MAR 21) (or) Describe in detail the mapper class, reducer class and scaling
out with an example (Dec 2024).
4. Discuss the role of big data in education industry to improve the operational effectiveness
5. Describe RDBMS to NoSQL, reviewing next-generation non-relational databases. (SEP 21) (or)
Discuss briefly about the evolution from RDBMS to NoSQL (Dec 2024).
6. Define Big data Analytics, Explain in detail about new analysis practices for Big data.(SEP 21).
7. Paraphrase about simplified data processing on large clusters. (MAR 22) (or) Explicate the simplified
method of data processing on large clusters (Jan 23) (or) Demonstrate the implementation of data processing
on large clusters using MapReduce with an example (Dec 2023)
8. Explain in detail about the real-world use of Big data with an example. (MAR 22)
PREVIOUS YEAR QUESTION PAPERS

B.Tech. DEGREE EXAMINATION, JANUARY 2023 - Seventh Semester, Information Technology (2013-14 Regulations)
B.Tech. DEGREE EXAMINATION, MARCH 2022 - Seventh Semester, Information Technology
B.Tech. DEGREE EXAMINATION, SEPTEMBER 2021 - Seventh Semester, Information Technology, BIG DATABASES
B.Tech. DEGREE EXAMINATION, MARCH 2021 - Seventh Semester, Information Technology, BIG DATABASES
B.Tech. DEGREE EXAMINATION, NOVEMBER 2019 - Seventh Semester, Information Technology, Elective - BIG DATABASES (2013-14 Onwards Regulations). Time: Three hours. Maximum: 75 marks.

From the JANUARY 2023 paper:
3. What is "R"?

From the SEPTEMBER 2021 paper:
9. Comprehend cluster in Big data.
10. What are the benefits of MapReduce?
SECTION B - (5 x 11 = 55 marks). Answer ALL questions, ONE question from each Unit. All questions carry equal marks.
UNIT IV
17. Illustrate about classifying data and protecting Big data analytics. (11)
18. Discuss in detail about Big data in cyber defence. (11)

From the NOVEMBER 2019 paper:
7. Write the basic rules used for enabling Big Data security.
8. How can the data be protected?
9. What are master data structures?
10. State the characteristics of NoSQL databases.
PART B - (5 x 11 = 55 marks). Answer ALL questions, choosing ONE from each Unit.
UNIT I
11. (a) Discuss the four dimensions which are related to the primary aspects of Big Data. (6)
    (b) Why is Big Data important? Explain. (5)
Or
12. What are the issues and challenges associated with Big Data? Explain. (11)
UNIT II
13. Describe a Data Frame in R with its basic function. (11)
Or
14. Narrate the different data types/objects in R with suitable illustrations. (11)
UNIT III
15. (a) Explain about the basic parameters of the mapper and reducer functions. (7)
    (b) Compare RDBMS with Hadoop MapReduce. (4)
Or
16. (a) Describe the major components of the Cassandra Data Model. (6)
    (b) Specify the similarities and differences between Hadoop, HBASE, and Cassandra. (5)
UNIT IV
17. Elucidate compliance issues and their major effect on Big Data. (11)
Or
18. Discuss about the following:
    (a) Pragmatic Steps to Secure Big Data. (6)
    (b) Protecting Big Data Analytics. (5)
UNIT V
19. (a) Analyze the SimpleDB Data Model and Architecture. (8)
    (b) "Big data is dependent upon a scalable and extensible information foundation" - Comment. (3)
Or
20. Examine the new analysis practices for big data. (11)