CCS334-Big-Data-Analytics UNIVERSITY QP

UNIT-1

UNDERSTANDING BIG DATA


Part A
1. What is big data analytics?
Big Data analytics is the process of extracting meaningful insights, such as
hidden patterns, unknown correlations, market trends, and customer preferences,
from large data sets. It provides various advantages: it can be used for better
decision making and for preventing fraudulent activities, among other things.

2. What is big data?


Big Data refers to data sets so massive that they cannot be stored, processed, or
analyzed using traditional tools.
For example, data in a regular Excel sheet is structured data, with a
definite format. In contrast, emails are semi-structured, and pictures and videos
are unstructured data. All of this data combined makes up Big Data.

3. Why is big data analytics important?


Big data analytics assists organizations in harnessing their data and identifying
new opportunities. As a result, smarter business decisions are made, operations are more
efficient, profits are higher, and customers are happier.

4. How does big data analytics work?


First, gather the data. Once data has been collected and stored, it must be correctly
organized in order to produce reliable answers to analytical queries, especially when the data
is huge and unstructured. The data is then cleaned and analyzed.

5. Who uses big data analytics?


Industries that use big data analytics include Banking and Securities, Healthcare
Providers, Communications, Media and Entertainment, Education, Government, Retail and
Wholesale Trade, Manufacturing, Natural Resources, and Insurance.

6. What are the five types of big data analytics?


The five types of big data analytics are Prescriptive Analytics, Diagnostic
Analytics, Cyber Analytics, Descriptive Analytics, and Predictive Analytics.

7. What are the 3 types of big data?
There are three types of big data: structured data, unstructured data, and
semi-structured data.

8. What are the advantages of big data?


Businesses can tailor products to customers based on big data instead of spending
a fortune on ineffective advertising. Businesses may use big data to study consumer
patterns by tracking POS transactions and internet purchases.

9. Why do we need big data analytics?


Organizations may harness their data and utilize big data analytics to find new
possibilities. This results in wiser company decisions, more effective operations, higher
profitability, and happier clients. Businesses that employ big data and advanced
analytics benefit in a variety of ways, including cost reduction.

10. What is big data in simple words?


Big data is a collection of large, complex, and voluminous data that traditional data
management tools cannot store or process.

11. What is the meaning of big data analytics?


Big data analytics refers to the complex process of analyzing big data to reveal
information such as correlations, hidden patterns, market trends, and customer preferences.
12. What are the applications of Big data?
 Transportation
 Advertising and Marketing
 Banking and Financial Services
 Government
 Media and Entertainment
 Meteorology
 Healthcare
 Cyber security and Education

13. What are the industry examples of big data?


 Retail
 Banking
 Manufacturing
 Education
 Government
 Health Care

14. What are 4 types of big data technologies?


Big data technologies can be categorized into four main types:
 data storage
 data mining
 data analytics
 data visualization

15. What is unstructured data?
Unstructured data is data that does not conform to a data model and
has no easily identifiable structure, so it cannot easily be used by a computer
program. Unstructured data is not organized in a pre-defined manner and does
not have a pre-defined data model, so it is not a good fit for a mainstream
relational database.

16. What are the advantages and disadvantages of unstructured data?


Advantages of Unstructured Data:
 It supports data that lacks a proper format or sequence.
 The data is not constrained by a fixed schema.
 It is very flexible due to the absence of a schema.
 The data is portable and very scalable.
 It can deal easily with the heterogeneity of sources.
Disadvantages of Unstructured Data:
 It is difficult to store and manage due to the lack of schema and structure.
 Ensuring data security is a difficult task.

17. What is web analytics?


Web analytics is the process of analyzing the behavior of visitors to a website.
This involves tracking, reviewing and reporting data to measure web activity, including the
use of a website and its components, such as webpages, images and videos.

18. What is Hadoop?


Hadoop is an Apache open-source framework written in Java that allows
distributed processing of large datasets across clusters of computers using simple
programming models. A Hadoop framework application works in an environment that
provides distributed storage and computation across clusters of computers. Hadoop is
designed to scale up from a single server to thousands of machines, each offering local
computation and storage.

19. What is mapreduce?


MapReduce is a parallel programming model for writing distributed
applications, devised at Google for efficient processing of large amounts of data (multi-
terabyte data sets) on large clusters (thousands of nodes) of commodity hardware in a
reliable, fault-tolerant manner. MapReduce programs run on Hadoop, an Apache
open-source framework.
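As a quick illustration, here is a minimal word-count sketch written against the Hadoop MapReduce Java API; the class names (WordCount, TokenizerMapper, IntSumReducer) are conventional examples, not part of the question paper.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}

The mapper performs the transformation (lines to key-value pairs) and the reducer performs the aggregation (summing counts), matching the roles described above.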

20. What are the advantages and disadvantages of Hadoop?


Advantages of Hadoop:
 The Hadoop framework allows the user to quickly write and test distributed
systems.
 It is efficient: it automatically distributes the data and work across the
machines and, in turn, utilizes the underlying parallelism of the CPU cores.
Disadvantages of Hadoop:
 Not very effective for small data.
 Cluster management is hard.
 Stability issues and security concerns.
PART B
1. Explain in detail about the architecture of Hadoop and HDFS.
 Definition
 Hadoop Architecture
 Modules and Features of Hadoop
 Hadoop Distributed File System
 Advantages of Hadoop
 Disadvantages of Hadoop

2. Explain in detail about industry examples of big data.

 Definition
 Examples of big data.
1.) Retail
2.) Banking
3.) Manufacturing
4.) Education
5.) Government and Health Care
3. Explain in detail about Unstructured data
 Definition
 Characteristics of Unstructured Data:
 Sources of Unstructured Data:
 Problems faced in storing unstructured data:
 Possible solution for storing Unstructured data:
 Extracting information from unstructured Data:
 Advantages of Unstructured Data.
 Disadvantages of Unstructured data:
4. Define web analytics and explain it.

 Definition
 Steps of web analytics
 Main categories of web analytics
 Web analytics tools and examples

5. Explain in detail about open source technologies.


 Open source technologies (Cassandra, Greenplum, MongoDB, CouchDB, MariaDB)
 Features
 Pros
 Cons

UNIT-2
NOSQL DATA MANAGEMENT

PART A

1. Define NoSQL database.


NoSQL Database refers to a non-SQL or non-relational database. It
provides a mechanism for the storage and retrieval of data other than the tabular relations
model used in relational databases. A NoSQL database does not use tables for storing data.
It is generally used to store big data and for real-time web applications.

2. Define aggregate data models


The term aggregate means a collection of objects that we treat as a
unit. An aggregate is a collection of data that we interact with as a unit. These units of data,
or aggregates, form the boundaries for ACID operations.
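A minimal sketch of an aggregate as plain Java, assuming a hypothetical order domain: the order, its line items, and its shipping address travel together as one unit, which is exactly what an aggregate-oriented store would save and load in a single operation.

import java.util.List;

// All class and field names below are illustrative.
record LineItem(String productId, int quantity, double price) {}

record ShippingAddress(String street, String city, String pin) {}

// The aggregate: the order plus everything needed to work with it as a unit.
record Order(String orderId,
             String customerId,
             ShippingAddress shipTo,
             List<LineItem> items) {}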

3. Define key-value data model.


A key-value data model or database is also referred to as a key-value store.
It is a non-relational type of database.

4. Define graph databases.


A graph database is a type of NoSQL database that is designed to handle data
with complex relationships and interconnections. In a graph database, data is stored as
nodes and edges, where nodes represent entities and edges represent the relationships
between those entities.
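A toy sketch of the graph model in Java; the node and edge types and the FRIEND_OF relation are illustrative only, not tied to any particular graph database product.

import java.util.ArrayList;
import java.util.List;

public class TinyGraph {
    record Node(String id, String label) {}
    record Edge(Node from, String relation, Node to) {}

    public static void main(String[] args) {
        Node alice = new Node("1", "Person");
        Node bob   = new Node("2", "Person");
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge(alice, "FRIEND_OF", bob));

        // Traversal means following edges rather than joining tables.
        for (Edge e : edges) {
            System.out.println(e.from().id() + " -" + e.relation() + "-> " + e.to().id());
        }
    }
}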

5. Define schemaless databases


A schemaless database, like MongoDB, does not impose up-front schema
constraints, mapping to a more 'natural' database. Even when sitting on top of a data
lake, each document is created with a partial schema to aid retrieval. Any data, formatted
or not, can be stored in a non-tabular NoSQL type of database. At the same time, using
the right tools in the form of a schemaless database can unlock the value of all of your
structured and unstructured data types.

6. Define materialized views


A materialized view takes a regular view and materializes it by proactively
computing the results and storing them in a table. A view can be materialized by
storing the tuples of the view in the database, and index structures can be built on
the materialized view.

7. Define distribution models


Aggregate-oriented databases make distribution of data easier, since the distribution
mechanism has to move only the aggregate and does not have to worry about related data, as all
the related data is contained in the aggregate. There are two styles of distributing
data: sharding and replication.

8. Define Cassandra
Cassandra deals with unstructured data and has a flexible schema. Tables or
column families are the entities of a keyspace. A row is the unit of replication in Cassandra,
and a column is the unit of storage.

9. Define Cassandra clients.


A Cassandra client is used to connect to, manage, and develop a Cassandra
database, with actions such as insert, delete, and update on tables.
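A minimal sketch using the DataStax Java driver (4.x) as the Cassandra client; the keyspace, table, and column names (shop.users, name) are assumptions for illustration.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

public class CassandraClientDemo {
    public static void main(String[] args) {
        // With no explicit contact point, the driver defaults to localhost:9042.
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute(
                "INSERT INTO shop.users (id, name) VALUES (uuid(), 'Asha')");
            ResultSet rs = session.execute("SELECT name FROM shop.users");
            for (Row row : rs) {
                System.out.println(row.getString("name"));
            }
        }
    }
}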

10. What is Apache Cassandra?


Apache Cassandra is a highly scalable, high-performance distributed database
designed to handle large amounts of data across many commodity servers, providing high
availability with no single point of failure. It is a type of NoSQL database.

11. List the types of NOSQL databases.


The types of NoSQL databases, with example database systems in each category, are:
- Graph databases: Amazon Neptune, Neo4j
- Key-value stores: Memcached, Redis, Coherence
- Tabular: HBase, Big Table, Accumulo
- Document-based: MongoDB, CouchDB, Cloudant

12. What are the components of Cassandra?


Node, Data center, Cluster, Commit log, Mem-table, SSTable, Bloom filter.

13. Differentiate between relational database and NOSQL database.

Relational Database vs NoSQL Database:
 A relational database supports a powerful query language; a NoSQL database
supports a very simple query language.
 A relational database has a fixed schema; a NoSQL database has no fixed schema.
 A relational database follows ACID (Atomicity, Consistency, Isolation, and
Durability); a NoSQL database is only "eventually consistent".
 A relational database supports transactions; a NoSQL database does not support
transactions.

14. What is sharding and replication?


Sharding: Sharding distributes different data across multiple servers, so each
server acts as the single source for a subset of data.
Replication: Replication copies data across multiple servers, so each bit of data can
be found in multiple places.
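A toy Java sketch of hash-based sharding, with hypothetical server names: each key is routed to exactly one server, which then acts as the single source for that subset of the data.

import java.util.List;

public class ShardRouter {
    private final List<String> servers =
            List.of("node-a", "node-b", "node-c");

    // Route a key to a shard; floorMod avoids negative hash codes.
    public String shardFor(String key) {
        int idx = Math.floorMod(key.hashCode(), servers.size());
        return servers.get(idx);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter();
        System.out.println(router.shardFor("customer:42")); // always the same node
    }
}

Replication would, by contrast, copy the same data to several of these nodes instead of splitting it among them.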

15. What are the types of replication?
Replication comes in two forms:
Master-slave replication makes one node the authoritative copy that handles writes while
slaves synchronize with the master and may handle reads.
Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize
their copies of the data.

16. Define consistency.


Data Consistency in DBMS is the state of data that is consistent across all systems.
Integrity ensures that the data is correct. Consistency ensures that the data format is correct,
or that the data is correct with respect to other data.

17. What is SPOF?


A single point of failure (SPOF) is a part of a system that, if it fails, will stop the
entire system from working. SPOFs are undesirable in any system with a goal of high
availability or reliability, be it a business practice, software application, or other industrial
system.

18. Differentiate between RDBMS and Cassandra.

RDBMS vs Cassandra:
 RDBMS deals with structured data and a fixed schema; Cassandra deals with
unstructured data and a flexible schema.
 In RDBMS, a table is an array of arrays (ROW x COLUMN); in Cassandra, a table
is a list of "nested key-value pairs" (ROW x COLUMN key x COLUMN value).
 In RDBMS, the database is the outermost container that contains the data
corresponding to an application; in Cassandra, the keyspace is the outermost
container.
 Tables are the entities of a database in RDBMS; tables or column families are
the entities of a keyspace in Cassandra.

19. What is keyspace in Cassandra?


Keyspace is the outermost container for data in Cassandra. The basic attributes of a
keyspace in Cassandra are the replication factor, the placement strategy, and the column families.

20. Is Cassandra a SQL database?


No. Cassandra is a NoSQL database, which means it does not use the traditional table
structure found in SQL databases. Cassandra is designed to be scalable and can handle large
amounts of data. This can make Cassandra more flexible and easier to use for certain types of data.

PART B
1. Explain in detail about NOSQL databases.

 Definition
 Introduction and key features of NoSQL
 Features of NoSQL Databases
 Types of NoSQL database
 Advantages of NoSQL
 Disadvantages of NoSQL

2. Explain in detail about Aggregate Data Model.


 Definition
 Example of Aggregate Data Model:
 Consequences of Aggregate Orientation
 Advantage
 Disadvantage

3. Write short notes on key-value and document data models


 Definition
 Features:
 Advantages:
 Disadvantages:
 Some examples of key-value databases:
1. Couchbase
2. Amazon DynamoDB
3. Riak.
4. Aerospike
5. Berkeley DB

4. Explain in detail about graph databases


 Definition:
 Types of Graph Databases:
 Property Graphs:
 RDF Graphs:
 Example of Graph Database:
 Advantages of Graph Database:
 Disadvantages of Graph Database:
 Future of Graph Database:
5. Explain in detail about Cassandra data model.
 Definition
 Components of Cassandra
 Column Family, Super Column
 Data Models of Cassandra and RDBMS
 Cassandra clients
 NoSQL vs. Relational Database
 Apache Cassandra
 Features of Cassandra

UNIT-III
MAP REDUCE APPLICATIONS

Part A
1. What is map reduce?
MapReduce is a programming model with two phases: a Map phase and a Reduce phase.
The Map phase is used for transformation, while the Reduce phase is used for aggregation
kinds of operation. The terms Map and Reduce are derived from functional
programming languages like Lisp, Scala, etc.

2. What are the components of map reduce?


1. Driver code,
2. Mapper (for transformation), and
3. Reducer (for aggregation).

3. What is shuffling and sorting?


The data is shuffled between and within nodes so that it moves out of the map phase and
is ready to be processed by the reduce function. Sometimes, the shuffling of data can take
much computation time. The sorting operation is performed on the intermediate data before
the Reduce function runs.

4. What is reduce function?


The Reduce function is applied to each unique key. The keys are already
arranged in sorted order. The Reduce function iterates over the values associated with each
key and generates the corresponding output.

5. What is YARN?
YARN stands for "Yet Another Resource Negotiator". It was introduced in
Hadoop 2.0 to remove the bottleneck of the Job Tracker, which was present in Hadoop 1.0.
The YARN architecture separates the resource management layer from the processing
layer.

6. What is MRUnit?
MRUnit is a JUnit-based Java library that allows us to unit test Hadoop
MapReduce programs. This makes it easy to develop as well as to maintain Hadoop
MapReduce code bases. MRUnit supports testing Mappers and Reducers separately as well
as testing MapReduce computations as a whole.
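A minimal MRUnit sketch that tests a word-count mapper in isolation; the mapper is a hypothetical example defined inline so the test stays self-contained.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class TokenizerMapperTest {

    // Minimal mapper under test (hypothetical), defined inline.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String tok : value.toString().split("\\s+")) {
                word.set(tok);
                ctx.write(word, ONE);
            }
        }
    }

    @Test
    public void emitsOneCountPerWord() throws IOException {
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new TokenizerMapper());
        // MRUnit feeds the mapper one record and checks the expected output.
        driver.withInput(new LongWritable(0), new Text("big data"))
              .withOutput(new Text("big"), new IntWritable(1))
              .withOutput(new Text("data"), new IntWritable(1))
              .runTest();
    }
}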

7. What are the features of YARN?


YARN gained popularity because of the following features:
o Scalability
o Compatibility
o Cluster Utilization
o Multi-tenancy

8. What are the three types of failures in mapreduce?


 Task failure
 TaskTracker Failure
 JobTracker Failure

9. What are the reasons for task failures?
 Limited memory: a task can fail if it runs out of memory while processing data.
 Disk failure: if the disk that stores data or intermediate results fails, tasks that
depend on that data may fail.

10. What are the types of job scheduling in mapreduce?


 FIFO
 Capacity scheduler
 Fair scheduler

11. What are the types of Mapreduce?


 Input formats
 Output formats

12. What is mapping?


Mapping is the core technique of processing a list of data elements that come in pairs
of keys and values. The map function applies to individual elements defined as key-value
pairs of a list and produces a new list.
13. Define data locality.
Map-Reduce comes with a feature called Data-Locality. Data Locality is the
potential to move the computations closer to the actual data location on the
machines.
14. What are the industry examples of big data?
 Retail
 Banking
 Manufacturing
 Education
 Government
 Health Care
15. What is the data flow in mapreduce?
 Input reader
 Map function
 Partition function
 Shuffling and sorting
 Reduce function
 Output writer

16. How do you call a MapReduce job?

A job can be run with a single method call: submit() on a Job object. You can also
call waitForCompletion(), which submits the job if it has not been submitted already
and then waits for it to finish.
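A hedged sketch of a driver that configures a job and calls waitForCompletion(), assuming the TokenizerMapper and IntSumReducer classes from the word-count sketch in Unit I; the job name and paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submits the job if not already submitted, then waits for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}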

17. What is shuffling and sorting?


Shuffling is the process of transferring the mappers' intermediate output
to the reducers. A reducer gets one or more keys and their associated values. The
intermediate key-value pairs generated by the mapper are automatically sorted
by key. In the sort phase, merging and sorting of the map output takes place.

18. What is heartbeat in bigdata?


A 'heartbeat' is a signal sent between a DataNode and the NameNode. This signal is
taken as a sign of vitality. If there is no response to the signal, it is understood that there
are health issues or technical problems with the DataNode or the TaskTracker.

19. What is scaling out in big data?


Horizontal scaling (scaling out or in) means adding more machines, or dividing
a large database into smaller nodes using a data partitioning approach called sharding,
so that data can be managed faster and more easily across servers.

20. What is Hadoop streaming?


Hadoop streaming is a utility that comes with the Hadoop distribution. The
utility allows you to create and run Map/Reduce jobs with any executable or script as the
mapper and/or the reducer.

PART B

1. Explain in detail about Hadoop MapReduce framework.


 Definition
 Mapreduce
 Steps of Data flow diagram
 Brief Working of Mapper
 Brief Working Of Reducer

2. Explain in detail about the architecture of YARN .


 Definition
 YARN
 Scalability
 Compatibility
 Cluster Utilization
 Multi-tenancy
 Hadoop YARN Architecture
The main components of YARN architecture include:
1. Client
2. Resource Manager
3. Node Manager
4. Application Master
5. Container
 Advantages
 Disadvantages

3. Discuss the failures in classic Map-reduce

 Definition
 Failures in map reduce
 Task failure
 Tasktracker failure
 Jobtracker failure
 Reasons for task failure
 How to overcome Task failure.

4. Discuss about Job Scheduling in MapReduce

 Definition
 Map reduce algorithm
 Hadoop Schedulers
 Types of job scheduling in Mapreduce
 FIFO
 Capacity scheduler
 Fair Scheduler
 Advantages
 Disadvantages

5. Explain in details about MapReduce types.

 Definition:
 MapReduce Types
 The Java API:
 Input Formats
 Output Formats

UNIT IV

BASICS OF HADOOP

PART A

1. Define Hadoop.

Hadoop is an open-source framework that allows you to store and process big data in a
distributed environment across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage.

2. What are the three components of Hadoop?

 Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit
of Hadoop.
 Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop.
 Hadoop YARN - Hadoop YARN is a resource management unit of Hadoop.

3. What are the modules of Hadoop?

Hadoop is made up of 4 core modules:


 the Hadoop Distributed File System (HDFS),
 Yet Another Resource Negotiator (YARN),
 Hadoop Common and
 MapReduce

4. What is scaling out?


Scaling out, or horizontal scaling, is basically the addition of more
machines or setting up a cluster. In horizontal scaling, instead of
increasing the hardware capacity of an individual machine, you add more nodes
to the existing cluster; most importantly, you can add more machines
without stopping the system.

5. What are the types of scaling?

Vertical scaling: adding more resources to a single machine when the load increases.
For example, if you need 20 GB of RAM but your server currently has 10 GB, you add
extra RAM to the same server to meet the need.

Horizontal scaling, or scaling out: adding more machines to match the resource need.
So if a machine already has 10 GB of RAM, an extra machine with 10 GB of RAM is
added.

6. Define Hadoop streaming.
Hadoop Streaming is a part of the Hadoop distribution. It
facilitates the writing of MapReduce programs and supports almost all
programming languages, such as Python, C++, Ruby, and Perl. The entire
Hadoop Streaming framework runs on Java.

7. Define Avro.
Avro is a language-neutral data serialization system. It was developed by
Doug Cutting, the father of Hadoop. Since Hadoop writable classes lack language
portability, Avro becomes quite helpful, as it deals with data formats that can be
processed by multiple languages. Avro is a preferred tool to serialize data in
Hadoop.
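A minimal Avro sketch in Java: parse a schema, build a GenericRecord, and write it in Avro's language-neutral binary container format. The schema, field names, and output file name are illustrative.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroDemo {
    public static void main(String[] args) throws Exception {
        // A record schema defined in Avro's JSON schema language.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Asha");
        user.put("age", 30);

        // Write the record to a self-describing Avro container file.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
    }
}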

8. What is Hadoop Pipes?


Hadoop Pipes is a C++ API that enables developers to write MapReduce programs
for Hadoop. The API allows the use of existing C++ libraries with Hadoop and enables the
development of high-performance MapReduce programs. Hadoop Pipes is a popular choice
for developers who need to process large data sets efficiently.

9. Define DFS.
DFS stands for distributed file system; it is a concept of storing a file on
multiple nodes in a distributed manner. DFS provides the abstraction of a single
large system whose storage is equal to the sum of the storage of the other nodes in a cluster.
10. What are the types of File-based data structures?
There are two file formats:
1. SequenceFile
2. MapFile
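A minimal sketch of writing a SequenceFile (the first of the two formats above) with the Hadoop Java API; the path and records are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("numbers.seq");
        // A SequenceFile stores binary key-value pairs.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (int i = 0; i < 3; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        }
    }
}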

11. What is a Java interface?

In Java, an interface specifies the behavior of a class by providing an abstract type.
It is one of Java's core concepts: abstraction, polymorphism, and multiple inheritance
are supported through interfaces. Interfaces are used in Java to achieve abstraction.
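A tiny sketch of a Java interface and an implementing class; the names are illustrative only.

interface Storage {
    void put(String key, byte[] value);   // behavior only, no implementation
    byte[] get(String key);
}

class InMemoryStorage implements Storage {
    private final java.util.Map<String, byte[]> data = new java.util.HashMap<>();

    @Override
    public void put(String key, byte[] value) { data.put(key, value); }

    @Override
    public byte[] get(String key) { return data.get(key); }
}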

12. What is compression?

Java object compression is done using the GZIPOutputStream class (this class
implements a stream filter for writing compressed data in the GZIP file format), which is
passed to the ObjectOutputStream class (this class extends OutputStream and implements
ObjectOutput and ObjectStreamConstants) so that serialized objects are compressed as
they are written.

13. What is serialization?

Serialization in Java is a mechanism of writing the state of an object into a byte-stream.


It is mainly used in Hibernate, RMI, JPA, EJB, and JMS technologies.
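A hedged sketch combining the two previous answers: the object's state is serialized with ObjectOutputStream and the resulting byte stream is compressed with GZIPOutputStream. The User record and file name are hypothetical.

import java.io.FileOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPOutputStream;

public class SerializeDemo {
    // Marking the record Serializable lets its state be written as a byte stream.
    record User(String name, int age) implements Serializable {}

    public static void main(String[] args) throws Exception {
        try (ObjectOutputStream out = new ObjectOutputStream(
                 new GZIPOutputStream(new FileOutputStream("user.gz")))) {
            out.writeObject(new User("Asha", 30)); // serialize + compress in one pass
        }
    }
}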

14. Define Hadoop integration.

Hadoop architecture is designed to be easily integrated with other systems.

Integration is very important because, although we can process data efficiently in Hadoop,
we should also be able to send the result to another system to move the data to another
level.

15. What are the advantages and disadvantages of Hadoop?

Advantages:
 Cost
 Flexibility
 Speed
 Scalability
 Fault tolerance

Disadvantages:
 Vulnerability
 Lack of security
 High processing overhead
 Supports only batch processing

16. What are the modes of operation in Hadoop?


 Standalone Mode
 Pseudo-distributed Mode
 Fully-Distributed Mode

17. What are the types of schedulers in Hadoop?


 FIFO (First In First Out) Scheduler.
 Capacity Scheduler.
 Fair Scheduler.

18. Where is Hadoop used in real life?


Hadoop can be used to analyze transaction data to detect fraudulent activities and
improve risk management. It can also be used to analyze market data to identify investment
opportunities and make informed trading decisions.

19. Which database is used in Hadoop?

Hadoop is not a type of database, but rather a software ecosystem that allows for
massively parallel computing. It is an enabler of certain types of NoSQL distributed databases
(such as HBase), which can allow data to be spread across thousands of servers with little
reduction in performance.

20. What language is Hadoop written in?


The Hadoop framework itself is mostly written in the Java programming language,
with some native code in C and command-line utilities written as shell scripts.

PART B

1. Explain in detail about file-based data structures.

 Definition
 File-based data structures
Two file formats:
1. SequenceFile
2. MapFile
 Read/write SequenceFile and MapFile
 Benefits of the SequenceFile/MapFile formats
 Disadvantages of the SequenceFile/MapFile formats

2. Explain in detail about integration of Hadoop.


 Definition
 R integration with Hadoop and Diagram
 R Hadoop Integration Method
 Hadoop Streaming
 RHIPE, ORCH

3. Explain in detail about HDFS architecture.

 Definition
 Types
 Design of the Hadoop distributed file system (HDFS)
 Distributed file system processing
 Overview of HDFS
 Read and write operations in HDFS
 Hadoop Streaming and Pipes

4. Explain in detail about Hadoop I/O.

 Definition
 Data integrity
 Hadoop local file system
 Compression
 Serialization
 Avro

5. Explain in detail about Hadoop streaming and Pipes

 Definition
 Features and Code execution process
 Hadoop pipes
 Execution of streaming and pipes

UNIT-V
HADOOP RELATED TOOLS
PART A

1. Define Hbase.

HBase is an open-source, sorted map datastore built on Hadoop. It is column-oriented
and horizontally scalable. It is based on Google's Bigtable.

2. Define hbase clients.

HBase clients use the Java client API for HBase to perform CRUD operations on
HBase tables. HBase is written in Java and has a Java Native API, which provides
programmatic access to the Data Manipulation Language (DML).
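A hedged sketch of CRUD with the HBase Java client API; the table name, column family, and values are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Create/update: a Put writes one row's cells.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"),
                          Bytes.toBytes("Asha"));
            table.put(put);

            // Read: a Get fetches the row back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))));
        }
    }
}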

3. Define pig latin data model.

The Pig Latin data model allows Pig to handle any kind of data. The Pig Latin data model
is fully nested and can handle both atomic types, like integer and float, and non-atomic
complex data types, such as map and tuple.

4. What is pig latin?

Pig is a high-level data flow platform for executing MapReduce programs on Hadoop.
It was developed by Yahoo. The language for Pig is Pig Latin.

5. Define pig latin Scripts.

The Pig scripts get internally converted to Map Reduce jobs and get executed on data
stored in HDFS. Apart from that, Pig can also execute its job in Apache Tez or Apache
Spark.

6. What is pig latin statements?

Pig Latin statements are used to process the data. Each statement is an operator that
accepts a relation as input and generates another relation as output.

7. What are the types of modes in pig?

Apache Pig executes in two modes: Local Mode and MapReduce Mode.

8. Define HiveQL queries.

HiveQL is a query language for Hive used to analyze and process structured data in a
metastore. It is a mixture of SQL-92, MySQL, and Oracle's SQL. It is very similar to SQL
and highly scalable.

9. Define HIVE.

Hive is a data warehouse system which is used to analyze structured data. It is built
on the top of Hadoop. It was developed by Facebook.

10. What are the components involved in processing Pig Latin scripts?

 Parser
 Optimizer
 Compiler
 Execution engine

11. Why Pig is used in big data?

Pig represents Big Data as data flows. Pig is a high-level platform or tool which is
used to process large datasets. It provides a high level of abstraction for processing over
MapReduce, and a high-level scripting language, known as Pig Latin, which is
used to develop data analysis code.

12. What are the features of pig data model?

Pig allows programmers to write fewer lines of code. Apache Pig's multi-query approach
reduces development time. Apache Pig has a rich set of operators for performing operations
like join, filter, sort, load, group, etc. The Pig Latin language is very similar to SQL.

13. Define HiveQL data definition language.

Hive Data Definition Language (DDL) is a subset of Hive SQL statements that
describe the data structure in Hive by creating, deleting, or altering schema objects such as
databases, tables, views, partitions, and buckets. Most Hive DDL statements start with the
keywords CREATE, DROP, or ALTER.

14. Define HiveQL data manipulation language

Hive DML (Data Manipulation Language) commands are used to insert, update,
retrieve, and delete data from a Hive table once the table and database schema have been
defined using Hive DDL commands. The Hive DML commands include LOAD, SELECT,
INSERT, UPDATE, and DELETE.

15. What are the types of file formats in Hive?

 TEXTFILE.
 SEQUENCEFILE.

 RCFILE.
 ORC.
 PARQUET.
 AVRO.
16. What are the clients of HBase?
Kundera (the object mapper), the REST client, the Thrift client, and the Hadoop
ecosystem client.

17. What are HiveQueries?


Hive enables data summarization, querying, and analysis of data. Hive queries are
written in HiveQL, which is a query language similar to SQL. Hive allows you to project
structure on largely unstructured data. After you define the structure, you can use HiveQL to
query the data without knowledge of Java or MapReduce.
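A hedged sketch of running a HiveQL query from Java over JDBC against HiveServer2; the connection URL, credentials, table, and column names are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (auto-loaded on newer JDBC versions).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}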

18. What is Pig Grunt?


Apache Pig Grunt is an interactive shell that enables users to enter Pig
Latin interactively and provides a shell to interact with HDFS and local file system
commands. You can enter Pig Latin commands directly into the Grunt shell for execution.

19. What is Praxis in big data?


Praxis is driven by the purpose of creating resources that will lead India's transformation
into the tech and data-driven digital world.

20. What are the data types used in HIVE?


 Column Types
 Literals
 Null Values
 Complex Types

PART B

1. Explain in detail about the architecture of HBASE with neat diagram.

 Definition
 Architecture diagram
 Features
 Hbase Read and Write
 Advantages and Disadvantages of HBASE.

2. Explain in detail about the architecture of PIG with neat diagram.

 Definition
 Architecture diagram
 Features

 Pig Latin, Conventions, Statements
 Advantages and Disadvantages of PIG
3. Explain in detail about the architecture of HIVE with neat diagram.

 Definition
 Architecture diagram
 Features
 Limitations
 Advantages and Disadvantages of HIVE

4. Explain in detail about the data types of HIVE with examples.

 Definition
 Types
 Syntax and examples.

5. Explain in detail about the Pig latin statements and conventions with examples.

 Definition
 Types
 Tabular column
 Syntax and examples

*****************************************************************
