BDA Final Notes

Unit – 1

Q] Explain types of data


1] Structured Data: This type of data is highly organized and easy to store, access, and analyze.
It fits neatly into tables, columns, and rows, where both the fields (attributes) and their datatypes
are predefined.
Example: Data in relational databases such as MySQL, Oracle, or Microsoft SQL Server.
2] Semi-structured data : Semi-structured data has some level of organization, but not as rigid
as structured data. The fields or attributes are known, but the datatypes might not be strictly
defined or consistent. It often doesn't follow the strict tabular structure of a relational database.
Example: CSV (Comma Separated Values) files or JSON data
3] Unstructured Data: Unstructured data doesn't have a predefined structure or consistent
format. It doesn't fit easily into tables and may not have clear fields or attributes.
Example: Text files, images, videos, social media posts, or logs generated by a server
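
For illustration of the semi-structured case above, here is a small, hypothetical JSON record (the field names and values are made up): the fields are identifiable, values can be nested or missing, and no rigid table schema is enforced.

{
  "user_id": 101,
  "name": "Asha",
  "interests": ["cricket", "music"],
  "address": { "city": "Mumbai" }
}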

Q] What is distributed computing system


1] A distributed system is a group of networked computers that work together to achieve a
common goal, dividing the workload among them.
2] Its main purpose is to speed up tasks by using multiple computers simultaneously.
3] However, not all tasks can benefit from distributed computing.

Q] What is big data?


In very simple words, Big Data is data of very big size that cannot be processed with the usual
tools. To process such data, we need a distributed architecture. This data could be structured
or unstructured.
Q] Explain characteristics of Big data
Volume
1] The name Big Data itself is related to an enormous size.
2] Big Data is a vast 'volumes' of data generated from many sources daily, such as business
processes, machines, social media platforms, networks, human interactions, and many more.
3] Facebook generates approximately a billion messages, around 4.5 billion "Like" button
records, and more than 350 million new posts each day. Big data technologies can handle such
large amounts of data.

Variety
1] Big Data can be structured, unstructured, and semi-structured that are being collected from
different sources.
2] In the past, data was collected only from databases and spreadsheets, but these days data
comes in many forms, such as PDFs, emails, audio, social media posts, photos, videos, etc.

Velocity
1] Velocity refers to the speed at which data is created in real time.
2] It covers the speed of incoming data sets, their rate of change, and bursts of activity.
3] Big data velocity deals with the speed at which data flows in from sources like application logs,
business processes, networks, social media sites, sensors, mobile devices, etc.

Q] Why do we need big data


We need Big Data because modern devices and applications generate massive amounts of
information that traditional systems can't handle efficiently. Analyzing this data provides
valuable insights for decision-making. As data grows, processing can hit limits in CPU, RAM,
storage, or networks, requiring distributed computing to manage and process it faster.

Q] What is apache hadoop


1] Apache Hadoop is an open-source framework designed to handle large-scale data processing
using a distributed computing model.
2] Hadoop is under the Apache license, allowing free usage without licensing concerns.
3] Hadoop is powerful, popular, and widely supported, making it a key tool for managing Big
Data.
4] Hadoop is written in Java so that it can run on all kinds of devices.

Three Characteristics of hadoop


1] Distributed: It uses multiple machines to solve problems.
2] Scalable: New machines can be easily added to the system.
3] Reliable: It continues to function even if some machines fail.

Q] Explain apache hadoop ecosystem

1] HDFS (Hadoop Distributed File System) : It is the core of Hadoop, providing massive
storage across multiple computers, allowing the storage of petabytes of data in files. It's based on
Google's File System.
2] YARN (Yet Another Resource Negotiator) : It manages the resources (CPU, memory)
across the network and runs distributed applications.
3] HBase : It is a NoSQL database that provides huge storage in the form of database tables. It's
ideal for managing large volumes of records, offering scalable and efficient storage for big
datasets.
4] MapReduce : It is a distributed computing framework that uses YARN to run tasks and has
an efficient sorting engine. Programs are written in two parts:
Map: Transforms raw data into key-value pairs.
Reduce: Groups and combines data based on keys.
5] Spark : Spark is a faster, more recent computational framework similar to MapReduce for
solving Big Data problems. It uses similar concepts but processes data more quickly and
efficiently. Spark also has its own large ecosystem, which will be covered in detail later.
6] Hive : Apache Hive lets you write SQL queries instead of complex MapReduce code. It
converts these SQL queries into MapReduce jobs, making it easier and faster to process large
structured or semi-structured data.
7] Pig Latin : Pig Latin is a simple, SQL-like language for expressing ETL tasks step by step. Pig
is the engine that translates Pig Latin into MapReduce and runs it on Hadoop for big data
processing.
8] Mahout : Mahout is a library for distributed machine learning algorithms. It breaks down
complex tasks to run efficiently using MapReduce on multiple machines.
9] Apache Zookeeper : It is a coordination tool for distributed systems like HDFS, HBase,
Kafka, and YARN. It provides services for configuration management, synchronization, and
naming in large distributed environments.
10] Flume : It helps collect unstructured data from multiple sources and send it to a central
location like HDFS. It's useful for gathering data from sources like web server logs and
aggregating it in a single place.
11] Sqoop : It transfers data between Hadoop and SQL databases. It uses MapReduce to move
data efficiently across multiple machines in a distributed network.
12] Oozie : It is a workflow engine that manages and executes tasks in sequence. It helps
automate complex workflows, like importing data, processing with Hive, using Mahout for
predictions, and saving back to SQL databases.
Unit – 2

Q] What is HDFS?

1] HDFS is designed to store very large files efficiently.


2] It runs on clusters of ordinary (commodity) hardware, making it cost-effective.
3] It handles files that are hundreds of megabytes to terabytes in size, and can even manage
petabytes of data.
4] Data Access Pattern:
Write-Once: Files are typically written to HDFS only once.
Read-Many-Times: After writing, the files are read multiple times for various analyses.

Q] Explain why hadoop has DFS?


Ans

1] A single machine may take a long time, such as 4 hours, to process a very large file (e.g., 40 TB).
2] DFS splits the large file into smaller chunks and distributes these chunks across multiple
nodes (e.g., 4 nodes). Each node processes its assigned chunk simultaneously.
3] By working in parallel, DFS can process the entire 40TB file in a shorter time (e.g., 1 hour)
compared to a single system.
4] DFS can handle much larger files and more data by simply adding more nodes to the cluster.
5] DFS provides redundancy and fault tolerance, so if one node fails, other nodes can continue
processing.

Q] Explain the areas where HDFS is not good fit


Ans
1] Low-latency data access : HDFS isn’t good for applications needing very fast data access
(tens of milliseconds).
2] Many Small Files: HDFS struggles with managing a large number of small files due to
memory limits on the NameNode.
3] Multiple Writers/Modifications: HDFS doesn’t support simultaneous writing by multiple users
or changes at random points in a file; it only allows appending data at the end.

Q] Explain components of HDFS


Ans

NameNode

1] Manages all the slave nodes and assigns work to them.


2] Manages the file system’s namespace and operations like opening, closing, and renaming files
and directories.
3] It should be deployed on reliable hardware with a high-end configuration, not on commodity hardware.

DataNode

1] It is the actual worker; it performs data operations such as reading, writing, and processing.
2] Handles data storage, replication, and deletion based on instructions from the NameNode.
3] Can be deployed on commodity hardware.

HDFS Daemons

1] Daemons are the processes running in background.


2] NameNode Daemon: Runs on the master node, stores metadata (e.g., file paths, block IDs) in
RAM for quick access, and also keeps a backup on disk. Requires a lot of RAM.
3] DataNode Daemon: Runs on slave nodes; requires a large amount of storage space, as the data
is actually stored here.

Q] Explain how the data stored in HDFS


ANs

1] When a large file (e.g., 100TB) is uploaded, the NameNode (master node) divides it into
smaller blocks (e.g., 10TB each, though the default block size is 128 MB in Hadoop 2.x and
above).
2] These blocks are distributed across various DataNodes (slave nodes) in the cluster.
3] Each block is replicated multiple times for reliability. By default, each block has 3 replicas,
meaning each block is stored on three different DataNodes.
4] The number of replicas can be adjusted by editing the hdfs-site.xml configuration file.
5] The NameNode keeps track of all blocks and their locations.
6] It knows which DataNodes store which blocks and manages all data-related tasks.
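
As a sketch of the configuration change mentioned above, the replication factor is controlled by the dfs.replication property in hdfs-site.xml (the value shown is just the usual default of 3; treat the snippet as illustrative):

<configuration>
  <property>
    <!-- number of replicas kept for each HDFS block -->
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>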

Q] Explain the following terms


1] Heartbeat : A regular signal sent by DataNodes to the NameNode, indicating that the
DataNode is active and functioning. If the NameNode stops receiving heartbeats from a
DataNode, it considers that DataNode failed or dead.

2] Balancing : The process of maintaining the proper distribution of data blocks across the
cluster. If a DataNode fails, the blocks it held become unavailable, leading to under-replication.
The NameNode detects this imbalance and instructs other DataNodes to replicate the lost blocks,
restoring the desired replication level and balancing the data distribution.

3] Replication : The process of creating copies of data blocks for redundancy and fault
tolerance. DataNodes create and manage these replicas based on instructions from the
NameNode, ensuring that data remains available even if some nodes fail.

Q] Explain features of HDFS


ANs

1] Distributed Data Storage: Splits data into blocks stored across multiple nodes, enabling
efficient and scalable data management.

2] Reduced Seek Time: Smaller blocks improve access speed and efficiency for large files.

3] High Availability: Data is replicated across multiple nodes, ensuring it remains accessible
even if some nodes fail.

4] High Reliability: The system continues to function and access data even if multiple nodes are
down.

5 ] High Fault Tolerance: Designed to handle hardware failures by replicating data and
monitoring node health.

Q] Explain sqoop with its features


Ans

1] Apache Sqoop is an open-source tool designed to transfer data from structured databases (like
SQL) into Hadoop for processing.
2] The data transferred to Hadoop can be processed using tools such as: MapReduce programs,
Hive, Pig, Spark
3] Sqoop can automatically create Hive tables from the data it imports from an RDBMS
(Relational Database Management System) table.
4] Sqoop is also capable of exporting data from Hadoop back into relational databases, useful for
moving processed data into operational systems.

Features

1] Bulk Import: Sqoop can import entire databases or individual tables into HDFS, supporting
large-scale data transfers.
2] Parallelization: It speeds up data transfer by parallelizing the process for better system
performance.
3] Direct Input: Sqoop directly imports data into HBase and Hive, making it easy to map
relational databases.
4] Efficient Data Analysis: It streamlines the process of analyzing imported data in Hadoop.
5] Load Mitigation: Sqoop reduces the load on external systems during data transfer.
6] Java Classes Generation: Sqoop generates Java classes for programmatic data interaction.
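
As a minimal sketch of the bulk import described above (the database, table, and connection details here are hypothetical placeholders):

sqoop import \
  --connect jdbc:mysql://dbhost/retaildb \
  --username dbuser --password dbpass \
  --table customers \
  --target-dir /user/cloudera/customers \
  -m 4

The -m option sets how many parallel map tasks Sqoop uses for the transfer; sqoop export with --export-dir moves processed data back from HDFS to the relational database in the same style.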

Q] Explain Apache Flume with its features


Ans

1] Apache Flume is ideal for streaming logs into the Hadoop environment.
2] Flume is designed to collect and aggregate vast amounts of log data efficiently.
3] It is a reliable, distributed service that ensures data collection across different sources.
4] Flume has an easy-to-use architecture based on streaming data flows.
5] It includes tunable reliability mechanisms, as well as recovery and failover options, to ensure
consistent performance.

Features of Apache flume

1] Scalability: Flume can scale from small environments (5 machines) to large ones (thousands
of machines), making it flexible.
2] High Performance: It offers high throughput and low latency for efficient data transfer.
3] Extensibility: Despite having a declarative configuration, Flume is easy to extend.
4] Fault-Tolerant: Flume is fault-tolerant, ensuring data reliability.
5] Stream-Oriented: It is optimized for handling continuous data streams.
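
A minimal sketch of a Flume agent configuration, assuming a hypothetical agent named a1 that tails a web server log and writes the events to HDFS (all paths are placeholders):

# name the source, channel, and sink of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# source: tail a local web server log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

# channel: buffer events in memory
a1.channels.c1.type = memory

# sink: write the events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/weblogs
a1.sinks.k1.channel = c1

The agent would then be started with something like: flume-ng agent --conf-file weblog.conf --name a1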

Q] Difference between Apache sqoop and apache flume


Ans

Apache sqoop
1] It Works with relational databases and NoSQL databases
2] The Sqoop load is not driven by events.
3] Ideal for data in JDBC-compatible databases (e.g., MySQL, Oracle)
4] Imports data directly to HDFS
5] Connector-based architecture
6] Fetches structured data using connectors
7] Parallel data transfers and quick imports

Apache Flume
1] It works with streaming data sources that are generated continuously in Hadoop
environments.
2] Data loading is completely event-driven.
3] Best for bulk streaming data (e.g., logs)
4] Data flows to HDFS through channels
5] Agent-based architecture
6] Fetches streaming data from sources like logs
7] Collecting and aggregating data reliably

Q] Explain Data serialization


Ans

1] Data serialization is the process of converting data into a format that can be easily saved or
transmitted, and then converting it back to its original form when needed.
2] It allows data to be stored in databases or sent over networks regardless of the system being
used.
3] Serialization translates data into a stream of bytes, while deserialization converts it back.
4] Different formats like CSV, XML, Avro, and JSON are used to store and exchange data
efficiently.
5] Proper serialization helps avoid issues like incorrect data interpretation, ensuring accurate and
effective data handling across various systems.
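
A small Python sketch of the round trip described above, using the JSON format as an example:

import json

record = {"id": 101, "name": "Asha", "score": 88.5}

# serialization: in-memory object -> stream of bytes for storage or transmission
payload = json.dumps(record).encode("utf-8")

# deserialization: stream of bytes -> the original object
restored = json.loads(payload.decode("utf-8"))

print(restored == record)   # True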

Q] Explain common HDFS file formats


Ans

1] Sequence file : It is a binary file format that stores key-value pairs. It is compact and
splittable.

2] Avro : It is a binary serialization format that encodes data with rich structures and supports
schema evolution.
3] Parquet: It is a columnar storage file format that organizes data by columns for efficient
analytics.

4] ORC (Optimized Row Columnar): It is a columnar format optimized for high read performance.

5] TextFile: It is a simple text format where each line represents a record.

6] HBase: It is a NoSQL database with its own storage format, organizing data into tables with
rows and columns.

7] RCFile (Record Columnar File): It is a columnar storage format designed for fast data loading
and query performance.

8] JSON and XML: Formats for semi-structured or hierarchical data.

Q] Write commands for mounting

sudo su
mount -t vboxsf sharedfolder /home/cloudera/Desktop/Windows

Q] Commands

1. hadoop fs -ls

Description: List the contents of a directory in HDFS.

2. hadoop fs -put

Description: Upload a file from the local file system to HDFS.

3. hadoop fs -get

Description: Download a file from HDFS to the local file system.


4. hadoop fs -mkdir

Description: Create a directory in HDFS

5. hadoop fs -rm

Description: Delete a file or directory from HDFS

6. hadoop fs -mv

Description: Move a file or directory within HDFS

7. hadoop fs -cp

Description: Copy files or directories within HDFS.

8. hadoop fs -cat

Description: Display the contents of a file in the terminal.

9. hadoop fs -tail

Description: Display the last kilobyte of a file to the terminal.

10. hadoop fs -du

Description: Display the size of files and directories in HDFS.

Unit – 3
Q] Explain Mapreduce
MapReduce is a programming model that processes large datasets by dividing them into smaller chunks,
processing them simultaneously, and then combining the results.

 Map: Data is split into parts and processed to generate key-value pairs.
 Reduce: Key-value pairs are grouped by key and combined to produce the final result.

Q] Why we need Mapreduce


 Scalability: MapReduce can handle huge datasets by running tasks across many machines at
once, making it ideal for big data.

 Fault Tolerance: If a machine fails, MapReduce automatically manages the failure and
reruns the task on another machine, keeping processing uninterrupted.

 Ease of Programming: It abstracts away the complex parts of distributed computing, so
developers can focus on writing the map and reduce functions without worrying about
parallelization, data distribution, or failures.

 Flexibility: MapReduce works with a wide variety of data types and tasks, making it
adaptable for different data processing needs.

 Performance: By processing data in parallel on multiple machines, MapReduce speeds up
data processing significantly compared to traditional, single-machine methods.

Q] Explain Mapreduce algorithm


Ans

MapReduce processes data through these three main stages:

1. Map Stage:
o The mapper function reads input data stored in the Hadoop Distributed File
System (HDFS), usually as files or directories.
o Each line of input is processed by the mapper, which breaks it down into smaller
data chunks.
o The mapper then generates intermediate key-value pairs, which are passed on for
the next stage.
2. Shuffle Stage:
o In this stage, the intermediate key-value pairs from the mappers are sorted and
grouped by key.
o This grouping organizes data so that each reducer can focus on a single key with
all associated values.
o Data is then transferred to the appropriate reducer nodes.
3. Reduce Stage:
o The reducer function processes the grouped data from the shuffle stage, applying
operations like aggregation or transformation.
o It generates the final output, which is stored back in HDFS for easy access.
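
To make the three stages concrete, here is a minimal word-count sketch in plain Python that imitates the map, shuffle, and reduce steps (an illustration of the idea, not actual Hadoop code):

from collections import defaultdict

def mapper(line):
    # Map: emit an intermediate (word, 1) pair for every word in the line
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group the intermediate values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce: combine all values for one key
    return (key, sum(values))

lines = ["big data is big", "data is everywhere"]
pairs = [pair for line in lines for pair in mapper(line)]
result = [reducer(k, v) for k, v in sorted(shuffle(pairs).items())]
print(result)   # [('big', 2), ('data', 2), ('everywhere', 1), ('is', 2)]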
Q] Example
Q] Explain anatomy of shuffling and sorting

 Shuffling: This is the process of moving the mapper’s output to the reducers as input. It
groups data by keys so each reducer receives all data for a given key. Shuffling starts even before
all map tasks are complete, speeding up the job.
 Sorting: The MapReduce framework automatically sorts all keys produced by the mappers
before they reach the reducers. This sorted order helps reducers differentiate when a new reduce
task should start (when the key changes), making the reduce phase more efficient.

Q] Explain following terminology


Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode − Node that manages the Hadoop Distributed File System (HDFS).
DataNode − Node where data is presented in advance before any processing takes place.
MasterNode − Node where JobTracker runs and which accepts job requests from clients.
SlaveNode − Node where Map and Reduce program runs.
JobTracker − Schedules jobs and tracks the jobs assigned to the TaskTracker.
Task Tracker − Tracks the task and reports status to JobTracker.
Job − A program is an execution of a Mapper and Reducer across a dataset.
Task − An execution of a Mapper or a Reducer on a slice of data.
Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.

Q] Example
Unit-4

BigSQL
Q] What is Big SQL
Ans
1] IBM Big SQL is a powerful SQL engine designed for Hadoop
2] It allows users to efficiently query and analyze large amounts of data from various sources.
3] It can access data from Hadoop (HDFS), traditional relational databases (RDBMS), NoSQL
databases, and cloud storage all in one go.
4] With Big SQL, we can run queries using a single connection, making it easy to work with
different types of data without needing multiple tools.
5] It also provides management tools for databases and integrates with popular analytics tools to
help visualize your data.

Q] How Big SQL Works?


Ans
 Big SQL's robust engine executes complex queries for relational data and Hadoop data.
 Big SQL provides an advanced SQL compiler and a cost-based optimizer for efficient query
execution.
 Combining these with a massive parallel processing (MPP) engine helps distribute query
execution across nodes in a cluster.

Q] Why Big SQL.


Ans

 Easy Data Migration: Quickly moves old data from traditional databases like Oracle into
Hadoop while keeping the original SQL syntax.

 Access to Multiple Data Sources: Allows querying data from various relational and NoSQL
databases without moving it to Hadoop.
 Single Query Access: Lets you run a single query to pull insights from different data sources,
simplifying data analysis.

HBase
Q] Explain limitation of Hadoop
Ans

 Batch Processing Only: Hadoop can only process data in large batches, which means it can't
handle real-time data processing. This makes it less suitable for situations where quick responses
are needed.

 Sequential Access: Data must be accessed in a specific order, so even for simple tasks,
Hadoop has to go through the entire dataset. This makes it slow and inefficient for small queries.

 Large Output Data: When Hadoop processes a huge dataset, it often creates another large
dataset that also needs to be processed sequentially, compounding the time and effort required.

Overall, these limitations mean that Hadoop may not be the best choice for applications that need
quick, random access to data.

Q] Explain HBase
Ans
1] HBase is a distributed, column-oriented database built on top of the Hadoop File System
(HDFS) and is part of the Hadoop ecosystem
2] HBase allows for quick random access, enabling users to read and write data in real time
without the need to process it sequentially.
3] Its architecture is similar to Google's Bigtable, which supports horizontal scalability, meaning
that as data grows, additional hardware can be added to accommodate it.
4] HBase stores data in columns rather than rows, making certain queries faster and more
efficient.
5] By leveraging the fault tolerance of HDFS, HBase ensures that data remains safe and
accessible, even in the event of hardware failures.

Q] Explain Storage mechanism in HBase


Ans
◦ HBase is a column-oriented database and the tables in it are sorted by row.
◦ The table schema defines only column families, which are the key value pairs.
◦ A table can have multiple column families, and each column family can have any number of
columns.
◦ Subsequent column values are stored contiguously on the disk. Each cell value of the table
has a timestamp.
In short, in an HBase:
◦ Table is a collection of rows.
◦ Row is a collection of column families.
◦ Column family is a collection of columns.
◦ Column is a collection of key value pairs.

Q] Explain feature of HBase


Ans
◦ HBase is linearly scalable.
◦ It has automatic failure support.
◦ It provides consistent read and writes.
◦ It integrates with Hadoop, both as a source and a destination.
◦ It has an easy Java API for clients.
◦ It provides data replication across clusters.

Q] Where to use HBase


Ans
1] HBase is used when you need quick, random access to large amounts of data.
2] It’s great for managing huge datasets that don’t fit in traditional databases, like when you
have billions of rows.
3] HBase is also perfect for non-relational data that doesn't follow a fixed structure, and it
integrates well with Hadoop for handling big data.
4] It's ideal for real-time analytics, where you need to quickly read and update data.

Q] Explain Application of HBase


Ans
HBase is used for applications that need to handle a lot of data and require fast, random access.
It’s ideal for write-heavy tasks. Companies like Facebook, Twitter, Yahoo, and Adobe use
HBase for managing large datasets efficiently.
Q] Explain architecture of HBase
Ans

HBase architecture consists of three main components: the client library, the master server, and
region servers.

1. Master Server: It assigns regions (parts of tables) to region servers, manages load
balancing, and takes care of tasks like creating tables and column families. It uses
ZooKeeper for assigning and discovering region servers.
2. Region Servers: These handle data operations, like reading and writing data, for the
regions they manage. They also split large tables into regions and store data in memory
(Memstore) before saving it permanently in HFiles.
3. ZooKeeper: Zookeeper is an open-source project that ensures coordination between
HBase components, tracks region servers, and helps handle failures.
Q] Commands

Creating a Table

 Command: create 'reviews', 'summary', 'reviewer', 'details'


 Description: This command creates a new HBase table named reviews with three column
families: summary, reviewer, and details.

Listing HBase Tables

 Command: list
 Description: This command lists all the tables present in your HBase system, including
the newly created reviews table.

Inspecting Table Properties

 Command: describe 'reviews'


 Description: This command provides details about the reviews table, including its
schema, column families, and configurations.

Inserting Data

Description: These commands insert values into specific cells in the reviews table. Each
command specifies the table name, row key, column identifier (column family:qualifier), and the
value to insert.

 Examples:

put 'reviews', '101', 'summary:product', 'hat'   # Inserts 'hat' into 'summary:product' for row '101'
put 'reviews', '101', 'summary:rating', '5'      # Inserts '5' into 'summary:rating' for row '101'
put 'reviews', '112', 'summary:product', 'dress' # Inserts 'dress' into 'summary:product' for row '112'
put 'reviews', '112', 'summary:rating', '3'      # Inserts '3' into 'summary:rating' for row '112'
put 'reviews', '112', 'reviewer:name', 'Tina'    # Inserts 'Tina' into 'reviewer:name' for row '112'

Retrieving a Row

 Command: get 'reviews', '101'


 Description: This command retrieves all columns for the row with the key 101 in the
reviews table.

Counting Rows

 Command: count 'reviews'


 Description: This command counts the total number of rows in the reviews table

Scanning the Table

 Command: scan 'reviews'


 Description: This command scans the entire reviews table, retrieving all rows and their
respective columns.

Deleting a Specific Cell

 Command: delete 'reviews', '112', 'reviewer:name'


 Description: This command deletes the name column under the reviewer column family
for the row with key 112.

Deleting All Cells in a Row

 Command: deleteall 'reviews', '112'


 Description: This command deletes all cells associated with the row key 112 from the
reviews table.
Hive
Q] What is Hive?
Ans
1] Hive is a tool built on top of Hadoop that helps process and analyze large amounts of
structured data.
2] It simplifies working with big data by allowing users to write queries similar to SQL, making
it easy to summarize and analyze data.
3] Originally developed by Facebook, it is now managed by Apache and used by companies like
Amazon.

4] Hive is not a traditional database, it is not designed for transaction processing (OLTP), and it
doesn't support real-time queries or updates at the row level.

Q] Explain features of Hive

Ans

 It stores schema in a database and processed data into HDFS.


 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.

Q] Characteristics of Hive
Ans
1] Structured Data: Hive works with structured data stored in tables and databases. You create
tables first and then load data into them.
2] Query Optimization: Hive offers features for query optimization, unlike MapReduce,
making it faster and more efficient for querying large datasets.
3] SQL-like Language: Hive uses a language similar to SQL (called HQL), making it easy for
users familiar with databases to interact with big data.
4] Partitioning: To improve query performance, Hive can partition data by using directory
structures.
5] Metastore: Hive uses a metastore, typically a relational database, to store schema information
about the tables.
6] Multiple Access Methods: You can interact with Hive via Web GUI, JDBC, or command
line, with the CLI being the most common method.

7] File Formats: Hive supports different file formats like TEXTFILE, SEQUENCEFILE, ORC,
and RCFILE for storing data.

8] Metadata Storage: Hive uses a Derby database for single-user metadata storage and MySQL
for multiple users.

Q] Explain architecture of Hive


Ans

The architecture of Hive consists of three main parts:

1. Hive Clients: These are the interfaces that users or applications use to interact with Hive.
Different clients exist for different needs:
o Thrift client for Thrift-based applications.
o JDBC drivers for Java-based applications.
o ODBC drivers for other applications.

2. Hive Services: This is the core layer that handles all client requests. It includes:

 The Command Line Interface (CLI) for executing queries and Data Definition
Language (DDL) operations.
 The Main Driver, which receives client requests (via JDBC, ODBC, etc.), processes
them, and communicates with other Hive components like the Meta Store and the File
System for further processing.
3. Hive Storage and Computing: This layer interacts with the actual data stored in the Hadoop
Distributed File System (HDFS). It includes:

 Meta Store: Stores schema and metadata information for the Hive tables.
 File System: Stores the query results and table data in HDFS.
 Job Client: Executes MapReduce or other jobs to process the data.

Q] Job execution flow


Ans
1] A query is submitted via the User Interface (UI) to Driver
2] The Driver interacts with the Compiler to create a plan for executing the query.
3] The Compiler requests metadata (such as table structure) from the Meta Store
4] The Meta Store sends this information back to the Compiler.
5] The Compiler sends the execution plan to the Driver, which then passes it to the Execution
Engine (EE).
6] The Execution Engine contacts the Name Node to get metadata about where data is stored in
Data Nodes. It fetches the actual data from the Data Nodes, where the table data is stored.
7] The EE performs DDL operations (like CREATE, DROP) and processes the query by
communicating with Hadoop components (Name Node, Data Nodes).

8] The fetched data is sent from the EE back to the Driver, which then sends the results to the UI
for display.

Q] Explain different modes of Hive


Ans

Hive operates in two modes based on the data size and the Hadoop setup:

1. Local Mode:
o Used when Hadoop is in pseudo-distributed mode with just one data node.
o Ideal for small datasets that fit on a single local machine.
o Processing is faster for smaller data since it runs on the local machine.
2. MapReduce Mode:
o Used for large datasets spread across multiple data nodes in a Hadoop cluster.
o Hive queries are executed using MapReduce, distributing the workload across the
cluster.
o Suitable for handling big data and distributed processing.
In short, Local mode is for small, local datasets, while MapReduce mode is for large,
distributed datasets.

By default, Hive works in MapReduce mode; for local mode you can use the following setting.

To make Hive work in local mode, set:

SET mapred.job.tracker=local;

Q] Commands

Starting Hive: hive

1] Create Database
CREATE DATABASE database_name;
Description: Creates a new database in Hive.
Example : CREATE DATABASE my_database;

2] Display all the databases
SHOW DATABASES;
Description: Lists all the databases in Hive.

3] Use Database
USE database_name;
Description: Switches to the specified database.
4] Create Table
CREATE TABLE table_name (column1 datatype, column2 datatype, ...)
row format delimited fields terminated by ',';
Description : Creates a new table with specified columns and data types.
Example : CREATE TABLE users (id INT, name STRING, age INT)
row format delimited fields terminated by ',';

5] Describe Table
DESCRIBE table_name;
Description : Displays the schema information of a table.
DESCRIBE users;

6] Suppose we have employee.txt and we have to load employee data into our table
load data local inpath '/home/cloudera/employee.txt' into table employee;
load data local inpath '/home/cloudera/project.txt' into table project;

7] Select query
Select * from employee;
Select * from project;
Select * from employee where salary>=40000;

8] Join
select * from employee join project on employee.emp_id=project.emp_id;

9] Group By
select location, avg(salary) from employee group by location;

10] Order by
Select * from employee order by dept;

Q] Limitation of Hive
Ans

 Hive is suitable for batch processing but not suitable for real-time data handling.
 Update and delete are not allowed, but we can delete in bulk, i.e., we can delete the entire
table but not individual rows.
 Hive is not suitable for OLTP (Online Transactional Processing) operations.

Pig
Q] Explain Pig
Ans
1] Apache Pig is a tool used for processing large amounts of data, particularly in the Hadoop
ecosystem.
2] It provides a high level of abstraction over MapReduce for data processing.
3] It provides a high-level scripting language, known as Pig Latin which is used to develop the
data analysis codes.
4] The Pig Engine, a part of Apache Pig, automatically translates these scripts into MapReduce
tasks that run behind the scenes, so users don't have to worry about the details.
5] The results of the data processing are stored in HDFS (Hadoop Distributed File System).

Q] Need of Pig
Ans

 Shorter Development Time: Writing complex data processing tasks in Pig Latin takes much
less time—about 10 lines of code instead of 200 lines in Java.

 Ease of Use: It’s easier for programmers without a Java background to use Pig, especially
those familiar with SQL.

 Multi-Query Approach: Pig allows users to run multiple queries together, making it more
efficient.

 Built-in Functions: Pig offers many built-in operators and supports complex data types (like
tuples and bags), enhancing data manipulation capabilities.

Q] Features of Pig

 Rich Set of Operators: Provides operators for filtering, joining, sorting, and aggregating
data.

 Ease of Use: Designed to be easy to learn and write, especially for those familiar with SQL.

 Extensibility: Allows users to create custom processes and user-defined functions (UDFs) in
languages like Python and Java.

 Simple Join Operations: Facilitates straightforward join operations between datasets.

 Concise Code: Reduces the amount of code needed for data processing compared to
traditional MapReduce.

 Pipeline Splits: Supports splitting processes within data pipelines for better performance.

 Integration with Hadoop Ecosystem: Works well with other Hadoop components like Hive,
Spark, and ZooKeeper.
 Multivalued and Nested Data Structures: Capable of handling complex data structures,
including nested and multivalued data.

 Structured and Unstructured Data Analysis: Can process and analyze both structured and
unstructured data efficiently.

Q] Explain types of Data models in Pig


Ans
• Atom: It is a atomic data value which is used to store as a string. The main use of this model
is that it can be used as a number and as well as a string.
• Tuple: It is an ordered set of the fields.
• Bag: It is a collection of the tuples.
• Map: It is a set of key/value pairs.
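
For illustration, sample values written in Pig's usual notation (the names are made up):

Atom  : 'raju'  or  30
Tuple : (raju, 30)
Bag   : { (raju, 30), (mohan, 45) }
Map   : [ name#raju, age#30 ]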

Q] Explain apache pig execution and script mode

Execution Modes:

1. Local Mode: Runs on the local machine without Hadoop or HDFS. It’s mainly used for
testing small datasets.
2. MapReduce Mode: Runs on Hadoop, using data stored in HDFS. Here, Pig Latin scripts
trigger MapReduce jobs to process data in the backend.

Script Execution Modes:

1. Interactive Mode (Grunt Shell): Allows real-time execution of Pig Latin commands in
the Grunt shell, with results displayed immediately.
2. Batch Mode (Script): Executes a Pig script file (.pig extension) in a single run, useful for
automating tasks.
3. Embedded Mode (UDF): Allows custom functions (User Defined Functions) in Java or
other languages to be embedded within Pig scripts, extending Pig’s capabilities.

Q] Explain UDF
Ans

User Defined Functions (UDFs) in Apache Pig allow users to extend Pig's functionality by
writing custom processing logic. Here's a brief overview:
 Language Support: Full support is provided in Java, which allows for efficient UDFs
that can handle various processing tasks. Limited support is available for other
programming languages.
 Types of UDFs:
1. Filter Functions: Used in filter statements to apply conditions, taking a Pig value
as input and returning a Boolean result.
2. Eval Functions: Used in FOREACH-GENERATE statements to transform data,
taking a Pig value and returning a Pig result.
3. Algebraic Functions: Designed to operate on inner bags in a FOREACH-
GENERATE statement, enabling full MapReduce operations on grouped data.
 Piggybank: A repository of Java UDFs that allows users to access and contribute their
own UDFs, fostering a community of shared functions.

Q] Commands

LOAD

 Description: Loads data from a specified location into a Pig relation.


 data = LOAD 'input/data.txt' USING PigStorage(',') AS (name:chararray, age:int);

DUMP

 Description: Outputs the contents of a relation to the console.


 DUMP data;

DESCRIBE

 Description: Displays the schema of a relation, showing the data types and structure.
 DESCRIBE data;

ILLUSTRATE

 Explanation: Demonstrates how data flows through a set of operations in a Pig script
 ILLUSTRATE data;

GROUP

 Explanation: Groups data by a specified field, creating a new relation containing grouped
records.
 grouped_data = GROUP data BY age;

GROUPING BY MULTIPLE COLUMNS

 Explanation: Groups data based on multiple fields to allow complex aggregation.


 grouped_multi = GROUP data BY (age, name);
GROUP ALL

 Explanation: Groups all records into a single group, allowing for aggregate operations
on the entire dataset.
 all_grouped = GROUP data ALL;

GROUPING TWO RELATIONS USING COGROUP

 Explanation: Groups records from two different relations based on a specified field,
allowing for operations on both datasets simultaneously.
 cogrouped_data = COGROUP data1 BY age, data2 BY age;

JOIN

 Explanation: Combines two or more relations based on a common field, producing a new
relation with matched records.
 joined_data = JOIN data1 BY id, data2 BY id;
 Relation3_name = JOIN data1 BY id LEFT OUTER, data2 BY id;

CROSS OPERATOR

 Explanation: Performs a Cartesian product (cross join) between two relations, resulting
in all possible combinations of records.
 crossed_data = CROSS data1, data2;
Unit-5
Q] What is big data visualization
Ans

1] Big data visualization involves transforming large, complex data sets into visual formats like
charts, graphs, and maps, making the data easier to analyze and interpret.
2] It simplifies the process of identifying patterns, trends, and insights that might be difficult to
detect in raw data.
3] Techniques range from simple visualizations like line charts and pie charts to more advanced
ones like heat maps, tree maps, and 3D graphs, depending on the complexity and goal.
4] Since big data is vast and often can't fit into a single screen, specialized visualization tools
help extract meaningful insights from massive data sets, much like refining crude oil into usable
fuel.
5] This makes data more accessible, even to those who may not be comfortable working directly
with raw data or SQL queries.

Q] Explain why is data visualization important in big data?


Ans

1] Data visualization is crucial in big data because it simplifies complex data sets, making it
easier for decision-makers to quickly interpret insights and make informed choices.
2] Visualization tools present data without losing accuracy, allowing control over precision and
aggregation levels.
3] They enable the creation of dashboards and reports that consolidate all relevant information in
one place, enhancing communication across an organization.
4] This is valuable across industries, from healthcare to finance, where clear, actionable insights
are needed to drive efficiency and informed decision-making.
Q] Explain types of big data visualization

1] Line Chart: A line chart shows how something changes over time. For example, tracking your
monthly savings for a year can be displayed with months on the x-axis and your savings on the
y-axis. The points are connected to form a line showing the trend.

2] Bar Chart: A bar chart is used to compare categories. For instance, if you want to see how
many people like different types of movies (comedy, action, drama), each category is shown as a
bar, and the length of the bar represents how many people like that type of movie.

3] Pie Chart: A pie chart shows how a whole is divided into parts. For example, if 40% of
students prefer pizza, 30% prefer burgers, and 30% prefer sandwiches, a pie chart divides a circle
into slices that represent each food preference.

4] Histogram: A histogram shows the distribution of data. For example, if you wanted to see how
many students scored in certain ranges on a test, the histogram groups the scores into ranges (0-
20, 21-40, etc.) and shows how many students fall into each range.

5] Heat Map: A heat map uses color to show data patterns. For example, in a heat map of daily
temperatures in different cities, colors represent temperature ranges—blue for cold, red for hot—
giving a quick visual of temperature changes across locations.

6] Scatter Plot: A scatter plot shows the relationship between two variables using dots. For
example, if you’re comparing the height of trees to their stem thickness, each dot represents a
tree, with its height on one axis and stem thickness on the other, helping you see if there’s a
connection between the two.
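
As a small sketch, the line-chart example above could be drawn in Python with matplotlib (the savings figures are made up):

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
savings = [500, 650, 600, 800, 950, 900]   # hypothetical monthly savings

plt.plot(months, savings, marker="o")      # connect the points to show the trend over time
plt.xlabel("Month")
plt.ylabel("Savings")
plt.title("Monthly savings")
plt.show()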
Q] Calculate the box plot
Ans

10, 23, 12, 28, 17, 24, 8, 30, 15, 20, 25, 19, 26

Step 1 : Order the given data set in ascending order

8, 10, 12, 15, 17, 19, 20, 23, 24, 25, 26, 28, 30

Step 2 : Find the median. The value that is in the middle

median = 20

Step 3 : Consider the left side. They are 8, 10, 12, 15, 17, 19 and find the median.

The median is between 12 and 15, so the median is calculated by taking the mean of 12 and 15:

median = (12 + 15)/2 = 13.5


lower quartile = Q1 = 13.5

Step 4 : Consider the right side. They are 23, 24, 25, 26, 28, 30 and find the median.

The median is between 25 and 26, so the median is calculated by taking the mean of 25 and 26:

median = (25 + 26)/2 = 25.5


upper quartile = Q3 = 25.5
Minimum = 8
lower quartile = 13.5
median = 20
upper quartile = 25.5
maximum = 30
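
The same five-number summary can be checked with a short Python sketch that follows the median-of-halves method used above:

def median(values):
    values = sorted(values)
    n = len(values)
    mid = n // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2

data = sorted([10, 23, 12, 28, 17, 24, 8, 30, 15, 20, 25, 19, 26])
n = len(data)
q2 = median(data)                  # overall median            -> 20
q1 = median(data[: n // 2])        # median of the lower half  -> 13.5
q3 = median(data[(n + 1) // 2:])   # median of the upper half  -> 25.5
print(min(data), q1, q2, q3, max(data))   # 8 13.5 20 25.5 30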

Q] Explain relationship between mean , median, mode


Ans

 Negative Skew (Left Skew):

 The median is towards the top of the data (closer to larger values).
 The median is greater than the mean.
 The upper quartile is closer to the median than the lower quartile is.

 No Skew (Symmetric):

 The median is in the center of the data.

 The median equals the mean.
 The upper and lower quartiles are equidistant from the median.

 Positive Skew (Right Skew):

 The median is towards the bottom of the data (closer to smaller values).
 The median is less than the mean.
 The lower quartile is closer to the median than the upper quartile is.
Q] List challenges of visualizing large amount of data
Ans

 Perceptual Scalability: Our eyes can't catch all the important details from a huge amount of
data, and even large screens have trouble displaying it all clearly.

 Real-time Scalability: We expect to see data in real-time, but processing massive datasets
takes time, making real-time updates difficult.

 Interactive Scalability: Interacting with large datasets helps us understand them better, but
as the data grows, visualizing it can slow down the system, sometimes causing it to freeze or
crash.

Unit-6
Q] What is a Recommendation system?
 Recommendation systems predict what items (like movies, products, or music) a user
might prefer or rate highly.
 They are widely used in platforms such as YouTube, IMDb, Amazon, and Flipkart to
personalize content and improve user experience.
 These systems analyze user behavior to suggest items users are likely to enjoy or find
relevant.

Q] Explain collaborative filtering


 Collaborative filtering predicts what users might like based on preferences of similar
users.
 It identifies users with similar tastes by comparing their ratings or likes on items
 Let's assume user U1 likes movies m1, m2, m4; user U2 likes movies m1, m3, m4; and user U3
likes movie m1. Our job is to recommend which new movies user U3 should watch next.
 Here, users U1, U2, and U3 all like movie m1, and since U1 and U2 also like m4,
collaborative filtering might suggest m4 to U3, assuming similar preferences. This
method is widely used in recommending movies, news, and other content.

Q] Explain types of collaborative filtering


There are two type of collaborative filtering
1. User-user-based collaborative filtering

User-user collaborative filtering is a recommendation method that looks for similar users
based on the items users have already liked or positively interacted with.

Let's take an example to understand user-user collaborative filtering.

Assume we are given a matrix A that contains user IDs, item IDs, and ratings for movies.

To compute user-user similarity, we find the similarity between two users; for this we can
use cosine similarity.

2. Item item based collaborative filtering

Item-item similarity compares items (e.g., movies) based on how similar they are,
helping to recommend items that are close in preference to ones a user already likes.

A similarity matrix is created for items, often using cosine similarity.

To recommend new items to a user, we look at the items they already enjoy and
find similar items in the matrix.
 Let's suppose we have to recommend new items to user10, and we know user10 already
likes/has watched items 7, 8, and 1. Now we go to the item-item similarity matrix and take
the most similar items to items 7, 8, and 1 based on the similarity values.
 Let's suppose the most similar items to item7 are {item9, item4, item10}, the most similar
items to item8 are {item19, item4, item10}, and the most similar items to item1 are
{item9, item14, item10}.
 Now we combine the items from every set; the combined items are {item9, item4, item10,
item19, item14}, and we recommend all these items to user10.

Q] Explain Cosine similarity


Cosine similarity measures how similar two documents are, regardless of their length.
It calculates the cosine of the angle between two word-frequency vectors in multi-dimensional space.
It ranges from -1 (completely dissimilar) to 1 (completely similar), with 0 meaning no similarity.
Commonly used in natural language processing to compare documents based on word usage, not size.
            apple   orange   banana
Document1     1       1        1
Document2     1       2        0
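
A worked sketch in Python using the two word-count vectors from the table above:

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc1 = [1, 1, 1]   # counts of (apple, orange, banana) in Document1
doc2 = [1, 2, 0]   # counts of (apple, orange, banana) in Document2

print(round(cosine_similarity(doc1, doc2), 4))   # 0.7746 -> fairly similar word usage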

Q] Explain classification algorithm


 Classification algorithms are used to categorize data into a class or category.
 It can be performed on both structured or unstructured data.
 Classification can be of three types: binary classification, multiclass classification, multilabel
classification.
Q] Explain clustering and its types
 Clustering is a method of grouping unlabeled data points into clusters, where points in the
same cluster are more similar to each other than to those in different clusters.
 The goal is to identify patterns or groupings within the data without any prior labeling.

Types of Clustering:

1. Hard Clustering:
o Each data point is assigned to one specific cluster.
o There is no overlap between clusters; a point either fully belongs to a cluster or it
doesn’t.
o Example: Grouping customers into distinct segments based on spending
behavior, where each customer fits into only one category.
2. Soft Clustering:
o Data points can belong to multiple clusters with varying probabilities.
o Each point has a likelihood of being part of different clusters rather than being
strictly assigned to one.
o Example: Assigning customers a probability of fitting into different buying
behavior clusters based on their purchase history.

Q] List challenges face in classification algorithm

1. Scalability: It’s hard for some algorithms to work well when there is a lot of data.
2. Dimensionality: Having too many features (like characteristics or measurements) can
confuse algorithms and lower their accuracy.
3. Imbalanced Data: Sometimes, one group of data is much smaller than others, making it
difficult for the algorithm to recognize it properly.
4. Computational Complexity: Some algorithms take a lot of time and computing power,
which can slow things down.
5. Data Quality: If the data is noisy or has mistakes, it can lead to wrong results.
6. Feature Selection: It’s tough to choose the most important features from a large set, but
it’s necessary for getting good results.
7. Interpretability: Complex models can be hard to understand, which is important in
sensitive areas like healthcare.
8. Resource Constraints: Limited memory and processing power can make it hard to
analyze big data effectively.
9. Concept Drift: Changes in data patterns over time require algorithms to adjust to keep
their accuracy.
10. Privacy and Security: It’s crucial to protect sensitive information while analyzing data.

Q] Explain Naïve bayes theorem

Naive Bayes is a classification algorithm based on Bayes' Theorem, which calculates the
probability of an event based on prior knowledge of conditions related to that event. The "naive"
part comes from the assumption that all features (or predictors) are independent of each other,
which is rarely true in real-life data but simplifies calculations significantly.

Bayes' Theorem formula in probability terms:

P(A|B) = P(B|A) * P(A) / P(B)

where:

 P(A|B) is the likelihood of hypothesis A given the data B. This is known as the posterior
probability.
 P(B|A) is the likelihood of the data B given that hypothesis A was true.
 P(A) is the likelihood of hypothesis A being true (regardless of the data). This is known as
the prior probability of A.
 P(B) is the likelihood of the data (regardless of the hypothesis).
 P(A|B) and P(B|A) are conditional probabilities: P(B|A) = P(A and B)/P(A)

Types of Naive Bayes Classifiers:

1. Multinomial Naive Bayes - suitable for discrete data (e.g., word frequencies in text
classification).
2. Bernoulli Naive Bayes - used for binary/boolean data.
3. Gaussian Naive Bayes - used for continuous data, assuming a normal distribution.
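
As a minimal sketch (assuming scikit-learn is available; the toy data below is made up), Gaussian Naive Bayes can be used like this:

from sklearn.naive_bayes import GaussianNB

# hypothetical continuous features [height_cm, weight_kg] with two classes (0/1)
X = [[170, 65], [180, 80], [160, 50], [175, 75], [155, 45], [185, 85]]
y = [1, 1, 0, 1, 0, 1]

model = GaussianNB()
model.fit(X, y)

print(model.predict([[165, 55]]))         # predicted class for a new sample
print(model.predict_proba([[165, 55]]))   # posterior probabilities P(class | features)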

Pros:

 Fast and effective, especially with small datasets.


 Works well for multiclass prediction problems.
 Requires less training data compared to other models if the independence assumption holds.

Cons:

 Assumes independence among features, which is often not true in real-world data, potentially
reducing accuracy.

Q] Explain text mining and its application


Text data mining, also called text analytics, is the process of finding useful information from unstructured
text, like emails, social media posts, and reviews.
 Sentiment Analysis: This checks if the feelings in the text are positive, negative, or neutral.
Companies use it to monitor customer feedback and manage their reputation.

 Topic Modeling: This finds main topics in a group of texts, helping to organize and
categorize large amounts of information, like news articles.

 Text Classification: This sorts documents into specific categories, which is useful for things
like spam detection or tagging content.

 Information Retrieval: This helps find relevant information in large text collections based
on user searches, like how search engines work.

 Text Summarization: This creates a brief summary of a longer text, helping to quickly
understand the main points.

 Language Translation: This translates text between languages using techniques that help
make sense of the words.

 Fraud Detection: This looks at text data from things like insurance claims to find signs of
fraud.

 Healthcare Analytics: This analyzes medical records to gather useful information for
diagnosing and treating patients.

 Social Media Analytics: This studies text from social media to understand trends, feelings,
and user behavior.

Q] Explain traditional data mining and its application


Traditional data mining is the process of finding patterns and insights in structured data using statistical
and machine learning methods.

1. Data Collection: Gather relevant information from different sources like databases or
spreadsheets.
2. Data Preprocessing: Clean and prepare the data for analysis, which includes fixing
missing values and adjusting the data format.
3. Exploratory Data Analysis (EDA): Look at the data to understand its features and find
any noticeable patterns.
4. Feature Selection/Engineering: Choose important variables for analysis and create new
ones that could help improve the results.
5. Model Building: Use various algorithms to create models that can predict outcomes or
describe data. Common methods include decision trees and logistic regression.
6. Model Evaluation: Check how well the models perform using measures like accuracy
and precision.
7. Model Deployment: Use the models to make predictions or gain insights from new data.
Q] Difference between text mining and data mining

Q] Explain information retrieval


Information retrieval (IR) is the process of finding documents or information that meet a user's
needs.
The main goal of IR is to help users find the information they're looking for, whether it's text
documents, images, or other multimedia content.
Users express their information needs using specific queries or keywords that the system can
understand.
Large collections of documents are organized so that the system can quickly identify which
documents might be relevant to the user's request.
When a user submits a query, the system matches it against the stored documents to find the best
matches. However, this matching isn't always perfect, and some information may be lost in the
process of converting user requests and documents into a format the system can use.

Q] Explain search engine


A search engine is a software tool that helps users find information on the internet based on their
queries. When a user enters a query, the search engine scans and retrieves a list of relevant web pages.

For example, if a student wants to learn C++ programming, they might enter a query like "C++ tutorial
GeeksforGeeks" into a search engine like Google.
The search engine then quickly scans its vast index of web pages to find content related to "C++ tutorial
GeeksforGeeks."

After analyzing the query, it sorts the most relevant and useful links, such as GeeksforGeeks tutorials on
C++, and presents them in an ordered list

Q] Explain how search engine works

Search engines operate through three main steps: crawling, indexing, and ranking.

1. Crawling: Search engines use programs called crawlers to scan and discover publicly
available information on websites. Crawlers visit each page, read the HTML, and
understand the content, structure, and update time. Crawling is crucial because if search
engines can’t access your site, it won’t show up in search results.
2. Indexing: After crawling, the information is organized and stored in an index, which is
like a database of web content. The index contains key information like page titles,
descriptions, keywords, and links. Indexing is essential because only indexed content can
appear in search results.
3. Ranking: Ranking determines the order in which search results appear based on relevance.

There are three steps that explain how ranking works:

Step 1: Analyze user query – This is where the search engine tries to understand what
kind of information the user wants. It breaks down the search into keywords. For
example, if you type "how to make a chocolate cupcake," the search engine knows you're
looking for specific instructions, so it shows recipes and guides. It can also handle similar
meanings (like "change" and "replace") and corrects spelling mistakes.

Step 2: Finding matching pages – Next, the search engine looks through its index to
find the best pages that match the query. For instance, if you search "dark wallpaper," it
will likely show images instead of text, since that's probably what you're looking for.

Step 3: Present the results to the users – Finally, the search engine shows you a list of
results, usually with ten main links. It may also display ads, quick answers, or other extra
information to help you find what you need.

Q] How we determine performance of search engine

The performance of a search engine relies on:

1. Effectiveness: The relevance and accuracy of search results.


2. Efficiency: The speed of response time and the number of queries processed
(throughput).
Q] List and explain usage of search engine

 Searching for Information: Search engines allow users to find a wealth of information on
any topic. For instance, someone looking to buy a mobile phone might search for “best mobile
phones in 2021.” The search engine provides a list of options, complete with features, reviews,
and prices, helping users make informed decisions.

 Searching Images and Videos: Users can specifically search for visual content such as
images and videos. For example, a person interested in nature can search for "flowers" to find a
variety of pictures and videos. Search engines categorize these visual assets, making it easy for
users to find exactly what they need.

 Searching Locations: Search engines are invaluable for finding geographical locations. For
instance, if someone is visiting Goa and wants to locate Palolem beach, they can simply enter
"Palolem beach" in the search bar. The search engine will provide directions, maps, and
information about the best routes to reach their destination.

 Searching People: Search engines help users find individuals by searching their names or
social media profiles. This feature is especially useful for reconnecting with friends, networking,
or conducting research on public figures.

 Shopping: Search engines play a crucial role in online shopping. Users can search for
specific products, and the search engine will return a list of websites that sell the item, often
displaying prices, reviews, and shipping options. This allows consumers to compare deals and
find the best offers available.

 Entertainment: Search engines are widely used for entertainment purposes. Users can search
for movies, music, games, and trailers. For example, if someone wants to watch a movie called
"Ram," they can search for it and receive a list of streaming services or websites where it can be
viewed or purchased.

 Education: Search engines serve as an educational resource, enabling users to learn about a
vast array of topics. Whether someone wants to learn how to cook, explore programming
languages, or find home decoration tips, search engines provide access to tutorials, articles, and
videos, functioning as an open school that offers free learning opportunities.

Q] Explain evaluation of search engine

1. Result Evaluation: This involves assessing the search results returned by the engine for
specific queries. Evaluators can classify results as relevant or not, rank them by
relevancy, score them on a scale (1-5), or compare pairs to determine which is more
relevant. This helps fine-tune the algorithm for better accuracy.
2. Recommendation Analysis: Evaluators examine related results to determine their
relevance to the original query. This is especially important for eCommerce sites, where
recommendations can encourage additional purchases. Improved recommendation
systems enhance user experience and drive revenue.
3. Query Categorization: This evaluates how well the search engine differentiates between
similar queries, such as "apple products" (fruit vs. Apple Inc.). By categorizing queries,
evaluators train the algorithm to understand user intent, which is crucial for delivering
relevant results.
4. Caption Evaluation: This focuses on the effectiveness of captions and taglines
associated with search results. Evaluators from the target demographic assess which
captions resonate with users and drive engagement, providing insights into what works
and what doesn’t.
5. Ad Relevance: This assesses the relevance of paid advertisements in relation to user
queries. Evaluators ensure that ads and their landing pages align with the search intent.
Relevant ads improve user experience and prevent dissatisfaction with the search results.

Q] Explain advanced search engine technologies

Spatiotemporal data, which combines location and time information, plays a crucial role in the
transportation industry, especially with the rise of big data technologies. Here are key
applications:

1. Real-time Vehicle Tracking: GPS devices in vehicles generate massive data that allows
companies to monitor locations and optimize routes in real time.
2. Traffic Management: Analyzing large datasets from traffic sensors and cameras helps
manage congestion, improve traffic flow, and suggest alternative routes.
3. Geospatial Applications: New applications visualize vehicle movements over time,
aiding in route planning and accident analysis.
4. Spatial Databases: Efficient storage and retrieval of large spatiotemporal datasets
require advanced spatial databases that can handle dynamic data.

Overall, big data enhances the transportation industry and other sectors by enabling real-time decision-
making, improving efficiency, and providing valuable insights into complex spatial phenomena.
