BDA Final Notes
Variety
1] Big data can be structured, semi-structured, or unstructured, and it is collected from many different sources.
2] In the past, data was collected mainly from databases and spreadsheets; these days it arrives in a wide array of forms such as PDFs, emails, audio files, social media posts, photos, and videos.
Velocity
1] Velocity refers to the speed at which data is created, often in real time.
2] It covers the speed at which incoming data sets arrive, their rate of change, and bursts of activity.
3] Big data velocity deals with the speed at which data flows in from sources such as application logs, business processes, networks, social media sites, sensors, and mobile devices.
1] HDFS (Hadoop Distributed File System) : It is the core of Hadoop, providing massive
storage across multiple computers, allowing the storage of petabytes of data in files. It's based on
Google's File System.
2] YARN (Yet Another Resource Negotiator) : It manages the resources (CPU, memory)
across the network and runs distributed applications.
3] HBase : It is a NoSQL database that provides huge storage in the form of database tables. It's
ideal for managing large volumes of records, offering scalable and efficient storage for big
datasets.
4] MapReduce : It is a distributed computing framework that uses YARN to run tasks and has
an efficient sorting engine. Programs are written in two parts:
Map: Transforms raw data into key-value pairs.
Reduce: Groups and combines data based on keys.
5] Spark : Spark is a faster, more recent computational framework similar to MapReduce for
solving Big Data problems. It uses similar concepts but processes data more quickly and
efficiently. Spark also has its own large ecosystem, which will be covered in detail later.
6] Hive : Apache Hive lets you write SQL queries instead of complex MapReduce code. It
converts these SQL queries into MapReduce jobs, making it easier and faster to process large
structured or semi-structured data.
7] Pig Latin : Pig Latin is a simple, SQL-like language for expressing ETL tasks step by step. Pig
is the engine that translates Pig Latin into MapReduce and runs it on Hadoop for big data
processing.
8] Mahout : Mahout is a library for distributed machine learning algorithms. It breaks down
complex tasks to run efficiently using MapReduce on multiple machines.
9] Apache Zookeeper : It is a coordination tool for distributed systems like HDFS, HBase,
Kafka, and YARN. It provides services for configuration management, synchronization, and
naming in large distributed environments.
10] Flume : It helps collect unstructured data from multiple sources and send it to a central
location like HDFS. It's useful for gathering data from sources like web server logs and
aggregating it in a single place.
11] Sqoop : It transfers data between Hadoop and SQL databases. It uses MapReduce to move
data efficiently across multiple machines in a distributed network.
12] Oozie : It is a workflow engine that manages and executes tasks in sequence. It helps
automate complex workflows, like importing data, processing with Hive, using Mahout for
predictions, and saving back to SQL databases.
Unit – 2
Q] What is HDFS?
1] A single machine may take a long time (e.g., 4 hours) to process a very large file (e.g., 40TB).
2] DFS splits the large file into smaller chunks and distributes these chunks across multiple nodes (e.g., 4 nodes). Each node processes its assigned chunk simultaneously.
3] By working in parallel, DFS can process the entire 40TB file in a shorter time (e.g., 1 hour)
compared to a single system.
4] DFS can handle much larger files and more data by simply adding more nodes to the cluster.
5] DFS provides redundancy and fault tolerance, so if one node fails, other nodes can continue
processing.
NameNode
1] It is the master node: it stores the file system metadata (file names, directory structure, and the locations of all blocks) rather than the data itself.
2] It directs the DataNodes, telling them where to store, replicate, or delete blocks, and keeps track of which DataNode holds which block.
3] Usually deployed on more reliable hardware, since the whole cluster depends on it.
DataNode
1] It is the actual worker: it performs the actual data operations such as reading, writing, and processing.
2] Handles data storage, replication, and deletion based on instructions from the NameNode.
3] Can be deployed on commodity hardware.
HDFS Daemons
1] When a large file (e.g., 100TB) is uploaded, the NameNode (master node) divides it into
smaller blocks (e.g., 10TB each, though the default block size is 128 MB in Hadoop 2.x and
above).
2] These blocks are distributed across various DataNodes (slave nodes) in the cluster.
3] Each block is replicated multiple times for reliability. By default, each block has 3 replicas,
meaning each block is stored on three different DataNodes.
4] The number of replicas can be adjusted by editing the hdfs-site.xml configuration file (see the example after this list).
5] The NameNode keeps track of all blocks and their locations.
6] It knows which DataNodes store which blocks and manages all data-related tasks.
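For example, the replication factor is set by the dfs.replication property in hdfs-site.xml (a minimal sketch showing the default value of 3):
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>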
2] Balancing : The process of maintaining the proper distribution of data blocks across the cluster. If a DataNode fails, the blocks it held become unavailable, leading to under-replication. The NameNode detects this imbalance and instructs other DataNodes to replicate the lost blocks to restore the desired replication level and balance the data distribution.
3] Replication : The process of creating copies of data blocks for redundancy and fault
tolerance.Managed by DataNodes. DataNodes create and manage these replicas based on
instructions from the NameNode to ensure that data remains available even if some nodes fail.
1] Distributed Data Storage: Splits data into blocks stored across multiple nodes, enabling
efficient and scalable data management.
2] Reduced Seek Time: Smaller blocks improve access speed and efficiency for large files.
3] High Availability: Data is replicated across multiple nodes, ensuring it remains accessible
even if some nodes fail.
4] High Reliability: The system continues to function and access data even if multiple nodes are
down.
5] High Fault Tolerance: Designed to handle hardware failures by replicating data and
monitoring node health.
1] Apache Sqoop is an open-source tool designed to transfer data from structured databases (like
SQL) into Hadoop for processing.
2] The data transferred to Hadoop can be processed using tools such as: MapReduce programs,
Hive, Pig, Spark
3] Sqoop can automatically create Hive tables from the data it imports from an RDBMS
(Relational Database Management System) table.
4] Sqoop is also capable of exporting data from Hadoop back into relational databases, useful for
moving processed data into operational systems.
Features
1] Bulk Import: Sqoop can import entire databases or individual tables into HDFS, supporting
large-scale data transfers.
2] Parallelization: It speeds up data transfer by parallelizing the process for better system
performance.
3] Direct Input: Sqoop directly imports data into HBase and Hive, making it easy to map
relational databases.
4] Efficient Data Analysis: It streamlines the process of analyzing imported data in Hadoop.
5] Load Mitigation: Sqoop reduces the load on external systems during data transfer.
6] Java Classes Generation: Sqoop generates Java classes for programmatic data interaction.
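A minimal sketch of typical Sqoop commands (the JDBC connection string, credentials, table names, and HDFS paths are assumptions for illustration):
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username cloudera --password cloudera --table customers --target-dir /user/cloudera/customers -m 4
sqoop export --connect jdbc:mysql://localhost:3306/retail_db --username cloudera --password cloudera --table daily_summary --export-dir /user/cloudera/summary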
1] Apache Flume is ideal for streaming logs into the Hadoop environment.
2] Flume is designed to collect and aggregate vast amounts of log data efficiently.
3] It is a reliable, distributed service that ensures data collection across different sources.
4] Flume has an easy-to-use architecture based on streaming data flows.
5] It includes tunable reliability mechanisms, as well as recovery and failover options, to ensure
consistent performance.
1] Scalability: Flume can scale from small environments (5 machines) to large ones (thousands
of machines), making it flexible.
2] High Performance: It offers high throughput and low latency for efficient data transfer.
3] Extensibility: Despite having a declarative configuration, Flume is easy to extend.
4] Fault-Tolerant: Flume is fault-tolerant, ensuring data reliability.
5] Stream-Oriented: It is optimized for handling continuous data streams.
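A minimal sketch of a Flume agent configuration that streams a web server log into HDFS (the agent/source/channel/sink names, the log path, and the HDFS path are assumptions for illustration):
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/weblogs
a1.sinks.k1.channel = c1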
Apache sqoop
1] It works with relational databases and NoSQL databases.
2] The Sqoop load is not driven by events.
3] Ideal for data in JDBC-compatible databases (e.g., MySQL, Oracle)
4] Imports data directly to HDFS
5] Connector-based architecture
6] Fetches structured data using connectors
7] Parallel data transfers and quick imports
Apache Flume
1] It works with streaming data sources that are generated continuously in Hadoop
environments.
2] Data loading is completely event-driven.
3] Best for bulk streaming data (e.g., logs)
4] Data flows to HDFS through channels
5] Agent-based architecture
6] Fetches streaming data from sources like logs
7] Collecting and aggregating data reliably
1] Data serialization is the process of converting data into a format that can be easily saved or
transmitted, and then converting it back to its original form when needed.
2] It allows data to be stored in databases or sent over networks regardless of the system being
used.
3] Serialization translates data into a stream of bytes, while deserialization converts it back.
4] Different formats like CSV, XML, Avro, and JSON are used to store and exchange data
efficiently.
5] Proper serialization helps avoid issues like incorrect data interpretation, ensuring accurate and
effective data handling across various systems.
1] Sequence file : It is a binary file format that stores key-value pairs. It is compact and
splittable.
2] Avro : It is a binary serialization format that encodes data with rich structures and supports schema evolution (a sample Avro schema is shown after this list).
3] Parquet : It is a columnar storage file format that organizes data by columns for efficient analytics.
4] ORC (Optimized Row Columnar) : It is a columnar format optimized for high read performance.
5] HBase : It is a NoSQL database with its own storage format, organizing data into tables with rows and columns.
6] RCFile (Record Columnar File) : It is a columnar storage format designed for fast data loading and query performance.
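As a small illustration, an Avro schema is written in JSON; a minimal sketch for a hypothetical Employee record (the field names are assumptions) is:
{ "type": "record", "name": "Employee",
  "fields": [ {"name": "id", "type": "int"},
              {"name": "name", "type": "string"},
              {"name": "salary", "type": "float"} ] }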
sudo su
mount -t vboxsf sharedfolder /home/cloudera/Desktop/Windows
Q] Commands
1. hadoop fs -ls : lists the files and directories in an HDFS path.
2. hadoop fs -put : copies a file from the local file system into HDFS.
3. hadoop fs -get : copies a file from HDFS to the local file system.
4. hadoop fs -rm : removes a file from HDFS.
5. hadoop fs -mv : moves (renames) a file within HDFS.
6. hadoop fs -cp : copies a file within HDFS.
7. hadoop fs -cat : prints the contents of an HDFS file.
8. hadoop fs -tail : displays the last kilobyte of an HDFS file.
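For example (the paths below are hypothetical):
hadoop fs -put /home/cloudera/employee.txt /user/cloudera/
hadoop fs -ls /user/cloudera/
hadoop fs -cat /user/cloudera/employee.txt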
Unit – 3
Q] Explain MapReduce
MapReduce is a programming model that processes large datasets by dividing them into smaller chunks,
processing them simultaneously, and then combining the results.
Map: Data is split into parts and processed to generate key-value pairs.
Reduce: Key-value pairs are grouped by key and combined to produce the final result.
Fault Tolerance: If a machine fails, MapReduce automatically manages the failure and
reruns the task on another machine, keeping processing uninterrupted.
Flexibility: MapReduce works with a wide variety of data types and tasks, making it
adaptable for different data processing needs.
1. Map Stage:
o The mapper function reads input data stored in the Hadoop Distributed File
System (HDFS), usually as files or directories.
o Each line of input is processed by the mapper, which breaks it down into smaller
data chunks.
o The mapper then generates intermediate key-value pairs, which are passed on for
the next stage.
2. Shuffle Stage:
o In this stage, the intermediate key-value pairs from the mappers are sorted and
grouped by key.
o This grouping organizes data so that each reducer can focus on a single key with
all associated values.
o Data is then transferred to the appropriate reducer nodes.
3. Reduce Stage:
o The reducer function processes the grouped data from the shuffle stage, applying
operations like aggregation or transformation.
o It generates the final output, which is stored back in HDFS for easy access.
Q] Example
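A small worked word-count example (the input lines are hypothetical):
Input lines: "cat dog", "dog cat", "cat"
Map output (key-value pairs): (cat, 1), (dog, 1), (dog, 1), (cat, 1), (cat, 1)
Shuffle and sort (grouped by key): (cat, [1, 1, 1]), (dog, [1, 1])
Reduce output: (cat, 3), (dog, 2)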
Q] Explain anatomy of shuffling and sorting
Shuffling: This is the process of moving the mapper’s output to the reducers as input. It
groups data by keys so each reducer receives all data for a given key. Shuffling starts even before
all map tasks are complete, speeding up the job.
Sorting: The MapReduce framework automatically sorts all keys produced by the mappers
before they reach the reducers. This sorted order helps reducers differentiate when a new reduce
task should start (when the key changes), making the reduce phase more efficient.
Q] Example
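For example (hypothetical mapper outputs): if mapper 1 emits (dog, 1), (apple, 1), (cat, 1) and mapper 2 emits (cat, 1), (apple, 1), then shuffling sends all values of a key to the same reducer, and sorting arranges the keys, so the reduce phase sees (apple, [1, 1]), (cat, [1, 1]), (dog, [1]) in key order. Each change of key (apple to cat, cat to dog) marks the start of a new reduce call.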
Unit-4
BigSQL
Q] What is Big SQL
Ans
1] IBM Big SQL is a powerful SQL engine designed for Hadoop
2] It allows users to efficiently query and analyze large amounts of data from various sources.
3] It can access data from Hadoop (HDFS), traditional relational databases (RDBMS), NoSQL
databases, and cloud storage all in one go.
4] With Big SQL, we can run queries using a single connection, making it easy to work with
different types of data without needing multiple tools.
5] It also provides management tools for databases and integrates with popular analytics tools to
help visualize your data.
Easy Data Migration: Quickly moves old data from traditional databases like Oracle into
Hadoop while keeping the original SQL syntax.
Access to Multiple Data Sources: Allows querying data from various relational and NoSQL
databases without moving it to Hadoop.
Single Query Access: Lets you run a single query to pull insights from different data sources,
simplifying data analysis.
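As a rough illustration of single-query access, a Big SQL query can join data that lives in different systems once both tables have been defined to Big SQL (a minimal sketch; sales is assumed to be a Hadoop table, customers an RDBMS table, and the column names are assumptions):
SELECT c.region, SUM(s.amount) AS total_sales
FROM sales s JOIN customers c ON s.cust_id = c.cust_id
GROUP BY c.region;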
HBase
Q] Explain limitations of Hadoop
Ans
Batch Processing Only: Hadoop can only process data in large batches, which means it can't
handle real-time data processing. This makes it less suitable for situations where quick responses
are needed.
Sequential Access: Data must be accessed in a specific order, so even for simple tasks,
Hadoop has to go through the entire dataset. This makes it slow and inefficient for small queries.
Large Output Data: When Hadoop processes a huge dataset, it often creates another large
dataset that also needs to be processed sequentially, compounding the time and effort required.
Overall, these limitations mean that Hadoop may not be the best choice for applications that need
quick, random access to data.
Q] Explain HBase
Ans
1] HBase is a distributed, column-oriented database built on top of the Hadoop File System
(HDFS) and is part of the Hadoop ecosystem
2] HBase allows for quick random access, enabling users to read and write data in real time
without the need to process it sequentially.
3] Its architecture is similar to Google's Bigtable, which supports horizontal scalability, meaning
that as data grows, additional hardware can be added to accommodate it.
4] HBase stores data in columns rather than rows, making certain queries faster and more
efficient.
5] By leveraging the fault tolerance of HDFS, HBase ensures that data remains safe and
accessible, even in the event of hardware failures.
HBase architecture consists of three main components: the client library, the master server, and
region servers.
1. Master Server: It assigns regions (parts of tables) to region servers, manages load
balancing, and takes care of tasks like creating tables and column families. It uses
ZooKeeper for assigning and discovering region servers.
2. Region Servers: These handle data operations, like reading and writing data, for the
regions they manage. They also split large tables into regions and store data in memory
(Memstore) before saving it permanently in HFiles.
3. ZooKeeper: Zookeeper is an open-source project that ensures coordination between
HBase components, tracks region servers, and helps handle failures.
Q] Commands
Creating a Table
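Command (a minimal sketch; the column families are inferred from the put commands below): create 'reviews', 'summary', 'reviewer'
Description: Creates the reviews table with the column families summary and reviewer.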
Command: list
Description: This command lists all the tables present in your HBase system, including
the newly created reviews table.
Inserting Data
Description: These commands insert values into specific cells in the reviews table. Each
command specifies the table name, row key, column identifier (column family), and the
value to insert.
Examples:
put 'reviews', '101', 'summary:product', 'hat'   # Inserts 'hat' into 'summary:product' for row '101'
put 'reviews', '101', 'summary:rating', '5'      # Inserts '5' into 'summary:rating' for row '101'
put 'reviews', '112', 'summary:rating', '3'      # Inserts '3' into 'summary:rating' for row '112'
put 'reviews', '112', 'reviewer:name', 'Tina'    # Inserts 'Tina' into 'reviewer:name' for row '112'
Retrieving a Row
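Example (a minimal sketch using a row key from above): get 'reviews', '101'
Description: Retrieves all the cells stored in row '101' of the reviews table.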
Counting Rows
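Example: count 'reviews'
Description: Counts the number of rows in the reviews table.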
Hive
Q] What is Hive?
Ans
1] Apache Hive is a data warehouse tool built on top of Hadoop for querying and analyzing large, structured data sets stored in HDFS.
2] It provides an SQL-like language called HiveQL (HQL), and its queries are converted into MapReduce jobs that run on the cluster.
3] It is designed for batch-style analytical (OLAP) queries over big data.
4] Hive is not a traditional database: it is not designed for transaction processing (OLTP), and it does not support real-time queries or updates at the row level.
Q] Characteristics of Hive
Ans
1] Structured Data: Hive works with structured data stored in tables and databases. You create
tables first and then load data into them.
2] Query Optimization: Hive offers features for query optimization, unlike MapReduce,
making it faster and more efficient for querying large datasets.
3] SQL-like Language: Hive uses a language similar to SQL (called HQL), making it easy for
users familiar with databases to interact with big data.
4] Partitioning: To improve query performance, Hive can partition data by using directory
structures.
5] Metastore: Hive uses a metastore, typically a relational database, to store schema information
about the tables.
6] Multiple Access Methods: You can interact with Hive via Web GUI, JDBC, or command
line, with the CLI being the most common method.
7] File Formats: Hive supports different file formats like TEXTFILE, SEQUENCEFILE, ORC,
and RCFILE for storing data.
8] Metadata Storage: Hive uses a Derby database for single-user metadata storage and MySQL
for multiple users.
1. Hive Clients: These are the interfaces that users or applications use to interact with Hive.
Different clients exist for different needs:
o Thrift client for Thrift-based applications.
o JDBC drivers for Java-based applications.
o ODBC drivers for other applications.
2. Hive Services: This is the core layer that handles all client requests. It includes:
The Command Line Interface (CLI) for executing queries and Data Definition
Language (DDL) operations.
The Main Driver, which receives client requests (via JDBC, ODBC, etc.), processes
them, and communicates with other Hive components like the Meta Store and the File
System for further processing.
3. Hive Storage and Computing: This layer interacts with the actual data stored in the Hadoop
Distributed File System (HDFS). It includes:
Meta Store: Stores schema and metadata information for the Hive tables.
File System: Stores the query results and table data in HDFS.
Job Client: Executes MapReduce or other jobs to process the data.
8] In the final step of query execution, the fetched data is sent from the Execution Engine (EE) back to the Driver, which then sends the results to the UI for display.
Hive operates in two modes based on the data size and the Hadoop setup:
1. Local Mode:
o Used when Hadoop is in pseudo-distributed mode with just one data node.
o Ideal for small datasets that fit on a single local machine.
o Processing is faster for smaller data since it runs on the local machine.
2. MapReduce Mode:
o Used for large datasets spread across multiple data nodes in a Hadoop cluster.
o Hive queries are executed using MapReduce, distributing the workload across the
cluster.
o Suitable for handling big data and distributed processing.
In short, Local mode is for small, local datasets, while MapReduce mode is for large,
distributed datasets.
By default, Hive works in MapReduce mode; to switch to local mode, use the following setting:
SET mapred.job.tracker=local;
Q] Commands
1] Create Database
CREATE DATABASE database_name;
Description: Creates a new database in Hive.
Example : CREATE DATABASE my_database;
3] Use Database
USE database_name;
Description : Switches the current session to the specified database.
Example : USE my_database;
4] Create Table
CREATE TABLE table_name (column1 datatype, column2 datatype, ...)
row format delimited fields terminated by ',';
Description : Creates a new table with specified columns and data types.
Example : CREATE TABLE users (id INT, name STRING, age INT)
row format delimited fields terminated by ',';
5] Describe Table
DESCRIBE table_name;
Description : Displays the schema information of a table.
Example : DESCRIBE users;
6] Suppose we have employee.txt and we have to load employee data into our table
load data local inpath '/home/cloudera/employee.txt' into table employee;
load data local inpath '/home/cloudera/project.txt' into table project;
7] Select query
Select * from employee;
Select * from project;
Select * from employee where salary>=40000;
8] Join
select * from employee join project on employee.emp_id=project.emp_id;
9] Group By
select location, avg(salary) from employee group by location;
10] Order by
Select * from employee order by dept;
Q] Limitations of Hive
Ans
Hive is suitable for batch processing but not suitable for real-time data handling.
Row-level update and delete are not supported; we can only delete in bulk, i.e. we can drop an entire table but not delete an individual record.
Hive is not suitable for OLTP(Online Transactional Processing) operations
Pig
Q] Explain Pig
Ans
1] Apache Pig is a tool used for processing large amounts of data, particularly in the Hadoop
ecosystem.
2] It provides a high level of abstraction over MapReduce.
3] It provides a high-level scripting language, known as Pig Latin which is used to develop the
data analysis codes.
4] The Pig Engine, a part of Apache Pig, automatically translates these scripts into MapReduce
tasks that run behind the scenes, so users don't have to worry about the details.
5] The results of the data processing are stored in HDFS (Hadoop Distributed File System).
Q] Need of Pig
Ans
Shorter Development Time: Writing complex data processing tasks in Pig Latin takes much
less time—about 10 lines of code instead of 200 lines in Java.
Ease of Use: It’s easier for programmers without a Java background to use Pig, especially
those familiar with SQL.
Multi-Query Approach: Pig allows users to run multiple queries together, making it more
efficient.
Built-in Functions: Pig offers many built-in operators and supports complex data types (like
tuples and bags), enhancing data manipulation capabilities.
Q] Features of Pig
Rich Set of Operators: Provides operators for filtering, joining, sorting, and aggregating
data.
Ease of Use: Designed to be easy to learn and write, especially for those familiar with SQL.
Extensibility: Allows users to create custom processes and user-defined functions (UDFs) in
languages like Python and Java.
Concise Code: Reduces the amount of code needed for data processing compared to
traditional MapReduce.
Pipeline Splits: Supports splitting processes within data pipelines for better performance.
Integration with Hadoop Ecosystem: Works well with other Hadoop components like Hive,
Spark, and ZooKeeper.
Multivalued and Nested Data Structures: Capable of handling complex data structures,
including nested and multivalued data.
Structured and Unstructured Data Analysis: Can process and analyze both structured and
unstructured data efficiently.
Execution Modes:
1. Local Mode: Runs on the local machine without Hadoop or HDFS. It’s mainly used for
testing small datasets.
2. MapReduce Mode: Runs on Hadoop, using data stored in HDFS. Here, Pig Latin scripts
trigger MapReduce jobs to process data in the backend.
1. Interactive Mode (Grunt Shell): Allows real-time execution of Pig Latin commands in
the Grunt shell, with results displayed immediately.
2. Batch Mode (Script): Executes a Pig script file (.pig extension) in a single run, useful for
automating tasks.
3. Embedded Mode (UDF): Allows custom functions (User Defined Functions) in Java or
other languages to be embedded within Pig scripts, extending Pig’s capabilities.
Q] Explain UDF
Ans
User Defined Functions (UDFs) in Apache Pig allow users to extend Pig's functionality by
writing custom processing logic. Here's a brief overview:
Language Support: Full support is provided in Java, which allows for efficient UDFs
that can handle various processing tasks. Limited support is available for other
programming languages.
Types of UDFs:
1. Filter Functions: Used in filter statements to apply conditions, taking a Pig value
as input and returning a Boolean result.
2. Eval Functions: Used in FOREACH-GENERATE statements to transform data,
taking a Pig value and returning a Pig result.
3. Algebraic Functions: Designed to operate on inner bags in a FOREACH-
GENERATE statement, enabling full MapReduce operations on grouped data.
Piggybank: A repository of Java UDFs that allows users to access and contribute their
own UDFs, fostering a community of shared functions.
Q] Commands
LOAD
Description: Loads data from the file system into a relation, specifying the delimiter and schema. Example (the file name and fields are illustrative):
data = LOAD 'student.txt' USING PigStorage(',') AS (name:chararray, age:int);
DUMP
Description: Runs the script up to that point and displays the contents of a relation on the screen.
DUMP data;
DESCRIBE
Description: Displays the schema of a relation, showing the data types and structure.
DESCRIBE data;
ILLUSTRATE
Explanation: Demonstrates how data flows through a set of operations in a Pig script
ILLUSTRATE data;
GROUP
Explanation: Groups all records into a single group, allowing for aggregate operations
on the entire dataset.
all_grouped = GROUP data ALL;
COGROUP
Explanation: Groups records from two different relations based on a specified field,
allowing for operations on both datasets simultaneously.
cogrouped_data = COGROUP data1 BY age, data2 BY age;
JOIN
Explanation: Joins two relations on a common field, similar to an SQL join. Example (illustrative, using the relations above):
joined_data = JOIN data1 BY age, data2 BY age;
CROSS OPERATOR
Explanation: Performs a Cartesian product (cross join) between two relations, resulting
in all possible combinations of records.
crossed_data = CROSS data1, data2;
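Putting several of these operators together, a minimal Pig Latin sketch might look like this (the file name students.txt and its fields are assumptions for illustration):
data = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int, marks:int);
adults = FILTER data BY age >= 18;
by_age = GROUP adults BY age;
avg_marks = FOREACH by_age GENERATE group AS age, AVG(adults.marks);
DUMP avg_marks;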
Unit-5
Q] What is big data visualization
Ans
1] Big data visualization involves transforming large, complex data sets into visual formats like
charts, graphs, and maps, making the data easier to analyze and interpret.
2] It simplifies the process of identifying patterns, trends, and insights that might be difficult to
detect in raw data.
3] Techniques range from simple visualizations like line charts and pie charts to more advanced
ones like heat maps, tree maps, and 3D graphs, depending on the complexity and goal.
4] Since big data is vast and often can't fit into a single screen, specialized visualization tools
help extract meaningful insights from massive data sets, much like refining crude oil into usable
fuel.
5] This makes data more accessible, even to those who may not be comfortable working directly
with raw data or SQL queries.
1] Data visualization is crucial in big data because it simplifies complex data sets, making it
easier for decision-makers to quickly interpret insights and make informed choices.
2] Visualization tools present data without losing accuracy, allowing control over precision and
aggregation levels.
3] They enable the creation of dashboards and reports that consolidate all relevant information in
one place, enhancing communication across an organization.
4] This is valuable across industries, from healthcare to finance, where clear, actionable insights
are needed to drive efficiency and informed decision-making.
Q] Explain types of big data visualization
1] Line Chart: A line chart shows how something changes over time. For example, tracking your
monthly savings for a year can be displayed with months on the x-axis and your savings on the
y-axis. The points are connected to form a line showing the trend.
2] Bar Chart: A bar chart is used to compare categories. For instance, if you want to see how
many people like different types of movies (comedy, action, drama), each category is shown as a
bar, and the length of the bar represents how many people like that type of movie.
3] Pie Chart: A pie chart shows how a whole is divided into parts. For example, if 40% of
students prefer pizza, 30% prefer burgers, and 30% prefer sandwiches, a pie chart divides a circle
into slices that represent each food preference.
4] Histogram: A histogram shows the distribution of data. For example, if you wanted to see how
many students scored in certain ranges on a test, the histogram groups the scores into ranges (0-
20, 21-40, etc.) and shows how many students fall into each range.
5] Heat Map: A heat map uses color to show data patterns. For example, in a heat map of daily
temperatures in different cities, colors represent temperature ranges—blue for cold, red for hot—
giving a quick visual of temperature changes across locations.
6] Scatter Plot: A scatter plot shows the relationship between two variables using dots. For
example, if you’re comparing the height of trees to their stem thickness, each dot represents a
tree, with its height on one axis and stem thickness on the other, helping you see if there’s a
connection between the two.
Q] Calculate the box plot
Ans
Step 1 : Given data : 10, 23, 12, 28, 17, 24, 8, 30, 15, 20, 25, 19, 26
Step 2 : Arrange the data in ascending order : 8, 10, 12, 15, 17, 19, 20, 23, 24, 25, 26, 28, 30
There are 13 values, so the median is the 7th value: median (Q2) = 20.
Step 3 : Consider the left side: 8, 10, 12, 15, 17, 19 and find its median.
The median falls between 12 and 15, so Q1 = (12 + 15) / 2 = 13.5.
Step 4 : Consider the right side: 23, 24, 25, 26, 28, 30 and find its median.
The median falls between 25 and 26, so Q3 = (25 + 26) / 2 = 25.5.
So the five-number summary for the box plot is: minimum = 8, Q1 = 13.5, median = 20, Q3 = 25.5, maximum = 30 (IQR = Q3 - Q1 = 12).
Negatively Skewed (Left Skew):
The median lies towards the top of the box (closer to the larger values).
Median is greater than the mean.
The distance from the median to the upper quartile is smaller than the distance to the lower quartile.
No Skew (Symmetric):
The median lies near the centre of the box.
Median and mean are approximately equal.
The upper and lower quartiles are roughly the same distance from the median.
Positively Skewed (Right Skew):
The median lies towards the bottom of the box (closer to the smaller values).
Median is less than the mean.
The distance from the median to the upper quartile is larger than the distance to the lower quartile.
Q] List challenges of visualizing large amount of data
Ans
Perceptual Scalability: Our eyes can't catch all the important details from a huge amount of
data, and even large screens have trouble displaying it all clearly.
Real-time Scalability: We expect to see data in real-time, but processing massive datasets
takes time, making real-time updates difficult.
Interactive Scalability: Interacting with large datasets helps us understand them better, but
as the data grows, visualizing it can slow down the system, sometimes causing it to freeze or
crash.
Unit-6
Q] What is a Recommendation system?
Recommendation systems predict what items (like movies, products, or music) a user
might prefer or rate highly.
They are widely used in platforms such as YouTube, IMDb, Amazon, and Flipkart to
personalize content and improve user experience.
These systems analyze user behavior to suggest items users are likely to enjoy or find
relevant.
User-user similarity compares two users based on the items they have rated; to find the similarity between two users we can use cosine similarity (a worked example is given after this list).
Item-item similarity compares items (e.g., movies) based on how similar they are,
helping to recommend items that are close in preference to ones a user already likes.
To recommend new items to a user, we look at the items they already enjoy and
find similar items in the matrix.
Let's suppose we have to recommend new items to user10, and we know that user10 already likes/has watched item7, item8, and item1. We go to the item-item similarity matrix and take the items most similar to item7, item8, and item1 based on the similarity values.
Let's suppose the items most similar to item7 are {item9, item4, item10}, the items most similar to item8 are {item19, item4, item10}, and the items most similar to item1 are {item9, item14, item10}.
Now we combine the items from these sets (favouring the ones that appear in more than one set): {item9, item4, item10, item19, item14}, and we recommend all of these items to user10.
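Worked example of cosine similarity (the rating vectors are hypothetical): suppose two users rated four items as A = (5, 3, 0, 4) and B = (4, 0, 0, 5).
cosine similarity = (A . B) / (|A| * |B|)
= (5*4 + 3*0 + 0*0 + 4*5) / (sqrt(25 + 9 + 0 + 16) * sqrt(16 + 0 + 0 + 25))
= 40 / (7.07 * 6.40) ≈ 0.88
A value close to 1 means the two users (or, for item-item similarity, two items) are very similar.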
Types of Clustering:
1. Hard Clustering:
o Each data point is assigned to one specific cluster.
o There is no overlap between clusters; a point either fully belongs to a cluster or it
doesn’t.
o Example: Grouping customers into distinct segments based on spending
behavior, where each customer fits into only one category.
2. Soft Clustering:
o Data points can belong to multiple clusters with varying probabilities.
o Each point has a likelihood of being part of different clusters rather than being
strictly assigned to one.
o Example: Assigning customers a probability of fitting into different buying
behavior clusters based on their purchase history.
1. Scalability: It’s hard for some algorithms to work well when there is a lot of data.
2. Dimensionality: Having too many features (like characteristics or measurements) can
confuse algorithms and lower their accuracy.
3. Imbalanced Data: Sometimes, one group of data is much smaller than others, making it
difficult for the algorithm to recognize it properly.
4. Computational Complexity: Some algorithms take a lot of time and computing power,
which can slow things down.
5. Data Quality: If the data is noisy or has mistakes, it can lead to wrong results.
6. Feature Selection: It’s tough to choose the most important features from a large set, but
it’s necessary for getting good results.
7. Interpretability: Complex models can be hard to understand, which is important in
sensitive areas like healthcare.
8. Resource Constraints: Limited memory and processing power can make it hard to
analyze big data effectively.
9. Concept Drift: Changes in data patterns over time require algorithms to adjust to keep
their accuracy.
10. Privacy and Security: It’s crucial to protect sensitive information while analyzing data.
Naive Bayes is a classification algorithm based on Bayes' Theorem, which calculates the
probability of an event based on prior knowledge of conditions related to that event. The "naive"
part comes from the assumption that all features (or predictors) are independent of each other,
which is rarely true in real-life data but simplifies calculations significantly.
Bayes' Theorem: P(A|B) = [P(B|A) * P(A)] / P(B), where:
P(A|B) is the probability of hypothesis A given the data B. This is called the posterior probability.
P(B|A) is the probability of the data B given that hypothesis A is true (the likelihood).
P(A) is the probability of hypothesis A being true regardless of the data. This is called the prior probability of A.
P(B) is the probability of the data regardless of the hypothesis (the evidence).
P(A|B) and P(B|A) are conditional probabilities: P(B|A) = P(A and B) / P(A).
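A small worked example (the numbers are hypothetical): suppose 30% of emails are spam, the word "free" appears in 40% of spam emails and in 5% of non-spam emails. Then
P(spam | "free") = P("free" | spam) * P(spam) / P("free")
= (0.4 * 0.3) / (0.4 * 0.3 + 0.05 * 0.7)
= 0.12 / 0.155 ≈ 0.77
so an email containing the word "free" is about 77% likely to be spam under these assumptions.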
1. Multinomial Naive Bayes - suitable for discrete data (e.g., word frequencies in text
classification).
2. Bernoulli Naive Bayes - used for binary/boolean data.
3. Gaussian Naive Bayes - used for continuous data, assuming a normal distribution.
Pros:
Simple and fast to train and predict, works well with high-dimensional data such as text, and performs reasonably well even with relatively small training sets.
Cons:
Assumes independence among features, which is often not true in real-world data, potentially
reducing accuracy.
Topic Modeling: This finds main topics in a group of texts, helping to organize and
categorize large amounts of information, like news articles.
Text Classification: This sorts documents into specific categories, which is useful for things
like spam detection or tagging content.
Information Retrieval: This helps find relevant information in large text collections based
on user searches, like how search engines work.
Text Summarization: This creates a brief summary of a longer text, helping to quickly
understand the main points.
Language Translation: This translates text between languages using techniques that help
make sense of the words.
Fraud Detection: This looks at text data from things like insurance claims to find signs of
fraud.
Healthcare Analytics: This analyzes medical records to gather useful information for
diagnosing and treating patients.
Social Media Analytics: This studies text from social media to understand trends, feelings,
and user behavior.
1. Data Collection: Gather relevant information from different sources like databases or
spreadsheets.
2. Data Preprocessing: Clean and prepare the data for analysis, which includes fixing
missing values and adjusting the data format.
3. Exploratory Data Analysis (EDA): Look at the data to understand its features and find
any noticeable patterns.
4. Feature Selection/Engineering: Choose important variables for analysis and create new
ones that could help improve the results.
5. Model Building: Use various algorithms to create models that can predict outcomes or
describe data. Common methods include decision trees and logistic regression.
6. Model Evaluation: Check how well the models perform using measures like accuracy
and precision.
7. Model Deployment: Use the models to make predictions or gain insights from new data.
Q] Difference between text mining and data mining
Ans
1] Text mining works on unstructured or semi-structured text (documents, emails, social media posts), whereas data mining mainly works on structured data stored in databases.
2] Text mining first needs language-processing steps (tokenization, removing stop words, etc.) to turn text into a structured form, whereas data mining can apply statistical and machine learning techniques directly to the records.
3] Text mining is used for tasks such as sentiment analysis, topic modeling, and document classification, whereas data mining is used for tasks such as association rules, clustering, and prediction.
Q] How does a search engine work?
For example, if a student wants to learn C++ programming, they might enter a query like "C++ tutorial
GeeksforGeeks" into a search engine like Google.
The search engine then quickly scans its vast index of web pages to find content related to "C++ tutorial
GeeksforGeeks."
After analyzing the query, it sorts the most relevant and useful links, such as GeeksforGeeks tutorials on
C++, and presents them in an ordered list
Search engines operate through three main steps: crawling, indexing, and ranking.
1. Crawling: Search engines use programs called crawlers to scan and discover publicly
available information on websites. Crawlers visit each page, read the HTML, and
understand the content, structure, and update time. Crawling is crucial because if search
engines can’t access your site, it won’t show up in search results.
2. Indexing: After crawling, the information is organized and stored in an index, which is
like a database of web content. The index contains key information like page titles,
descriptions, keywords, and links. Indexing is essential because only indexed content can
appear in search results.
3. Ranking: Ranking determines the order in which search results appear based on relevance.
Step 1: Analyze user query – This is where the search engine tries to understand what
kind of information the user wants. It breaks down the search into keywords. For
example, if you type "how to make a chocolate cupcake," the search engine knows you're
looking for specific instructions, so it shows recipes and guides. It can also handle similar
meanings (like "change" and "replace") and corrects spelling mistakes.
Step 2: Finding matching pages – Next, the search engine looks through its index to
find the best pages that match the query. For instance, if you search "dark wallpaper," it
will likely show images instead of text, since that's probably what you're looking for.
Step 3: Present the results to the users – Finally, the search engine shows you a list of
results, usually with ten main links. It may also display ads, quick answers, or other extra
information to help you find what you need.
Searching for Information: Search engines allow users to find a wealth of information on
any topic. For instance, someone looking to buy a mobile phone might search for “best mobile
phones in 2021.” The search engine provides a list of options, complete with features, reviews,
and prices, helping users make informed decisions.
Searching Images and Videos: Users can specifically search for visual content such as
images and videos. For example, a person interested in nature can search for "flowers" to find a
variety of pictures and videos. Search engines categorize these visual assets, making it easy for
users to find exactly what they need.
Searching Locations: Search engines are invaluable for finding geographical locations. For
instance, if someone is visiting Goa and wants to locate Palolem beach, they can simply enter
"Palolem beach" in the search bar. The search engine will provide directions, maps, and
information about the best routes to reach their destination.
Searching People: Search engines help users find individuals by searching their names or
social media profiles. This feature is especially useful for reconnecting with friends, networking,
or conducting research on public figures.
Shopping: Search engines play a crucial role in online shopping. Users can search for
specific products, and the search engine will return a list of websites that sell the item, often
displaying prices, reviews, and shipping options. This allows consumers to compare deals and
find the best offers available.
Entertainment: Search engines are widely used for entertainment purposes. Users can search
for movies, music, games, and trailers. For example, if someone wants to watch a movie called
"Ram," they can search for it and receive a list of streaming services or websites where it can be
viewed or purchased.
Education: Search engines serve as an educational resource, enabling users to learn about a
vast array of topics. Whether someone wants to learn how to cook, explore programming
languages, or find home decoration tips, search engines provide access to tutorials, articles, and
videos, functioning as an open school that offers free learning opportunities.
1. Result Evaluation: This involves assessing the search results returned by the engine for
specific queries. Evaluators can classify results as relevant or not, rank them by
relevancy, score them on a scale (1-5), or compare pairs to determine which is more
relevant. This helps fine-tune the algorithm for better accuracy.
2. Recommendation Analysis: Evaluators examine related results to determine their
relevance to the original query. This is especially important for eCommerce sites, where
recommendations can encourage additional purchases. Improved recommendation
systems enhance user experience and drive revenue.
3. Query Categorization: This evaluates how well the search engine differentiates between
similar queries, such as "apple products" (fruit vs. Apple Inc.). By categorizing queries,
evaluators train the algorithm to understand user intent, which is crucial for delivering
relevant results.
4. Caption Evaluation: This focuses on the effectiveness of captions and taglines
associated with search results. Evaluators from the target demographic assess which
captions resonate with users and drive engagement, providing insights into what works
and what doesn’t.
5. Ad Relevance: This assesses the relevance of paid advertisements in relation to user
queries. Evaluators ensure that ads and their landing pages align with the search intent.
Relevant ads improve user experience and prevent dissatisfaction with the search results.
Spatiotemporal data, which combines location and time information, plays a crucial role in the
transportation industry, especially with the rise of big data technologies. Here are key
applications:
1. Real-time Vehicle Tracking: GPS devices in vehicles generate massive data that allows
companies to monitor locations and optimize routes in real time.
2. Traffic Management: Analyzing large datasets from traffic sensors and cameras helps
manage congestion, improve traffic flow, and suggest alternative routes.
3. Geospatial Applications: New applications visualize vehicle movements over time,
aiding in route planning and accident analysis.
4. Spatial Databases: Efficient storage and retrieval of large spatiotemporal datasets
require advanced spatial databases that can handle dynamic data.
Overall, big data enhances the transportation industry and other sectors by enabling real-time decision-
making, improving efficiency, and providing valuable insights into complex spatial phenomena.