Big Data NOTES
Q1. What is Big Data and give types of big data. (5) Dec 2023
1. The term 'Big Data' means huge volume, high velocity and a variety of data. This big data is increasing
tremendously day by day.
2. Traditional data management systems and existing tools face difficulties in processing such Big Data.
3. Big Data is one of the most important technologies in the modern world.
4. Big Data is a collection of large datasets that cannot be processed using traditional computing techniques.
5. Big Data includes huge volume, high velocity and extensible variety of data.
Big Data Types:
1. Structured Data: Organized in predictable formats like tables, spreadsheets, and databases. Think of
neatly arranged books on shelves, categorized by genre and author.
2. Semi-structured Data: Contains some organization but allows flexibility, like XML and JSON files.
Imagine books with detailed summaries and tags instead of rigid chapters.
3. Unstructured Data: Lacks defined format, including text, images, videos, and social media posts.
Imagine a treasure trove of handwritten notes, diaries, and sketches alongside the books.
Q2. What are three Vs of Big Data? Give two examples of big data case studies. Indicate which Vs are
satisfied by these case studies. (5) May 2023
1. Volume:
The name ‘Big Data’ itself is related to a size which is enormous.
Volume is a huge amount of data.
To determine the value of data, its size plays a very crucial role. If the volume of data is very large,
then it is actually considered 'Big Data'.
Example: In the year 2016, the estimated global mobile traffic was 6.2 exabytes (6.2 billion GB) per
month, and it was estimated that by 2020 there would be almost 40,000 exabytes of data.
2. Velocity:
Velocity refers to the high speed of accumulation of data.
In Big Data, velocity refers to data flowing in from sources like machines, networks, social media and mobile phones.
There is a massive and continuous flow of data. This determines the potential of the data, i.e. how fast
the data is generated and processed to meet demands.
Sampling the data can help in dealing with issues like velocity.
Example: More than 3.5 billion searches are made on Google per day, and the number of Facebook users
is increasing by approximately 22% year on year.
3. Variety:
It refers to the nature of data: structured, semi-structured and unstructured.
It also refers to heterogeneous sources.
Variety is basically the arrival of data from new sources, both inside and outside an
enterprise. It can be structured, semi-structured or unstructured.
Q3. List & explain Big data: 1) Characteristics 2) Types 3) Challenges
Big Data Characteristics:
1. Volume: Refers to the sheer amount of data generated and stored, often characterized by its massive
scale.
2. Veracity: Reflects the reliability and trustworthiness of the data, emphasizing the need to ensure
accuracy and consistency.
3. Variety: Encompasses the diverse types and sources of data, including structured, unstructured, and
semi-structured data formats.
4. Value: Indicates the significance and usefulness of the data in driving insights, decision-making, and
creating business value.
5. Velocity: Describes the speed at which data is generated, processed, and analyzed, highlighting the
importance of real-time or near-real-time data processing capabilities.
Big Data Types:
1. Structured Data: Organized in predictable formats like tables, spreadsheets, and databases. Think of
neatly arranged books on shelves, categorized by genre and author.
2. Semi-structured Data: Contains some organization but allows flexibility, like XML and JSON files.
Imagine books with detailed summaries and tags instead of rigid chapters.
3. Unstructured Data: Lacks defined format, including text, images, videos, and social media posts.
Imagine a treasure trove of handwritten notes, diaries, and sketches alongside the books.
Challenges of Big Data:
1. Storage and Management: Finding cost-effective ways to store, organize, and access massive datasets
2. Analysis and Processing: Developing computational power and methods to analyze diverse data formats
efficiently.
3. Privacy and Security: Protecting sensitive information within vast datasets against breaches and misuse
4. Integration and Interoperability: Combining data from diverse sources and ensuring they work
together seamlessly.
Chapter 2 : Introduction to Big Data Frameworks
Q1. What are the advantages and limitations of Hadoop (5) Dec 2023
Advantages:
1. Scalability:
Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in
a cluster and processed in parallel.
2. Flexibility:
Hadoop is designed in such a way that it can deal very efficiently with any kind of dataset, whether structured,
semi-structured or unstructured.
3. Speed:
Hadoop uses a distributed file system, HDFS (Hadoop Distributed File System), to manage its storage. A
massive number of file blocks are processed in parallel, which makes Hadoop faster.
4. Fault Tolerance:
In Hadoop, data is replicated on various DataNodes in a cluster, which ensures the availability of
data even if one of the systems crashes.
Limitations:
1. Small files: Inefficient for many small files due to overhead and metadata management.
2. Real-time processing: Not optimized for real-time data analysis, better suited for batch processing.
3. Iterative processing: MapReduce framework not ideal for iterative algorithms.
4. Security: Inherits security vulnerabilities from Java and lacks default encryption.
Q2. Explain CAP theorem and explain how NoSQL systems guarantee BASE property. (5) Dec/May
2023.
1. The CAP theorem also known as Brewer's theorem, states that in a distributed system, it is impossible
to simultaneously achieve all three properties: consistency, availability and partition tolerance.
2. This theorem highlights the trade-offs that distributed systems face when designing for these
properties.
Consistency ensures that all nodes in the system see the same data at the same time, availability
guarantees that every request receives a response, and partition tolerance allows the system to continue
operating despite network failures.
How NoSQL systems guarantee the BASE property:
1. NoSQL databases, like MongoDB and Cassandra, adhere to the BASE (Basically Available, Soft
state, Eventually consistent) properties instead of the ACID (Atomicity, Consistency, Isolation,
Durability) properties of traditional databases.
2. NoSQL systems prioritize availability and partition tolerance over strong consistency, aiming for
eventual consistency after network partitions are resolved.
Q3. NoSQL data stores with example. (5)(10) May Dec 2023
1. Document Data Store:
A document database is a type of NoSQL database that can be used to store and query data as JSON-like documents.
JavaScript Object Notation (JSON) is an open data interchange format that is both human and
machine-readable.
Developers can use JSON documents in their code and save them directly into the document database.
The flexible, semi-structured, and hierarchical nature of documents and document databases allows
them to evolve with applications' needs. Example: MongoDB (a short sketch follows this list).
2. Column Family Data store:
Column family data stores arrange data into logically related columns grouped into "column
families."
Instead of organizing information primarily by rows, they organize it by columns.
They include column families, keys, keyspaces and columns.
Columns of the same type benefit from more efficient compression, which makes
reads faster. Example: HBase, Bigtable
3. Key-Value Stores:
A key-value store is a non-relational database.
The simplest form of a NoSQL database is a key-value store.
Every data element in the database is stored in key-value pairs.
The data can be retrieved by using a unique key allotted to each element in the database. Example:
Redis, DynamoDB
4. Graph-Based databases:
Graph-based databases focus on the relationships between elements.
They store the data in the form of nodes in the database.
The connections between the nodes are called links or relationships. Example: Neo4j, OrientDB
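To make the document model from the first item above concrete, here is a minimal, self-contained Python sketch; the book titles, fields and the find_by_tag helper are made-up illustrations, and a real document database such as MongoDB would offer an equivalent query API over stored JSON documents.

```python
import json

# Two JSON-like documents; note that they need not share the same fields (schema flexibility).
books = [
    {"_id": 1, "title": "Big Data Basics", "tags": ["hadoop", "hdfs"], "year": 2020},
    {"_id": 2, "title": "Stream Mining", "author": {"name": "A. Kumar"}, "tags": ["dgim"]},
]

# A simple "query": find documents carrying a given tag, similar in spirit to a
# document-database query such as find({"tags": "hadoop"}).
def find_by_tag(docs, tag):
    return [d for d in docs if tag in d.get("tags", [])]

print(json.dumps(find_by_tag(books, "hadoop"), indent=2))
```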
Q4. Explain the distributed storage system of Hadoop with the help of a neat diagram (10) May 2023
Distributed Storage System of Hadoop (HDFS)
In HDFS, data is stored across multiple machines in a cluster to handle very large amounts of data
efficiently.
The system consists of two main components: NameNode and DataNode.
1. NameNode:
The NameNode acts as the master server in the HDFS architecture.
It manages the file system namespace, controls client access to files, and oversees file operations like
renaming, opening and closing files.
The NameNode stores metadata about the data, such as file names, sizes, and block locations.
2. DataNode:
DataNodes are commodity hardware nodes that store and manage the actual data blocks.
For every node in the cluster, there is a corresponding DataNode responsible for read-write operations on
the file system.
DataNodes perform tasks like block creation, deletion and replication based on instructions from the
NameNode.
3. Block:
Files in HDFS are divided into segments called blocks.
Blocks are the minimum unit of data that HDFS can read or write.
The default block size in HDFS is 64 MB (128 MB in Hadoop 2.x and later), but it can be adjusted based on configuration needs.
[Diagram: Hadoop Distributed File System architecture, showing the NameNode, DataNodes and replicated blocks]
Q5. List five services of Apache hadoop. Explain Different Hadoop Components (Same).
1. HDFS:
HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large
datasets of structured or unstructured data across various nodes, thereby maintaining the metadata in the
form of log files.
HDFS consists of two core components i.e. Name node and Data Node
2. YARN:
Yet Another Resource Negotiator (YARN), as the name implies, helps manage the resources across the
clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components, i.e. Resource Manager, Node Manager and Application Manager
3. MapReduce:
By making use of distributed and parallel algorithms, MapReduce carries the processing logic to the
data and helps write applications that transform big datasets into manageable ones.
MapReduce makes use of two functions, i.e. Map() and Reduce()
4. PIG:
It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig does the work of executing commands and in the background, all the activities of MapReduce are
taken care of. After the processing, pig stores the result in HDFS.
5. HIVE:
With the help of SQL methodology and interface, HIVE performs reading and writing of large data sets.
However, its query language is called HQL (Hive Query Language).
It is highly scalable, as it allows both real-time and batch processing. Also, all SQL
datatypes are supported by Hive, thus making query processing easier.
6. HBase:
It’s a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop
database. It provides the capabilities of Google’s Bigtable and is therefore able to work on Big Data sets effectively.
Q1. Describe the pseudocode for one-step matrix multiplication using mapreduce. (10) Dec 2023
Apply the same to determine the product of matrices M and N: (10) Dec 2023
M = [[1, 2, 3], [4, 5, 6]] (2×3)    N = [[1, 2], [3, 4], [5, 6]] (3×2)
Show output of each stage distinctly.
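Below is a hedged Python sketch of the standard one-step (single MapReduce job) matrix multiplication: each map call emits ((i, k), (tag, j, value)) pairs, the shuffle groups them by output cell (i, k), and each reduce call pairs the M and N values on j and sums the products. The shuffle is simulated in memory, and the matrices are assumed to be the 2×3 and 3×2 reading of M and N reconstructed above.

```python
from collections import defaultdict

# In-memory simulation of one-step MapReduce matrix multiplication.
# M is assumed to be 2x3 and N to be 3x2 (see the reconstruction in the question).
M = [[1, 2, 3],
     [4, 5, 6]]
N = [[1, 2],
     [3, 4],
     [5, 6]]
n_rows, n_mid, n_cols = len(M), len(N), len(N[0])

def map_phase():
    """Emit ((i, k), (tag, j, value)) for every element of M and N."""
    for i in range(n_rows):
        for j in range(n_mid):
            for k in range(n_cols):
                yield (i, k), ("M", j, M[i][j])
    for j in range(n_mid):
        for k in range(n_cols):
            for i in range(n_rows):
                yield (i, k), ("N", j, N[j][k])

def reduce_phase(grouped):
    """For each output cell (i, k), pair M and N values on j and sum the products."""
    result = {}
    for (i, k), values in grouped.items():
        m_vals = {j: v for tag, j, v in values if tag == "M"}
        n_vals = {j: v for tag, j, v in values if tag == "N"}
        result[(i, k)] = sum(m_vals[j] * n_vals[j] for j in m_vals if j in n_vals)
    return result

grouped = defaultdict(list)            # shuffle stage: group mapper output by key (i, k)
for key, value in map_phase():
    grouped[key].append(value)

print(reduce_phase(grouped))           # {(0, 0): 22, (0, 1): 28, (1, 0): 49, (1, 1): 64}
```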
Q3. Show any 5 different relational algebra operations with example. (10) Dec 2023
Q4. For the graph given below use Clique percolation and find all communities. (10) Dec May 2023
Q5. Discuss Matrix-Matrix Multiplication. Perform Matrix Multiplication with 1-step Map Reduce
method. (10) May 2023
Q6. Explain Grouping and Aggregation algorithm using MapReduce. Support your answer with a
suitable example. (10) May 2023
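A small Python simulation of grouping and aggregation with MapReduce is sketched below; the toy relation, its field names and the SUM aggregate are assumptions chosen only to show the map, shuffle and reduce stages (any of MIN, MAX, COUNT or AVG could be used in the reducer instead).

```python
from collections import defaultdict

# Toy relation R(department, salary); the records are made up for illustration.
# The job computes, in effect: SELECT department, SUM(salary) FROM R GROUP BY department.
records = [("sales", 100), ("hr", 80), ("sales", 120), ("it", 90), ("hr", 60)]

def map_fn(record):
    dept, salary = record
    yield dept, salary                 # key = grouping attribute, value = attribute to aggregate

def reduce_fn(key, values):
    return key, sum(values)            # aggregation function: SUM

groups = defaultdict(list)
for record in records:
    for key, value in map_fn(record):
        groups[key].append(value)      # shuffle: collect mapper output by key

print(dict(reduce_fn(k, v) for k, v in groups.items()))
# {'sales': 220, 'hr': 140, 'it': 90}
```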
Q7. Explain Map Reduce.
Q2. Summarize Bloom’s filter with example and its applications. (10) Dec 2023
1. A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a
member of a set.
2. For example, checking the availability of a username is a set-membership problem, where the set is the list of all
registered usernames.
3. It is probabilistic in nature, which means there might be some false positive results.
4. A false positive means it might report that a given username is already taken when actually it is not.
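A minimal Bloom filter sketch in Python is given below; the bit-array size m, the number of hash functions k, and the use of salted SHA-256 digests are illustrative choices rather than a prescribed design.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array with k hash functions.
    False positives are possible, false negatives are not."""
    def __init__(self, m=64, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Derive k bit positions by salting the item with the hash-function index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos                  # set the bit at each hashed position

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

# Username-availability check, as in the example above.
taken = BloomFilter()
for name in ["alice", "bob", "carol"]:
    taken.add(name)

print(taken.might_contain("alice"))   # True (definitely added)
print(taken.might_contain("dave"))    # usually False; a True here would be a false positive
```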
Applications:
Medium uses Bloom filters for recommending posts to users by filtering out posts which have already been seen by the
user.
Quora implemented a shared bloom filter in the feed backend to filter out stories that people have seen
before.
The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
Google Bigtable, Apache HBase, Apache Cassandra and PostgreSQL use Bloom filters to reduce the
disk lookups for non-existent rows or columns.
Q3. Explain the DGIM algorithm. State the rules used in DGIM that must be followed. (10) Dec 2023
1. DGIM algorithm (Datar-Gionis-Indyk-Motwani Algorithm)
2. In its simplest version, the DGIM algorithm represents an N-bit window using O(log² N) bits.
3. The algorithm's estimation error is no more than 50%.
4. The two basic components of this algo are Timestamp and Bucket.
5. The first bit is assigned timestamp 1, the second bit is assigned timestamp 2 and so on.
6. The windows are divided into buckets consisting of 1’s and 0's.
Rules:
Every bucket should contain at least a single 1 in it (there are no buckets of all 0's)
The right end of every bucket should be a 1
The size of a bucket is equal to the number of 1's in it
Every bucket size should be a power of 2
As we move to the left, the bucket size should not decrease
No more than two buckets can have the same size
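The following simplified Python sketch maintains DGIM buckets as (timestamp, size) pairs according to the rules above; the window length and the bit stream below are made-up examples.

```python
class DGIM:
    """Simplified DGIM: each bucket is (timestamp of its most recent 1, size),
    sizes are powers of two, and at most two buckets of any size are kept."""
    def __init__(self, window):
        self.window = window
        self.time = 0
        self.buckets = []                     # newest bucket first

    def add(self, bit):
        self.time += 1
        # Drop buckets whose most recent 1 has fallen out of the window.
        self.buckets = [(t, s) for t, s in self.buckets if t > self.time - self.window]
        if bit == 1:
            self.buckets.insert(0, (self.time, 1))
            self._merge()

    def _merge(self):
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 1][1] == self.buckets[i + 2][1]:
                # Three buckets of equal size: merge the two OLDER ones into one of double
                # size, keeping the more recent of their timestamps.
                t_newer = self.buckets[i + 1][0]
                size = self.buckets[i + 1][1] * 2
                self.buckets[i + 1:i + 3] = [(t_newer, size)]
            else:
                i += 1

    def count_ones(self):
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        return total - self.buckets[-1][1] // 2   # count only half of the oldest bucket

dgim = DGIM(window=10)
for bit in [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]:
    dgim.add(bit)
print(dgim.count_ones())   # prints 6; the true count of 1's in the last 10 bits is also 6
```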
Q4. Give two applications for counting the number of 1’s in a long stream of binary values. Using a
stream of binary digits, illustrate how DGIM will find the number of 1’s. (10) May 2023
https://www.youtube.com/watch?v=Z_MLrbI1s2E
Applications:
Social Media Analytics: In social media analytics, the DGIM algorithm can be applied to estimate
the popularity or engagement level of posts, hashtags, or topics over time.
Network Traffic Monitoring: In network traffic monitoring systems, the DGIM algorithm can be
used to estimate the number of active connections or the volume of traffic passing through a network
link within a certain time window.
Q5. Suppose a data stream consists of integers 1,3,5,4,6,1,5,9,3,2. Let the hash function used be: (10)
May 2023
h1(x) = (x + 1) mod 16
h2(x) = (2x + 3) mod 16
h3(x) = (3x + 1) mod 16
Show how the Flajolet-Martin algorithm will estimate the number of distinct elements in the
stream. (10) May 2023
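A small script applying the Flajolet-Martin estimate to this stream is sketched below. Here r(x) is the number of trailing zeros in the binary hash value and the estimate is 2 to the power of the maximum r seen; treating a hash value of 0 (binary 0000) as having 4 trailing zeros is an assumed convention, since the hash range 0 to 15 uses 4 bits, so the resulting numbers should be checked against the convention used in class.

```python
stream = [1, 3, 5, 4, 6, 1, 5, 9, 3, 2]   # true number of distinct elements: 7

hash_functions = {
    "h1(x) = (x + 1) mod 16":  lambda x: (x + 1) % 16,
    "h2(x) = (2x + 3) mod 16": lambda x: (2 * x + 3) % 16,
    "h3(x) = (3x + 1) mod 16": lambda x: (3 * x + 1) % 16,
}

def trailing_zeros(value, bits=4):
    """Number of trailing zeros in the binary representation (0 counts as `bits` zeros)."""
    if value == 0:
        return bits
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count

for name, h in hash_functions.items():
    r_max = max(trailing_zeros(h(x)) for x in stream)
    print(f"{name}: R = {r_max}, estimate = 2^{r_max} = {2 ** r_max}")
```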
Chapter 5 : Big Data Mining Algorithms
Q1. Explain PCY algorithm and its 2 types with neat labeled diagram. Dec 2023
1. The PCY algorithm uses the main-memory space left unused on the first pass for an array of integer counts (a hash table of buckets) that generalizes the idea of a Bloom filter.
2. This algorithm exploits the observation that there may be much unused space in main memory on the first
pass.
3. If there are a million items and gigabytes of main memory, we do not need more than 10% of the main
memory for the two tables.
4. The PCY algorithm uses hashing to efficiently count item set frequencies and reduce overall
computational cost.
5. The basic idea is to use a hash function to map pairs of items to buckets and to keep a count per bucket on the
first pass; on the second pass, only pairs that hash to a frequent bucket are counted as candidates.
https://youtu.be/OLu-Bsx-e0Q?si=3RrOrmB8_9WzgdK6
Types are not given explicitly in the book; the two refinements of the PCY algorithm are the Multistage and Multihash algorithms.
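A compact Python sketch of the two basic PCY passes is given below; the baskets, the support threshold, the number of buckets and the pair hash function are all illustrative assumptions, not values from the book.

```python
from collections import defaultdict
from itertools import combinations

# Toy market-basket data (made up for illustration).
baskets = [
    {1, 2, 3}, {1, 2, 4}, {2, 3, 4}, {1, 2}, {2, 3},
    {1, 3, 4}, {1, 2, 3, 4}, {2, 4},
]
SUPPORT = 3
NUM_BUCKETS = 7

def bucket_of(i, j):
    return (i * j) % NUM_BUCKETS       # pair hash function (an assumption)

# Pass 1: count single items and hash every pair into a bucket counter.
item_counts = defaultdict(int)
bucket_counts = defaultdict(int)
for basket in baskets:
    for item in basket:
        item_counts[item] += 1
    for i, j in combinations(sorted(basket), 2):
        bucket_counts[bucket_of(i, j)] += 1

frequent_items = {i for i, c in item_counts.items() if c >= SUPPORT}
bitmap = {b for b, c in bucket_counts.items() if c >= SUPPORT}     # frequent buckets only

# Pass 2: count a pair only if both items are frequent AND its bucket is frequent.
pair_counts = defaultdict(int)
for basket in baskets:
    for i, j in combinations(sorted(basket), 2):
        if i in frequent_items and j in frequent_items and bucket_of(i, j) in bitmap:
            pair_counts[(i, j)] += 1

print({pair: c for pair, c in pair_counts.items() if c >= SUPPORT})   # frequent pairs
```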
Q4. Explain CURE algorithm, clearly stating its advantages over traditional clustering algorithm. (10)
May 2023
1. It is a hierarchical based clustering technique, that adopts a middle ground between the centroid
based and the all-point extremes.
2. Hierarchical clustering is a type of clustering that starts with single-point clusters and keeps merging
clusters until the desired number of clusters is formed.
3. It is used for identifying both spherical and non-spherical clusters.
4. It is useful for discovering groups and identifying interesting distributions in the underlying data.
5. Instead of using a single centroid point, as most data mining algorithms do, CURE uses a set of well-scattered
representative points for efficiently handling the clusters and eliminating the outliers.
Advantages:
1. Handling Arbitrary Shape Clusters: CURE addresses this issue by using a hierarchical approach that
employs a combination of partitioning and hierarchical clustering.
2. Scalability: CURE is designed to be more scalable compared to some traditional clustering algorithms.
3. Outlier Robustness: CURE's use of representative points helps mitigate the impact of outliers by
focusing on the dense regions of clusters
4. Parameter Insensitivity: CURE is relatively less sensitive to parameter choices due to its hierarchical
nature and the use of representative points.
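The core CURE idea, well-scattered representative points shrunk toward the cluster centroid, can be sketched as below; the number of representatives c, the shrink factor alpha, the toy 2-D clusters and the single assignment step are simplified assumptions rather than the full hierarchical algorithm.

```python
def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def representatives(cluster, c=3, alpha=0.3):
    """Pick c well-scattered points of a cluster, then shrink each toward the centroid by alpha."""
    centroid = (sum(p[0] for p in cluster) / len(cluster),
                sum(p[1] for p in cluster) / len(cluster))
    reps = []
    for _ in range(min(c, len(cluster))):
        # Farthest-point heuristic: pick the point farthest from the reps chosen so far
        # (farthest from the centroid for the first pick).
        candidate = max(cluster,
                        key=lambda p: min((dist(p, r) for r in reps), default=dist(p, centroid)))
        reps.append(candidate)
    # Shrinking toward the centroid dampens the effect of outliers.
    return [(r[0] + alpha * (centroid[0] - r[0]),
             r[1] + alpha * (centroid[1] - r[1])) for r in reps]

# Two toy clusters; a new point is assigned to the cluster with the closest representative point.
cluster_a = [(0, 0), (1, 0), (0, 1), (1, 1)]
cluster_b = [(8, 8), (9, 8), (8, 9), (9, 9)]
reps = {"A": representatives(cluster_a), "B": representatives(cluster_b)}

new_point = (2, 2)
closest = min(reps, key=lambda label: min(dist(new_point, r) for r in reps[label]))
print(closest)   # 'A'
```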
Multistage algorithm, final pass: After the second pass, the second hash table is also summarized as a bitmap, and that
bitmap is stored in main memory.
MultiHash Algorithm:
1. In some scenarios, a single pass can provide most of the benefit of the two or more passes of the Multistage
algorithm. This variation of the PCY algorithm is known as the Multihash algorithm.
2. It suggests using two hash functions and two separate hash tables that share main memory on the first pass,
rather than using two different hash tables on two successive passes.
3. In the second step of the Multihash algorithm, each hash table is transformed into a bitmap. Since each of the
two hash tables is half the size, the space needed for the two bitmaps together is the same as that required for a
single full-size bitmap.
4. In the third step of the Multihash algorithm, for a pair {i, j} to be counted, both i and j must be frequent items,
and the pair must hash to a bucket that is frequent in both hash tables, ensuring coordination between the two
tables.
Q1. Explain collaborative filtering. How it is different from content based filtering?(10) May Dec 2023
1. Collaborative filtering is a recommendation system that suggests items to a user based on similar users'
preferences.
2. This system does not use the attributes or features of the items to make recommendations but instead
uses the past behavior of users to identify similar users and recommend items that similar users have
liked.
3. There are two main types of collaborative filtering:
a. User-based collaborative filtering: This method finds similar users based on their past interactions
with items and then recommends items that similar users have liked. For example, if two users have
similar viewing histories on Netflix, the system may recommend the same movie to both users.
b. Item-based collaborative filtering: This method finds similar items based on how users have
interacted with them and then recommends those similar items to a user. For example, if a user has
liked several movies of a particular genre, the system may recommend other movies of that genre to
the user.
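A minimal user-based collaborative filtering sketch in Python follows; the users, movies, ratings and the choice of cosine similarity are illustrative assumptions.

```python
import math

# Made-up user-item rating data.
ratings = {
    "alice": {"Inception": 5, "Interstellar": 4, "Titanic": 1},
    "bob":   {"Inception": 4, "Interstellar": 5, "Dunkirk": 4},
    "carol": {"Titanic": 5, "Notebook": 4},
}

def cosine(u, v):
    """Cosine similarity between two users' rating vectors (over their common items)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = math.sqrt(sum(r * r for r in u.values())) * math.sqrt(sum(r * r for r in v.values()))
    return num / den

def recommend(user, k=1):
    """Recommend items rated by the k most similar users that `user` has not rated yet."""
    others = sorted(((cosine(ratings[user], ratings[o]), o) for o in ratings if o != user),
                    reverse=True)
    seen = set(ratings[user])
    recs = []
    for _, other in others[:k]:
        recs.extend(item for item in ratings[other] if item not in seen)
    return recs

print(recommend("alice"))   # ['Dunkirk']: bob is alice's most similar user
```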
Difference:
Content-based filtering is suitable for items that are easy to describe and compare by their features,
while collaborative filtering can offer more flexibility for items with diverse or complex features.
Content-based filtering is suitable for providing personalized recommendations that match user
preferences and interests, while collaborative filtering can provide surprising and diverse
recommendations that expose users to new or popular items.
Q2. Define Hub and Authority. Compute the Hub and Authority scores for the web: (10) Dec 2023
Practice the HITS algorithm and Hub/Authority score numericals (very important numerical, Chapter 6).
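Since the actual graph for this numerical is not reproduced in these notes, the sketch below runs two HITS iterations on a made-up link graph; normalizing by the vector length each iteration is one common convention and may differ from the one used in class.

```python
# Toy link graph: page -> list of pages it links to (made up for illustration).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(graph)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(2):                               # two iterations, as the question asks
    # Authority score: sum of hub scores of pages that link TO the page.
    auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
    # Hub score: sum of authority scores of the pages the page links to.
    hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
    # Normalize so scores stay comparable across iterations.
    a_norm = sum(v * v for v in auth.values()) ** 0.5
    h_norm = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print("authority:", {p: round(v, 3) for p, v in auth.items()})
print("hub:", {p: round(v, 3) for p, v in hub.items()})
```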
Q3. Structure of the web (5) Dec 2023
Q4. Explain what characteristics of social media makes it suitable for Big Data. (5) May 2023
1. One of the essential characteristics of Big Data originating from social media is that it is real-time or
near real-time.
2. This gives exploratory analysis a wide perspective on what is happening and what is about to happen
at a certain time in a certain area.
3. Each fundamental trait of Big Data can be understood as a parameter for quantitative, qualitative and
exploratory information analysis.
Volume - There are two types of data that social media platforms collect: structured and unstructured. For
social scientists, the total mass of the data allows the definition of multiple classes and criteria and the refining
of analysis sets and subsets.
Variety - The data formats vary from text documents, tables to video data, audio data and many more. This
lifts the data analysis to a higher complexity level; therefore, the statistical models will also be adjusted in
order to obtain viable information.
Velocity - Speed is a key aspect in trend and real-life phenomena analysis. The faster the data is generated,
shared and understood the more information it can reveal.
Veracity - For the seasoned data analyst it is essential to be able to evaluate the truthfulness, the accuracy
and honesty of the data put to analysis.
Q5. Recall HITS algorithm. Generate Hub and Authority score after 2 iterations for the graph given
here. (10) May 2023
Practice the HITS algorithm and Hub/Authority score numericals (very important numerical, Chapter 6).