Big Data NOTES
Q1. What is Big Data and give types of big data. (5) Dec 2023
1. The term 'Big Data' means huge volume, high velocity and a variety of data. This big data is increasing
tremendously day by day.
2. Traditional data management systems and existing tools face difficulties in processing such Big Data.
3. Big Data is one of the most important technologies in the modern world.
4. Big Data is a collection of large datasets that cannot be processed using traditional computing techniques.
5. Big Data includes huge volume, high velocity and extensible variety of data.
Big Data Types:
1. Structured Data: Organized in predictable formats like tables, spreadsheets, and databases. Think of
neatly arranged books on shelves, categorized by genre and author.
2. Semi-structured Data: Contains some organization but allows flexibility, like XML and JSON files.
Imagine books with detailed summaries and tags instead of rigid chapters.
3. Unstructured Data: Lacks defined format, including text, images, videos, and social media posts.
Imagine a treasure trove of handwritten notes, diaries, and sketches alongside the books.
Q2. What are three Vs of Big Data? Give two examples of big data case studies. Indicate which Vs are
satisfied by these case studies. (5) May 2023
1. Volume:
The name ‘Big Data’ itself is related to a size which is enormous.
Volume is a huge amount of data.
To determine the value of data, its size plays a very crucial role. If the volume of data is very large,
then it is actually considered 'Big Data'.
Example: In the year 2016, the estimated global mobile traffic was 6.2 exabytes (6.2 billion GB) per
month, and it was estimated that by 2020 there would be almost 40,000 exabytes of data.
2. Velocity:
Velocity refers to the high speed of accumulation of data.
In Big Data, velocity refers to data flowing in from sources like machines, networks, social media and mobile phones.
There is a massive and continuous flow of data. This determines the potential of the data, i.e. how fast
the data is generated and processed to meet demands.
Sampling the data can help in dealing with issues like velocity.
Example: More than 3.5 billion searches are made on Google per day, and the number of Facebook users
is increasing by approximately 22% year on year.
3. Variety:
It refers to the nature of data: structured, semi-structured and unstructured.
It also refers to heterogeneous sources.
Variety is basically the arrival of data from new sources, both inside and outside an
enterprise. It can be structured, semi-structured or unstructured.
Q3. List & explain Big data: 1) Characteristics 2) Types 3) Challenges
Big Data Characteristics:
1. Volume: Refers to the sheer amount of data generated and stored, often characterized by its massive
scale.
2. Veracity: Reflects the reliability and trustworthiness of the data, emphasizing the need to ensure
accuracy and consistency.
3. Variety: Encompasses the diverse types and sources of data, including structured, unstructured, and
semi-structured data formats.
4. Value: Indicates the significance and usefulness of the data in driving insights, decision-making, and
creating business value.
5. Velocity: Describes the speed at which data is generated, processed, and analyzed, highlighting the
importance of real-time or near-real-time data processing capabilities.
Big Data Types:
1. Structured Data: Organized in predictable formats like tables, spreadsheets, and databases. Think of
neatly arranged books on shelves, categorized by genre and author.
2. Semi-structured Data: Contains some organization but allows flexibility, like XML and JSON files.
Imagine books with detailed summaries and tags instead of rigid chapters.
3. Unstructured Data: Lacks defined format, including text, images, videos, and social media posts.
Imagine a treasure trove of handwritten notes, diaries, and sketches alongside the books.
Challenges of Big Data:
1. Storage and Management: Finding cost-effective ways to store, organize, and access massive datasets
2. Analysis and Processing: Developing computational power and methods to analyze diverse data formats
efficiently.
3. Privacy and Security: Protecting sensitive information within vast datasets against breaches and misuse
4. Integration and Interoperability: Combining data from diverse sources and ensuring they work
together seamlessly.
Chapter 2 : Introduction to Big Data Frameworks
Q1. What are the advantages and limitations of Hadoop (5) Dec 2023
Advantages:
1. Scalability:
Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in
a cluster and processed in parallel.
2. Flexibility:
Hadoop is designed in such a way that it can deal very efficiently with any kind of dataset, whether structured,
semi-structured or unstructured.
3. Speed:
Hadoop uses a distributed file system, HDFS (Hadoop Distributed File System), to manage its storage. A
massive number of file blocks are processed in parallel, which makes Hadoop faster.
4. Fault Tolerance:
In Hadoop, data is replicated on various DataNodes in a cluster, which ensures the availability of
data even if one of the systems crashes.
Limitations:
1. Small files: Inefficient for many small files due to overhead and metadata management.
2. Real-time processing: Not optimized for real-time data analysis, better suited for batch processing.
3. Iterative processing: MapReduce framework not ideal for iterative algorithms.
4. Security: Inherits security vulnerabilities from Java and lacks default encryption.
Q2. Explain CAP theorem and explain how NoSQL systems guarantee BASE property. (5) Dec/May
2023.
1. The CAP theorem also known as Brewer's theorem, states that in a distributed system, it is impossible
to simultaneously achieve all three properties: consistency, availability and partition tolerance.
2. This theorem highlights the trade-offs that distributed systems face when designing for these
properties.
Consistency ensures that all nodes in the system see the same data at the same time, availability
guarantees that every request receives a response, and partition tolerance allows the system to continue
operating despite network failures.
How NoSQL systems guarantee the BASE property:
1. NoSQL databases, like MongoDB and Cassandra, adhere to the BASE (Basically Available, Soft
state, Eventually consistent) properties instead of the ACID (Atomicity, Consistency, Isolation,
Durability) properties of traditional databases.
2. NoSQL systems prioritize availability and partition tolerance over strong consistency, aiming for
eventual consistency after network partitions are resolved.
Q3. NoSQL data stores with example. (5)(10) May Dec 2023
1. Document Data Store:
A document database is a type of NoSQL database that can be used to store and query data as JSON-like documents.
JavaScript Object Notation (JSON) is an open data interchange format that is both human and
machine-readable.
Developers can use JSON documents in their code and save them directly into the document database.
The flexible, semi-structured, and hierarchical nature of documents and document databases allows
them to evolve with applications' needs. Example: MongoDB (a short sketch follows this list).
2. Column Family Data store:
Column family data stores arrange data into logically related columns grouped into "column
families."
Instead of organizing information primarily by rows, they organize it by columns.
They include column families, keys, keyspaces and columns.
Columns of the same type benefit from more efficient compression, which makes
reads faster. Example: HBase, Bigtable
3. Key-Value Stores:
A key-value store is a non-relational database.
The simplest form of a NoSQL database is a key-value store.
Every data element in the database is stored in key-value pairs.
The data can be retrieved by using a unique key allotted to each element in the database. Example:
Redis, DynamoDB
4. Graph-Based databases:
Graph-based databases focus on the relationships between elements.
They store the data in the form of nodes in the database.
The connections between the nodes are called links or relationships. Example: Neo4j, OrientDB
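To make the document model from the first item above concrete, here is a minimal, self-contained Python sketch; the book titles, fields and the find_by_tag helper are made-up illustrations, and a real document database such as MongoDB would offer an equivalent query API over stored JSON documents.

```python
import json

# Two JSON-like documents; note that they need not share the same fields (schema flexibility).
books = [
    {"_id": 1, "title": "Big Data Basics", "tags": ["hadoop", "hdfs"], "year": 2020},
    {"_id": 2, "title": "Stream Mining", "author": {"name": "A. Kumar"}, "tags": ["dgim"]},
]

# A simple "query": find documents carrying a given tag, similar in spirit to a
# document-database query such as find({"tags": "hadoop"}).
def find_by_tag(docs, tag):
    return [d for d in docs if tag in d.get("tags", [])]

print(json.dumps(find_by_tag(books, "hadoop"), indent=2))
```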
Q4. Explain the distributed storage system of Hadoop with the help of a neat diagram (10) May 2023
Distributed Storage System of Hadoop (HDFS)
In HDFS, data is stored across multiple machines in a cluster to handle very large amounts of data
efficiently.
The system consists of two main components: NameNode and DataNode.
1. NameNode:
The NameNode acts as the master server in the HDFS architecture.
It manages the file system namespace, controls client access to files, and oversees file operations like
renaming, opening and closing files.
The NameNode stores metadata about the data, such as file names, sizes, and block locations.
2. DataNode:
DataNodes are commodity hardware nodes that store and manage the actual data blocks.
For every node in the cluster, there is a corresponding DataNode responsible for read-write operations on
the file system.
DataNodes perform tasks like block creation, deletion and replication based on instructions from the
NameNode.
3. Block:
Files in HDFS are divided into segments called blocks.
Blocks are the minimum unit of data that HDFS can read or write.
The default block size in HDFS is 64 MB (128 MB in Hadoop 2.x and later), but it can be adjusted based on configuration needs.
[Diagram: Hadoop Distributed File System architecture, showing the NameNode, DataNodes and replicated blocks]
Q5. List five services of Apache hadoop. Explain Different Hadoop Components (Same).
1. HDFS:
HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large
datasets of structured or unstructured data across various nodes, thereby maintaining the metadata in the
form of log files.
HDFS consists of two core components i.e. Name node and Data Node
2. YARN:
Yet Another Resource Negotiator (YARN), as the name implies, helps manage the resources across the
clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components, i.e. Resource Manager, Node Manager and Application Manager
3. MapReduce:
By making use of distributed and parallel algorithms, MapReduce carries the processing logic to the
data and helps write applications that transform big datasets into manageable ones.
MapReduce makes use of two functions, i.e. Map() and Reduce()
4. PIG:
It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig does the work of executing commands and in the background, all the activities of MapReduce are
taken care of. After the processing, pig stores the result in HDFS.
5. HIVE:
With the help of SQL methodology and interface, HIVE performs reading and writing of large data sets.
However, its query language is called HQL (Hive Query Language).
It is highly scalable, as it allows both real-time and batch processing. Also, all SQL
datatypes are supported by Hive, thus making query processing easier.
6. HBase:
It’s a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop
database. It provides the capabilities of Google’s Bigtable and is therefore able to work on Big Data sets effectively.
Q1. Describe the pseudocode for one-step matrix multiplication using mapreduce. (10) Dec 2023
Apply the same to determine the product of matrices M and N: (10) Dec 2023
M = [[1, 2, 3], [4, 5, 6]] (2×3)    N = [[1, 2], [3, 4], [5, 6]] (3×2)
Show output of each stage distinctly.
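Below is a hedged Python sketch of the standard one-step (single MapReduce job) matrix multiplication: each map call emits ((i, k), (tag, j, value)) pairs, the shuffle groups them by output cell (i, k), and each reduce call pairs the M and N values on j and sums the products. The shuffle is simulated in memory, and the matrices are assumed to be the 2×3 and 3×2 reading of M and N reconstructed above.

```python
from collections import defaultdict

# In-memory simulation of one-step MapReduce matrix multiplication.
# M is assumed to be 2x3 and N to be 3x2 (see the reconstruction in the question).
M = [[1, 2, 3],
     [4, 5, 6]]
N = [[1, 2],
     [3, 4],
     [5, 6]]
n_rows, n_mid, n_cols = len(M), len(N), len(N[0])

def map_phase():
    """Emit ((i, k), (tag, j, value)) for every element of M and N."""
    for i in range(n_rows):
        for j in range(n_mid):
            for k in range(n_cols):
                yield (i, k), ("M", j, M[i][j])
    for j in range(n_mid):
        for k in range(n_cols):
            for i in range(n_rows):
                yield (i, k), ("N", j, N[j][k])

def reduce_phase(grouped):
    """For each output cell (i, k), pair M and N values on j and sum the products."""
    result = {}
    for (i, k), values in grouped.items():
        m_vals = {j: v for tag, j, v in values if tag == "M"}
        n_vals = {j: v for tag, j, v in values if tag == "N"}
        result[(i, k)] = sum(m_vals[j] * n_vals[j] for j in m_vals if j in n_vals)
    return result

grouped = defaultdict(list)            # shuffle stage: group mapper output by key (i, k)
for key, value in map_phase():
    grouped[key].append(value)

print(reduce_phase(grouped))           # {(0, 0): 22, (0, 1): 28, (1, 0): 49, (1, 1): 64}
```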
Q3. Show any 5 different relational algebra operations with example. (10) Dec 2023
Q4. For the graph given below use Clique percolation and find all communities. (10) Dec May 2023
Q5. Discuss Matrix-Matrix Multiplication. Perform Matrix Multiplication with 1-step Map Reduce
method. (10) May 2023
Q6. Explain Grouping and Aggregation algorithm using MapReduce. Support your answer with a
suitable example. (10) May 2023
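A small Python simulation of grouping and aggregation with MapReduce is sketched below; the toy relation, its field names and the SUM aggregate are assumptions chosen only to show the map, shuffle and reduce stages (any of MIN, MAX, COUNT or AVG could be used in the reducer instead).

```python
from collections import defaultdict

# Toy relation R(department, salary); the records are made up for illustration.
# The job computes, in effect: SELECT department, SUM(salary) FROM R GROUP BY department.
records = [("sales", 100), ("hr", 80), ("sales", 120), ("it", 90), ("hr", 60)]

def map_fn(record):
    dept, salary = record
    yield dept, salary                 # key = grouping attribute, value = attribute to aggregate

def reduce_fn(key, values):
    return key, sum(values)            # aggregation function: SUM

groups = defaultdict(list)
for record in records:
    for key, value in map_fn(record):
        groups[key].append(value)      # shuffle: collect mapper output by key

print(dict(reduce_fn(k, v) for k, v in groups.items()))
# {'sales': 220, 'hr': 140, 'it': 90}
```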
Q7. Explain Map Reduce.
Q2. Summarize Bloom’s filter with example and its applications. (10) Dec 2023
1. A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a
member of a set.
2. For example, checking the availability of a username is a set-membership problem, where the set is the list of all
registered usernames.
3. It is probabilistic in nature, which means there might be some false positive results.
4. A false positive means it might report that a given username is already taken when actually it is not.
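A minimal Bloom filter sketch in Python is given below; the bit-array size m, the number of hash functions k, and the use of salted SHA-256 digests are illustrative choices rather than a prescribed design.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array with k hash functions.
    False positives are possible, false negatives are not."""
    def __init__(self, m=64, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Derive k bit positions by salting the item with the hash-function index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos                  # set the bit at each hashed position

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

# Username-availability check, as in the example above.
taken = BloomFilter()
for name in ["alice", "bob", "carol"]:
    taken.add(name)

print(taken.might_contain("alice"))   # True (definitely added)
print(taken.might_contain("dave"))    # usually False; a True here would be a false positive
```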
Applications:
Medium uses Bloom filters for recommending posts to users by filtering out posts which have already been seen by the
user.
Quora implemented a shared bloom filter in the feed backend to filter out stories that people have seen
before.
The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
Google Bigtable, Apache HBase, Apache Cassandra and PostgreSQL use Bloom filters to reduce the
disk lookups for non-existent rows or columns.
Q3. Explain the DGIM algorithm. State the rules used in DGIM that must be followed. (10) Dec 2023
1. DGIM algorithm (Datar-Gionis-Indyk-Motwani Algorithm)
2. In its simplest version, the DGIM algorithm represents an N-bit window using O(log² N) bits.
3. The algorithm's estimation error is no more than 50%.
4. The two basic components of this algo are Timestamp and Bucket.
5. The first bit is assigned timestamp 1, the second bit is assigned timestamp 2 and so on.
6. The windows are divided into buckets consisting of 1’s and 0's.
Rules:
Every bucket should contain at least a single 1 in it (there are no buckets of all 0's)
The right end of every bucket should be a 1
The size of a bucket is equal to the number of 1's in it
Every bucket size should be a power of 2
As we move to the left, the bucket size should not decrease
No more than two buckets can have the same size
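The following simplified Python sketch maintains DGIM buckets as (timestamp, size) pairs according to the rules above; the window length and the bit stream below are made-up examples.

```python
class DGIM:
    """Simplified DGIM: each bucket is (timestamp of its most recent 1, size),
    sizes are powers of two, and at most two buckets of any size are kept."""
    def __init__(self, window):
        self.window = window
        self.time = 0
        self.buckets = []                     # newest bucket first

    def add(self, bit):
        self.time += 1
        # Drop buckets whose most recent 1 has fallen out of the window.
        self.buckets = [(t, s) for t, s in self.buckets if t > self.time - self.window]
        if bit == 1:
            self.buckets.insert(0, (self.time, 1))
            self._merge()

    def _merge(self):
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 1][1] == self.buckets[i + 2][1]:
                # Three buckets of equal size: merge the two OLDER ones into one of double
                # size, keeping the more recent of their timestamps.
                t_newer = self.buckets[i + 1][0]
                size = self.buckets[i + 1][1] * 2
                self.buckets[i + 1:i + 3] = [(t_newer, size)]
            else:
                i += 1

    def count_ones(self):
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        return total - self.buckets[-1][1] // 2   # count only half of the oldest bucket

dgim = DGIM(window=10)
for bit in [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]:
    dgim.add(bit)
print(dgim.count_ones())   # prints 6; the true count of 1's in the last 10 bits is also 6
```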
Q4. Give two applications for counting the number of 1’s in a long stream of binary values. Using a
stream of binary digits, illustrate how DGIM will find the number of 1’s. (10) May 2023
https://www.youtube.com/watch?v=Z_MLrbI1s2E
Applications:
Social Media Analytics: In social media analytics, the DGIM algorithm can be applied to estimate
the popularity or engagement level of posts, hashtags, or topics over time.
Network Traffic Monitoring: In network traffic monitoring systems, the DGIM algorithm can be
used to estimate the number of active connections or the volume of traffic passing through a network
link within a certain time window.
Q5. Suppose a data stream consists of integers 1,3,5,4,6,1,5,9,3,2. Let the hash function used be: (10)
May 2023
h1(x) = (x + 1) mod 16
h2(x) = (2x + 3) mod 16
h3(x) = (3x + 1) mod 16
Show how the Flajolet-Martin algorithm will estimate the number of distinct elements in the
stream. (10) May 2023
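A small script applying the Flajolet-Martin estimate to this stream is sketched below. Here r(x) is the number of trailing zeros in the binary hash value and the estimate is 2 to the power of the maximum r seen; treating a hash value of 0 (binary 0000) as having 4 trailing zeros is an assumed convention, since the hash range 0 to 15 uses 4 bits, so the resulting numbers should be checked against the convention used in class.

```python
stream = [1, 3, 5, 4, 6, 1, 5, 9, 3, 2]   # true number of distinct elements: 7

hash_functions = {
    "h1(x) = (x + 1) mod 16":  lambda x: (x + 1) % 16,
    "h2(x) = (2x + 3) mod 16": lambda x: (2 * x + 3) % 16,
    "h3(x) = (3x + 1) mod 16": lambda x: (3 * x + 1) % 16,
}

def trailing_zeros(value, bits=4):
    """Number of trailing zeros in the binary representation (0 counts as `bits` zeros)."""
    if value == 0:
        return bits
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count

for name, h in hash_functions.items():
    r_max = max(trailing_zeros(h(x)) for x in stream)
    print(f"{name}: R = {r_max}, estimate = 2^{r_max} = {2 ** r_max}")
```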
Chapter 5 : Big Data Mining Algorithms
Q1. Explain PCY algorithm and its 2 types with neat labeled diagram. Dec 2023
1. The PCY algorithm uses the main-memory space left unused on the first pass for an array of integer counts (a hash table of buckets) that generalizes the idea of a Bloom filter.
2. This algorithm exploits the observation that there may be much unused space in main memory on the first
pass.
3. If there are a million items and gigabytes of main memory, we do not need more than 10% of the main
memory for the two tables.
4. The PCY algorithm uses hashing to efficiently count item set frequencies and reduce overall
computational cost.
5. The basic idea is to use a hash function to map pairs of items to buckets and to keep a count per bucket on the
first pass; on the second pass, only pairs that hash to a frequent bucket are counted as candidates.
https://youtu.be/OLu-Bsx-e0Q?si=3RrOrmB8_9WzgdK6
Types are not given explicitly in the book; the two refinements of the PCY algorithm are the Multistage and Multihash algorithms.
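A compact Python sketch of the two basic PCY passes is given below; the baskets, the support threshold, the number of buckets and the pair hash function are all illustrative assumptions, not values from the book.

```python
from collections import defaultdict
from itertools import combinations

# Toy market-basket data (made up for illustration).
baskets = [
    {1, 2, 3}, {1, 2, 4}, {2, 3, 4}, {1, 2}, {2, 3},
    {1, 3, 4}, {1, 2, 3, 4}, {2, 4},
]
SUPPORT = 3
NUM_BUCKETS = 7

def bucket_of(i, j):
    return (i * j) % NUM_BUCKETS       # pair hash function (an assumption)

# Pass 1: count single items and hash every pair into a bucket counter.
item_counts = defaultdict(int)
bucket_counts = defaultdict(int)
for basket in baskets:
    for item in basket:
        item_counts[item] += 1
    for i, j in combinations(sorted(basket), 2):
        bucket_counts[bucket_of(i, j)] += 1

frequent_items = {i for i, c in item_counts.items() if c >= SUPPORT}
bitmap = {b for b, c in bucket_counts.items() if c >= SUPPORT}     # frequent buckets only

# Pass 2: count a pair only if both items are frequent AND its bucket is frequent.
pair_counts = defaultdict(int)
for basket in baskets:
    for i, j in combinations(sorted(basket), 2):
        if i in frequent_items and j in frequent_items and bucket_of(i, j) in bitmap:
            pair_counts[(i, j)] += 1

print({pair: c for pair, c in pair_counts.items() if c >= SUPPORT})   # frequent pairs
```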
Q4. Explain CURE algorithm, clearly stating its advantages over traditional clustering algorithm. (10)
May 2023
1. It is a hierarchical based clustering technique, that adopts a middle ground between the centroid
based and the all-point extremes.
2. Hierarchical clustering is a type of clustering that starts with single-point clusters and keeps merging
clusters until the desired number of clusters is formed.
3. It is used for identifying both spherical and non-spherical clusters.
4. It is useful for discovering groups and identifying interesting distributions in the underlying data.
5. Instead of using a single centroid point, as most data mining algorithms do, CURE uses a set of well-scattered
representative points for efficiently handling the clusters and eliminating the outliers.
Advantages:
1. Handling Arbitrary Shape Clusters: CURE addresses this issue by using a hierarchical approach that
employs a combination of partitioning and hierarchical clustering.
2. Scalability: CURE is designed to be more scalable compared to some traditional clustering algorithms.
3. Outlier Robustness: CURE's use of representative points helps mitigate the impact of outliers by
focusing on the dense regions of clusters
4. Parameter Insensitivity: CURE is relatively less sensitive to parameter choices due to its hierarchical
nature and the use of representative points.
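The core CURE idea, well-scattered representative points shrunk toward the cluster centroid, can be sketched as below; the number of representatives c, the shrink factor alpha, the toy 2-D clusters and the single assignment step are simplified assumptions rather than the full hierarchical algorithm.

```python
def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def representatives(cluster, c=3, alpha=0.3):
    """Pick c well-scattered points of a cluster, then shrink each toward the centroid by alpha."""
    centroid = (sum(p[0] for p in cluster) / len(cluster),
                sum(p[1] for p in cluster) / len(cluster))
    reps = []
    for _ in range(min(c, len(cluster))):
        # Farthest-point heuristic: pick the point farthest from the reps chosen so far
        # (farthest from the centroid for the first pick).
        candidate = max(cluster,
                        key=lambda p: min((dist(p, r) for r in reps), default=dist(p, centroid)))
        reps.append(candidate)
    # Shrinking toward the centroid dampens the effect of outliers.
    return [(r[0] + alpha * (centroid[0] - r[0]),
             r[1] + alpha * (centroid[1] - r[1])) for r in reps]

# Two toy clusters; a new point is assigned to the cluster with the closest representative point.
cluster_a = [(0, 0), (1, 0), (0, 1), (1, 1)]
cluster_b = [(8, 8), (9, 8), (8, 9), (9, 9)]
reps = {"A": representatives(cluster_a), "B": representatives(cluster_b)}

new_point = (2, 2)
closest = min(reps, key=lambda label: min(dist(new_point, r) for r in reps[label]))
print(closest)   # 'A'
```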
Multistage algorithm, final pass: After the second pass, the second hash table is also summarized as a bitmap, and that
bitmap is stored in main memory.
MultiHash Algorithm:
1. In some scenarios, a single pass can provide most of the benefit of the two or more passes of the Multistage
algorithm. This variation of the PCY algorithm is known as the Multihash algorithm.
2. It suggests using two hash functions and two separate hash tables that share main memory on the first pass,
rather than using two different hash tables on two successive passes.
3. In the second step of the Multihash algorithm, each hash table is transformed into a bitmap. Since each of the
two hash tables is half the size, the space needed for the two bitmaps together is the same as that required for a
single full-size bitmap.
4. In the third step of the Multihash algorithm, for a pair {i, j} to be counted, both i and j must be frequent items,
and the pair must hash to a bucket that is frequent in both hash tables, ensuring coordination between the two
tables.
Q1. Explain collaborative filtering. How it is different from content based filtering?(10) May Dec 2023
1. Collaborative filtering is a recommendation system that suggests items to a user based on similar users'
preferences.
2. This system does not use the attributes or features of the items to make recommendations but instead
uses the past behavior of users to identify similar users and recommend items that similar users have
liked.
3. There are two main types of collaborative filtering:
a. User-based collaborative filtering: This method finds similar users based on their past interactions
with items and then recommends items that similar users have liked. For example, if two users have
similar viewing histories on Netflix, the system may recommend the same movie to both users.
b. Item-based collaborative filtering: This method finds similar items based on how users have
interacted with them and then recommends those similar items to a user. For example, if a user has
liked several movies of a particular genre, the system may recommend other movies of that genre to
the user.
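A minimal user-based collaborative filtering sketch in Python follows; the users, movies, ratings and the choice of cosine similarity are illustrative assumptions.

```python
import math

# Made-up user-item rating data.
ratings = {
    "alice": {"Inception": 5, "Interstellar": 4, "Titanic": 1},
    "bob":   {"Inception": 4, "Interstellar": 5, "Dunkirk": 4},
    "carol": {"Titanic": 5, "Notebook": 4},
}

def cosine(u, v):
    """Cosine similarity between two users' rating vectors (over their common items)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = math.sqrt(sum(r * r for r in u.values())) * math.sqrt(sum(r * r for r in v.values()))
    return num / den

def recommend(user, k=1):
    """Recommend items rated by the k most similar users that `user` has not rated yet."""
    others = sorted(((cosine(ratings[user], ratings[o]), o) for o in ratings if o != user),
                    reverse=True)
    seen = set(ratings[user])
    recs = []
    for _, other in others[:k]:
        recs.extend(item for item in ratings[other] if item not in seen)
    return recs

print(recommend("alice"))   # ['Dunkirk']: bob is alice's most similar user
```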
Difference:
Content-based filtering is suitable for items that are easy to describe and compare by their features,
while collaborative filtering can offer more flexibility for items with diverse or complex features.
Content-based filtering is suitable for providing personalized recommendations that match user
preferences and interests, while collaborative filtering can provide surprising and diverse
recommendations that expose users to new or popular items.
Q2. Define Hub and Authority. Compute the Hub and Authority scores for the web: (10) Dec 2023
Practice the HITS algorithm and Hub/Authority score numericals (very important numerical, Chapter 6).
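Since the actual graph for this numerical is not reproduced in these notes, the sketch below runs two HITS iterations on a made-up link graph; normalizing by the vector length each iteration is one common convention and may differ from the one used in class.

```python
# Toy link graph: page -> list of pages it links to (made up for illustration).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(graph)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(2):                               # two iterations, as the question asks
    # Authority score: sum of hub scores of pages that link TO the page.
    auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
    # Hub score: sum of authority scores of the pages the page links to.
    hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
    # Normalize so scores stay comparable across iterations.
    a_norm = sum(v * v for v in auth.values()) ** 0.5
    h_norm = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print("authority:", {p: round(v, 3) for p, v in auth.items()})
print("hub:", {p: round(v, 3) for p, v in hub.items()})
```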
Q3. Structure of the web (5) Dec 2023
Q4. Explain what characteristics of social media makes it suitable for Big Data. (5) May 2023
1. One of the essential characteristics of Big Data originating from social media is that it is real-time or
near real-time.
2. This gives exploratory analysis a wide perspective on what is happening and what is about to happen
at a certain time in a certain area.
3. Each fundamental trait of Big Data can be understood as a parameter for quantitative, qualitative and
exploratory information analysis.
Volume - There are two types of data that social media platforms collect: structured and unstructured. For
social scientists, the total mass of the data allows the definition of multiple classes and criteria and the refining
of analysis sets and subsets.
Variety - The data formats vary from text documents, tables to video data, audio data and many more. This
lifts the data analysis to a higher complexity level; therefore, the statistical models will also be adjusted in
order to obtain viable information.
Velocity - Speed is a key aspect in trend and real-life phenomena analysis. The faster the data is generated,
shared and understood the more information it can reveal.
Veracity - For the seasoned data analyst it is essential to be able to evaluate the truthfulness, the accuracy
and honesty of the data put to analysis.
Q5. Recall HITS algorithm. Generate Hub and Authority score after 2 iterations for the graph given
here. (10) May 2023
Practice the HITS algorithm and Hub/Authority score numericals (very important numerical, Chapter 6).