Assignment 2
Sumaira Rafique(22L-8268)
Subject: Big Data
Question 1: Finding User Access Counts in 2023
Answer:
In this task, we want to count how many times each user accessed their Facebook account during
the year 2023. Here's a simple explanation of the solution using a MapReduce algorithm:
Mapper Pseudo-Code:
Mapper(Map):
    for each line in the input:
        parse login name and login time
        if login time is in 2023:
            emit (login name, 1)
The Mapper goes through each log entry, checks if it's in the year 2023, and emits the
user's name as the key and a 1 as the value to indicate an access event.
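As a rough illustration, the Mapper could look like the Python sketch below. The record layout (comma-separated login name, login time, logout time, and a semicolon-separated page list with ISO timestamps) and the name map_access_2023 are assumptions made for this example; the real log format may differ.

```python
from datetime import datetime

def map_access_2023(line):
    """Yield (login_name, 1) when the session's login time falls in 2023."""
    # Assumed layout: login_name,login_time,logout_time,page1;page2;...
    fields = line.strip().split(",")
    if len(fields) < 2:
        return  # skip malformed lines
    login_name, login_time = fields[0], fields[1]
    try:
        year = datetime.fromisoformat(login_time).year
    except ValueError:
        return  # skip lines whose timestamp cannot be parsed
    if year == 2023:
        yield (login_name, 1)

# Hypothetical sample records, for illustration only.
log_lines = [
    "alice,2023-01-05T10:00:00,2023-01-05T10:30:00,news;sports",
    "bob,2022-12-31T23:00:00,2023-01-01T00:10:00,news",
    "alice,2023-06-01T08:00:00,2023-06-01T08:05:00,music",
]
emitted = [pair for line in log_lines for pair in map_access_2023(line)]
print(emitted)  # [('alice', 1), ('alice', 1)]
```

Here bob's session is skipped because it started in 2022; whether a session that spans the year boundary should count is a design choice this sketch settles by looking at the login time only.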
Combiner Pseudo-Code:
Combiner(login name, Count):
    sum = 0
    for each count in Count:
        sum += count
    emit (login name, sum)
The Combiner runs on the output of each Mapper before the data is sent to the Reducers. It
adds up the 1s emitted for each login name and emits (login name, partial count). The count
is only partial because other Mappers may hold further entries for the same user.
Reducer Pseudo-Code:
Reducer(login name, SumCounts):
    total = 0
    for each count in SumCounts:
        total += count
    emit (login name, total)
The Reducer receives all the counts for one user, sums them, and emits the user's name and
their total access count for 2023.
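The Combiner and Reducer steps can be tried out locally with the Python sketch below. It is only a simulation: group_by_key stands in for the framework's shuffle, and the two lists of Mapper output are made-up data. Because addition is associative and commutative, the same summing function can serve as both Combiner and Reducer.

```python
from collections import defaultdict

def sum_counts(login_name, counts):
    """Sum the counts for one user; usable as Combiner and as Reducer."""
    return (login_name, sum(counts))

def group_by_key(pairs):
    """Group values by key, standing in for the framework's shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Hypothetical output of two map tasks.
mapper_outputs = [
    [("alice", 1), ("bob", 1), ("alice", 1)],
    [("alice", 1), ("bob", 1)],
]

# Combiner: runs once per map task and produces partial counts.
combined = []
for output in mapper_outputs:
    for name, counts in group_by_key(output).items():
        combined.append(sum_counts(name, counts))

# Reducer: sums the partial counts for each user across all map tasks.
totals = dict(
    sum_counts(name, counts)
    for name, counts in group_by_key(combined).items()
)
print(totals)  # {'alice': 3, 'bob': 2}
```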
This MapReduce algorithm helps us efficiently count the number of times each user accessed
their Facebook account during the specified year.
Question 2: Counting Distinct Logins for Each Page
Answer:
In this task, we want to find out how many distinct users accessed each Facebook page. Here's a
straightforward explanation of the MapReduce algorithm:
Mapper Pseudo-Code:
Mapper(Map):
    for each line in the input:
        parse login name and pages accessed
        for each page in pages accessed:
            emit (page, login name)
The Mapper goes through each log entry and extracts the page that was accessed and the
login name associated with it. It emits (page, login name) pairs.
Reducer Pseudo-Code:
Reducer(page, login names):
    unique_logins = set()
    for each login name in login names:
        unique_logins.add(login name)
    emit (page, len(unique_logins))
The Reducer takes the emitted (page, login name) pairs from the Mapper. It collects all
the login names associated with each page and adds them to a set to ensure distinct
logins. Then, it counts the size of the set to determine how many distinct users accessed
each page. It emits (page, count of unique logins).
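The Mapper and Reducer for this question can be sketched in Python as follows. The record layout (login name, login time, logout time, semicolon-separated pages) and the function names are assumptions for illustration, and the dictionary of lists replaces the shuffle that a real framework would perform.

```python
from collections import defaultdict

def map_page_logins(line):
    """Yield (page, login_name) for every page listed in one log record."""
    # Assumed layout: login_name,login_time,logout_time,page1;page2;...
    fields = line.strip().split(",")
    if len(fields) < 4:
        return  # skip malformed lines
    login_name = fields[0]
    for page in fields[3].split(";"):
        if page:
            yield (page, login_name)

def reduce_distinct_logins(page, login_names):
    """Count distinct users for one page by collecting names in a set."""
    return (page, len(set(login_names)))

# Hypothetical sample records, for illustration only.
log_lines = [
    "alice,2023-01-05T10:00:00,2023-01-05T10:30:00,news;sports",
    "bob,2023-02-01T09:00:00,2023-02-01T09:20:00,news",
    "alice,2023-06-01T08:00:00,2023-06-01T08:05:00,news;music",
]

# Simulated shuffle: gather all login names per page.
grouped = defaultdict(list)
for line in log_lines:
    for page, login_name in map_page_logins(line):
        grouped[page].append(login_name)

result = dict(
    reduce_distinct_logins(page, names) for page, names in grouped.items()
)
print(result)  # {'news': 2, 'sports': 1, 'music': 1}
```

alice appears twice for the page news but is counted once, which is the effect of the set.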
This MapReduce algorithm counts the number of distinct users who accessed each Facebook page.
It assumes that the set of login names for a single page fits in the Reducer's memory.
Question 3: Calculating Time Spent on Facebook Pages by Each User in 2023
Answer:
In this task, we aim to determine the percentage of time each user spent on Facebook pages in
the year 2023. This will be done using the MapReduce framework with a Pair method. Here's a
simplified explanation of how to approach this:
Mapper Pseudo-Code:
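Mapper(Map):
    for each line in the input:
        parse login name, login time, logout time, and pages accessed
        if the login time is in 2023:
            time_spent = logout time - login time
            for each page in pages accessed:
                emit((page, *), time_spent)
                emit((page, login name), time_spent)
In the Mapper, we examine each log entry. If it falls within the year 2023, we calculate the
time spent by subtracting the login time from the logout time. The log is assumed to record
only session-level times, so the full session duration is attributed to every page accessed
in that session. For each page we emit two pairs: ((page, *), time_spent), which contributes
to the page total, and ((page, login name), time_spent), which contributes to the user's share.
Partitioning and sorting:
    partition on the page component of the key only
    sort (page, *) before every (page, login name)
Reducer Pseudo-Code:
Reducer(SumTimeSpent):
    on key (page, *) with its time spent values:
        total_time_per_page = sum of the time spent values
    on key (page, login name) with its time spent values:
        user_time = sum of the time spent values
        percentage = (user_time / total_time_per_page) * 100
        emit((login name, page), percentage)
Because all keys for one page reach the same Reducer and (page, *) is sorted first, the
Reducer sees the page total before any individual user's time. It stores that total, then
for each (page, login name) key it sums the user's time and divides it by the total time
spent on that page. The result is ((login name, page), percentage) pairs.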
This MapReduce algorithm gives, for each Facebook page, the percentage of the total time
spent on that page in 2023 that is attributable to each user.
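One way to obtain a per-page percentage with pair keys is the order-inversion pattern, in which a special key carrying the page total is sorted ahead of the per-user keys. The Python sketch below simulates this locally. The record layout, the "*" marker, the function names, and the choice to attribute the whole session duration to every page in the session are all assumptions for illustration, and the sort plus groupby stands in for the framework's partitioner and sort.

```python
from datetime import datetime
from itertools import groupby

TOTAL = "*"  # hypothetical marker; assumes no real login name equals "*"

def map_time_spent(line):
    """Yield ((page, '*'), secs) and ((page, user), secs) for 2023 sessions."""
    # Assumed layout: login_name,login_time,logout_time,page1;page2;...
    fields = line.strip().split(",")
    if len(fields) < 4:
        return
    login_name = fields[0]
    login = datetime.fromisoformat(fields[1])
    logout = datetime.fromisoformat(fields[2])
    if login.year != 2023:
        return
    seconds = (logout - login).total_seconds()
    for page in {p for p in fields[3].split(";") if p}:
        yield ((page, TOTAL), seconds)
        yield ((page, login_name), seconds)

def run(log_lines):
    pairs = [pair for line in log_lines for pair in map_time_spent(line)]
    # Sort by page, placing the '*' key first within each page.
    pairs.sort(key=lambda kv: (kv[0][0], kv[0][1] != TOTAL, kv[0][1]))
    percentages = {}
    page_total = 0.0
    for (page, who), group in groupby(pairs, key=lambda kv: kv[0]):
        time_sum = sum(value for _, value in group)
        if who == TOTAL:
            page_total = time_sum  # remembered for the user keys that follow
        elif page_total > 0:
            percentages[(who, page)] = 100.0 * time_sum / page_total
    return percentages

# Hypothetical sample records, for illustration only.
log_lines = [
    "alice,2023-01-05T10:00:00,2023-01-05T10:30:00,news;sports",
    "bob,2023-02-01T09:00:00,2023-02-01T09:10:00,news",
    "carol,2022-03-01T09:00:00,2022-03-01T09:10:00,news",
]
percentages = run(log_lines)
print(percentages)
```

With this data, the page news has 2400 seconds in total, of which alice contributes 1800 (75%) and bob 600 (25%); carol's 2022 session is ignored.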
Question 4: Using the Stripe Approach
Answer:
In this task, we use the Stripe approach to identify patterns of co-occurrence between
Facebook pages, that is, which pages are accessed together in the same session most
frequently. Here's how it works:
Mapper Pseudo-Code:
Mapper(Map):
    for each line in the input:
        parse pages accessed
        page_set = set(pages accessed)
        for page1 in page_set:
            stripe = {}
            for page2 in page_set:
                if page1 != page2:
                    stripe[page2] = 1
            if stripe is not empty:
                emit(page1, stripe)
In the Mapper, we examine each log entry and create a set of the pages accessed during that
session. For each page in the set, we build a stripe, a dictionary that maps every other
page in the same session to a count of 1, and emit (page1, stripe). Sessions with a single
page produce no output because there is nothing to co-occur with.
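The Mapper can be sketched in Python as below. This version emits one dictionary per page that holds all the other pages of the session, which carries the same counts as emitting a separate {page2: 1} for every pair. The record layout and the name map_stripes are assumptions for illustration.

```python
def map_stripes(line):
    """Yield (page1, {page2: 1, ...}) for each page in one session."""
    # Assumed layout: login_name,login_time,logout_time,page1;page2;...
    fields = line.strip().split(",")
    if len(fields) < 4:
        return  # skip malformed lines
    # A set removes repeated visits to the same page within the session.
    page_set = {page for page in fields[3].split(";") if page}
    for page1 in sorted(page_set):
        stripe = {page2: 1 for page2 in page_set if page2 != page1}
        if stripe:  # a single-page session has no co-occurrences
            yield (page1, stripe)

# Hypothetical sample record; "news" is listed twice on purpose.
line = "alice,2023-01-05T10:00:00,2023-01-05T10:30:00,news;sports;music;news"
stripes = dict(map_stripes(line))
for page1, stripe in stripes.items():
    print(page1, stripe)
```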
Combiner Pseudo-Code:
Combiner(page1, co_occurrence_dicts):
    combined_co_occurrence = {}
    for each co_occurrence_dict in co_occurrence_dicts:
        for page2, count in co_occurrence_dict.items():
            if page2 in combined_co_occurrence:
                combined_co_occurrence[page2] += count
            else:
                combined_co_occurrence[page2] = count
    emit(page1, combined_co_occurrence)
In the Combiner, we take the dictionaries emitted by one Mapper for a given page1 and add
them together entry by entry. For example, if 'page1' and 'page2' co-occurred in several
sessions handled by that Mapper, we aggregate those counts under 'page1' as {page2: count}.
Reducer Pseudo-Code:
Reducer(page1, co_occurrence_dicts):
    combined_co_occurrence = {}
    for each co_occurrence_dict in co_occurrence_dicts:
        for page2, count in co_occurrence_dict.items():
            if page2 in combined_co_occurrence:
                combined_co_occurrence[page2] += count
            else:
                combined_co_occurrence[page2] = count
    emit(page1, combined_co_occurrence)
The Reducer receives every dictionary for page1 from all Mappers and merges them in the
same way. The result maps each page to the pages accessed together with it and how often,
helping us identify patterns of co-occurrence.
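The merging step shared by the Combiner and the Reducer can be sketched in Python with collections.Counter, whose update method adds counts key by key. The function name merge_stripes and the sample dictionaries are made up for illustration.

```python
from collections import Counter

def merge_stripes(page1, stripes):
    """Add the co-occurrence dictionaries for page1 entry by entry."""
    merged = Counter()
    for stripe in stripes:
        merged.update(stripe)  # adds counts rather than replacing them
    return (page1, dict(merged))

# Hypothetical dictionaries that reached the Reducer for the page "news".
incoming = [
    {"sports": 1, "music": 1},
    {"sports": 1},
    {"sports": 1, "weather": 1},
]
page, co_occurrence = merge_stripes("news", incoming)
print(page, co_occurrence)

# The page most often accessed together with "news" in this sample.
top_partner = max(co_occurrence, key=co_occurrence.get)
print(top_partner)  # sports
```

Because merging is just element-wise addition, applying it first per Mapper and then again in the Reducer gives the same totals as applying it once in the Reducer.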
This Stripe approach can be very useful for understanding user behavior and discovering
associations between different pages on Facebook. It's a way to explore the relationships and
preferences of users when it comes to accessing various pages on the platform.