Big Data Assignment Solutions
Question 1 — Big Data Pipeline for Student Academic Performance Prediction (50 marks)
Problem statement & objectives:
Goal: predict or identify students at risk of failing by analyzing heterogeneous student data
(attendance, internal marks, lab performance, assignment scores, demographics, prior
history) stored as CSV / JSON / XML across multiple sources.
Deliverable: a pipeline to ingest, clean, integrate, run large-scale analytics and produce a
ranked list of at-risk students plus features for further ML modeling.
Data ingestion & schema / ETL design:
Sources: CSV, JSON, XML
Ingestion: Apache NiFi / Flume to HDFS
Cleaning & normalization: unify schemas across sources, handle missing keys and values,
normalize student IDs and timestamps, store curated datasets in Parquet/ORC formats.
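The cleaning and normalization step can be sketched in plain Python. The field names, ID spellings, and timestamp layouts below are illustrative assumptions about what the raw CSV/JSON/XML records might look like, not the actual source schema:

```python
from datetime import datetime, timezone

# Hypothetical raw records from two different sources; field names are assumed.
RAW = [
    {"student_id": " s-1042 ", "attended": "2024-03-01T09:00:00", "marks": "67"},
    {"StudentID": "S1042", "attended": "01/03/2024 09:00", "marks": None},
]

def normalize_id(rec):
    """Unify the two hypothetical ID spellings into one canonical key."""
    raw = rec.get("student_id") or rec.get("StudentID") or ""
    return raw.strip().replace("-", "").upper()

def normalize_ts(value):
    """Try a few known timestamp layouts; return ISO-8601 UTC or None."""
    for fmt in ("%Y-%m-%dT%H:%M:%S", "%d/%m/%Y %H:%M"):
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc).isoformat()
        except (ValueError, TypeError):
            continue
    return None

def clean(rec):
    return {
        "student_id": normalize_id(rec),
        "attended": normalize_ts(rec.get("attended")),
        # Missing values become None rather than silently dropping the record.
        "marks": int(rec["marks"]) if rec.get("marks") is not None else None,
    }

curated = [clean(r) for r in RAW]
```

In the real pipeline this logic would run inside the NiFi/Spark ETL stage before writing Parquet/ORC; the sketch only shows the shape of the transformation.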
Using MapReduce or Hive to identify students at risk:
HiveQL to compute aggregates and risk scores.
MapReduce for custom parsing if necessary.
Hive preferred for simplicity and integration with BI tools.
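The kind of per-student risk score the HiveQL aggregates would compute can be sketched in Python. The component weights and the 0.5 cutoff are illustrative assumptions, not values fixed by the pipeline:

```python
# Toy student aggregates; in Hive these would come from GROUP BY student_id.
students = [
    {"id": "S1", "attendance_pct": 55.0, "internal_avg": 38.0,
     "assignments_done": 3, "assignments_total": 8},
    {"id": "S2", "attendance_pct": 92.0, "internal_avg": 71.0,
     "assignments_done": 8, "assignments_total": 8},
]

def risk_score(s):
    # Higher score = higher risk; each component is scaled to [0, 1].
    attendance_risk = 1.0 - s["attendance_pct"] / 100.0
    marks_risk = 1.0 - s["internal_avg"] / 100.0
    assignment_risk = 1.0 - s["assignments_done"] / s["assignments_total"]
    # Weights (0.4 / 0.4 / 0.2) are an assumed tuning choice.
    return 0.4 * attendance_risk + 0.4 * marks_risk + 0.2 * assignment_risk

ranked = sorted(students, key=risk_score, reverse=True)
at_risk = [s["id"] for s in ranked if risk_score(s) > 0.5]
```

The ranked list is exactly the deliverable named in the problem statement; in HiveQL the same computation is a weighted-sum expression over grouped aggregates with an ORDER BY on the score.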
Data structures & partitioning:
Columnar formats (Parquet/ORC) for analytics, partition by time or department, bucketing
by student_id for joins.
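Why bucketing by student_id helps joins can be shown with a toy sketch: if both tables use the same hash function and bucket count, matching keys land in the same bucket, so a bucketed join only pairs bucket i with bucket i instead of shuffling everything. The hash function here is a stand-in, not Hive's actual hash:

```python
N_BUCKETS = 4

def bucket_of(student_id, n=N_BUCKETS):
    # Stable hash stand-in; Hive uses its own hash, this is only illustrative.
    return sum(student_id.encode()) % n

marks = [("S1", 67), ("S2", 81), ("S3", 45)]
attendance = [("S1", 0.55), ("S2", 0.92), ("S3", 0.70)]

marks_buckets = [[] for _ in range(N_BUCKETS)]
att_buckets = [[] for _ in range(N_BUCKETS)]
for sid, m in marks:
    marks_buckets[bucket_of(sid)].append((sid, m))
for sid, a in attendance:
    att_buckets[bucket_of(sid)].append((sid, a))

# Join bucket-by-bucket instead of all-against-all.
joined = []
for i in range(N_BUCKETS):
    att = dict(att_buckets[i])
    joined += [(sid, m, att[sid]) for sid, m in marks_buckets[i] if sid in att]
```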
Tool justification:
Hive for SQL-style analytics, MapReduce for custom logic.
HBase for low-latency profile lookups, NiFi/Flume/Kafka for ingestion.
Question 2 — Social Media Sentiment Analysis using Apache Pig (50 marks)
Problem & data understanding:
Goal: classify tweets/comments as positive, negative, or neutral for brand sentiment
tracking.
Data: unstructured, noisy, often multilingual text.
Processing with Apache Pig:
Steps: ingestion, cleaning, tokenization, stopword removal, lexicon join, sentiment scoring,
aggregation.
Pig handles tokenization with TOKENIZE, filtering with FILTER, and aggregation with
GROUP.
Sample Pig Latin script:
REGISTER 'sentiment_udfs.py' USING jython AS s_udf;
... (script omitted for brevity; includes cleaning, tokenization, lexicon join, scoring,
classification).
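Since the full Pig script is omitted, the same cleaning → tokenization → stopword removal → lexicon scoring → classification chain can be sketched in Python. The lexicon entries and stopword list are toy assumptions; in the Pig pipeline both would be files on HDFS joined against the token stream:

```python
import re

# Toy lexicon and stopword list (illustrative values only).
LEXICON = {"love": 2, "great": 1, "bad": -1, "terrible": -2}
STOPWORDS = {"the", "is", "a", "this", "i"}

def classify(text):
    # Clean: lowercase, strip URLs and @mentions with rough regexes.
    text = re.sub(r"https?://\S+|@\w+", "", text.lower())
    # Tokenize (Pig: TOKENIZE) and drop stopwords (Pig: FILTER).
    tokens = [t for t in re.findall(r"[a-z]+", text) if t not in STOPWORDS]
    # Lexicon join + sum (Pig: JOIN with the lexicon, then GROUP + SUM).
    score = sum(LEXICON.get(t, 0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

tweets = ["I love this brand, great service!", "terrible support @acme", "just ordered"]
labels = [classify(t) for t in tweets]
```

Each step maps directly onto a Pig operator as noted in the comments; the Python version just makes the dataflow concrete on three example tweets.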
Pig vs RDBMS:
Pig better for large, semi-structured, batch workloads.
RDBMS better for small, structured, low-latency queries.
Question 3 — Comparative Analysis – Pig vs Hive for Retail Sales Analytics (50 marks)
Dataset: sales.csv, products.csv, customers.csv.
Tasks: top-selling products, monthly revenue trends.
Pig approach: procedural ETL with JOIN, GROUP, SUM, ORDER, LIMIT.
Hive approach: SQL queries with GROUP BY, ORDER BY, LIMIT.
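Both tasks reduce to the same group-and-aggregate pattern, sketched here in Python over toy rows whose columns mirror an assumed sales.csv layout (product_id, date, quantity, unit_price):

```python
from collections import Counter, defaultdict

# Toy sales rows; the column layout is an assumption about sales.csv.
sales = [
    ("P1", "2024-01-05", 2, 10.0),
    ("P2", "2024-01-17", 1, 25.0),
    ("P1", "2024-02-02", 5, 10.0),
]

# Top-selling products by units sold (Pig: GROUP + SUM + ORDER + LIMIT).
units = Counter()
for pid, _, qty, _ in sales:
    units[pid] += qty
top3 = units.most_common(3)

# Monthly revenue trend (Hive: GROUP BY substr(sale_date, 1, 7)).
revenue = defaultdict(float)
for _, date, qty, price in sales:
    revenue[date[:7]] += qty * price
```

The comments note the Pig and Hive counterparts of each aggregation; the difference between the two tools is how this dataflow is expressed (procedural relations vs a declarative query), not what it computes.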
Comparison:
Pig excels in ETL and complex record transformations.
Hive excels in declarative analytics and BI integration.
Recommendation: Hive for retail analytics due to SQL expressiveness and optimizer
performance.
Question 4 — Smart City Sensor Data Analysis (50 marks)
Architecture:
Sensors -> Kafka/NiFi -> HDFS/Time-series DB -> Spark Streaming/Flink ->
Hive/Elasticsearch/Grafana.
Data ingestion:
Kafka for real-time, NiFi/Flume for batch.
Partition Kafka by sensor_id or region.
Data cleaning:
Missing value imputation, outlier removal, time alignment, calibration.
Real-time vs Batch:
Real-time for alerts (traffic congestion, pollution spikes).
Batch for long-term trends, model training.
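The real-time alerting side can be sketched as a sliding-window check, the basic shape of what a Spark Streaming or Flink job would evaluate per sensor. The window size of 5 readings and the 100 µg/m³ PM2.5 threshold are assumed values:

```python
from collections import deque

WINDOW, THRESHOLD = 5, 100.0  # assumed window size and alert threshold

def alerts(stream):
    """Yield the timestamps where the windowed mean exceeds the threshold."""
    window, fired = deque(maxlen=WINDOW), []
    for t, value in stream:
        window.append(value)
        if len(window) == WINDOW and sum(window) / WINDOW > THRESHOLD:
            fired.append(t)
    return fired

# Toy PM2.5 stream: a pollution spike builds at t=3..6, then subsides.
stream = list(enumerate([40, 60, 80, 120, 150, 160, 170, 50]))
spikes = alerts(stream)
```

In a real streaming job the window would be time-based and keyed by sensor_id, with alerts pushed to a dashboard or notification topic rather than collected in a list.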
Algorithm selection:
Prefer linear-time, single-pass algorithms for streaming workloads; partition work (by
sensor or region) so it scales out; accept approximate answers (e.g., sampling or
sketch-based summaries) where low latency matters more than exact results.
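As one concrete example of a linear-time, constant-memory streaming algorithm, Welford's method computes a running mean and variance in a single pass, never holding the full stream:

```python
def running_stats(stream):
    """Welford's single-pass mean/variance: O(n) time, O(1) memory."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / n if n else 0.0  # population variance
    return n, mean, variance

n, mean, var = running_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

The same pattern, update a small summary per element and discard the element, underlies the sketch-based approximations (counts, quantiles, distinct counts) mentioned above.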