
Big Data Assignment Solutions

Question 1 — Big Data Pipeline for Student Academic Performance Prediction (50 marks)

Problem statement & objectives:


Goal: identify students at risk of failing by analyzing heterogeneous student data
(attendance, internal marks, lab performance, assignment scores, demographics, prior
academic history) stored as CSV / JSON / XML across multiple sources.
Deliverable: a pipeline that ingests, cleans, and integrates the data, runs large-scale
analytics, and produces a ranked list of at-risk students plus features for further ML modeling.

Data ingestion & schema / ETL design:


Sources: CSV, JSON, and XML files from multiple systems.
Ingestion: Apache NiFi or Flume lands the raw files in HDFS.
Cleaning & normalization: unify the schema, fill missing keys, normalize student IDs and
timestamps, and store the curated datasets in Parquet/ORC (see the HiveQL sketch below).
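
A minimal HiveQL sketch of this curation step, assuming a hypothetical raw CSV layout with columns student_id, course_id, attendance_pct, internal_marks, and event_ts (names chosen for illustration, not given in the assignment):

-- Raw CSV landed in HDFS by NiFi/Flume (assumed layout and column names)
CREATE EXTERNAL TABLE raw_student_records (
  student_id     STRING,
  course_id      STRING,
  attendance_pct DOUBLE,
  internal_marks DOUBLE,
  event_ts       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/student_records';

-- Curated copy in a columnar format for downstream analytics
CREATE TABLE curated_student_records STORED AS ORC AS
SELECT
  upper(trim(student_id)) AS student_id,
  course_id,
  attendance_pct,
  internal_marks,
  -- assumes 'yyyy-MM-dd HH:mm:ss' timestamps in the raw export
  to_date(from_unixtime(unix_timestamp(event_ts))) AS event_date
FROM raw_student_records
WHERE student_id IS NOT NULL;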

Using MapReduce or Hive to identify students at risk:


HiveQL computes per-student aggregates and a composite risk score (see the sketch below).
MapReduce handles custom parsing (e.g., of malformed records) when necessary.
Hive is preferred for its simplicity and its integration with BI tools.
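
A hedged HiveQL sketch of the risk-scoring query over the curated table sketched above; the 50/50 weighting and the 75% / 40% thresholds are illustrative assumptions, not values given in the assignment:

-- Rank students by a simple weighted risk score (weights and thresholds are illustrative)
SELECT
  student_id,
  AVG(attendance_pct) AS avg_attendance,
  AVG(internal_marks) AS avg_marks,
  0.5 * (100 - AVG(attendance_pct)) + 0.5 * (100 - AVG(internal_marks)) AS risk_score
FROM curated_student_records
GROUP BY student_id
HAVING AVG(attendance_pct) < 75 OR AVG(internal_marks) < 40
ORDER BY risk_score DESC
LIMIT 100;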

Data structures & partitioning:


Use columnar formats (Parquet/ORC) for analytics; partition by time or department and
bucket by student_id to speed up joins (see the table definition below).
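
A minimal sketch of such a table layout; the academic_year/department partition keys and the bucket count of 32 are illustrative choices:

CREATE TABLE student_performance (
  student_id     STRING,
  course_id      STRING,
  attendance_pct DOUBLE,
  internal_marks DOUBLE
)
PARTITIONED BY (academic_year STRING, department STRING)  -- partition by time or department
CLUSTERED BY (student_id) INTO 32 BUCKETS                 -- bucketing speeds up joins on student_id
STORED AS ORC;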

Tool justification:
Hive for SQL-style analytics, MapReduce for custom logic.
HBase for low-latency profile lookups, NiFi/Flume/Kafka for ingestion.

Question 2 — Social Media Sentiment Analysis using Apache Pig (50 marks)

Problem & data understanding:


Goal: classify tweets/comments as positive, negative, or neutral for brand sentiment
tracking.
Data: unstructured, noisy, and often multilingual text.

Processing with Apache Pig:


Steps: ingestion, cleaning, tokenization, stopword removal, lexicon join, sentiment scoring,
aggregation.
Pig handles tokenization with TOKENIZE, filtering with FILTER, and aggregation with
GROUP.

Sample Pig Latin script:


REGISTER 'sentiment_udfs.py' USING jython AS s_udf;
... (script omitted for brevity; includes cleaning, tokenization, lexicon join, scoring,
classification).

Pig vs RDBMS:
Pig better for large, semi-structured, batch workloads.
RDBMS better for small, structured, low-latency queries.

Question 3 — Comparative Analysis: Pig vs Hive for Retail Sales Analytics (50 marks)

Dataset: sales.csv, products.csv, customers.csv.


Tasks: top-selling products, monthly revenue trends.

Pig approach: a procedural ETL script built from JOIN, GROUP, SUM, ORDER, and LIMIT.
Hive approach: declarative SQL using JOIN, GROUP BY, ORDER BY, and LIMIT (see the sample queries below).
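
Hedged HiveQL sketches of both tasks, assuming hypothetical column names (product_id, product_name, quantity, unit_price, sale_date) on the sales and products tables:

-- Task 1: top 10 selling products by units sold
SELECT p.product_name, SUM(s.quantity) AS total_qty
FROM sales s
JOIN products p ON s.product_id = p.product_id
GROUP BY p.product_name
ORDER BY total_qty DESC
LIMIT 10;

-- Task 2: monthly revenue trend
SELECT date_format(s.sale_date, 'yyyy-MM') AS sale_month,
       SUM(s.quantity * s.unit_price)      AS revenue
FROM sales s
GROUP BY date_format(s.sale_date, 'yyyy-MM')
ORDER BY sale_month;

The Pig version would express the same joins and aggregations as an explicit sequence of steps, which is what the comparison below refers to.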

Comparison:
Pig excels at ETL and complex record-level transformations.
Hive excels at declarative analytics and BI integration.
Recommendation: Hive for retail analytics, given its SQL expressiveness and query-optimizer performance.

Question 4 — Smart City Sensor Data Analysis (50 marks)

Architecture:
Sensors -> Kafka/NiFi -> HDFS/Time-series DB -> Spark Streaming/Flink ->
Hive/Elasticsearch/Grafana.

Data ingestion:
Kafka for real-time streams; NiFi/Flume for batch loads.
Partition Kafka topics by sensor_id or region to spread load while keeping related readings ordered within a partition.

Data cleaning:
Missing value imputation, outlier removal, time alignment, calibration.

Real-time vs Batch:
Real-time for alerts (traffic congestion, pollution spikes).
Batch for long-term trends and model training (see the batch query sketch below).
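
A hedged HiveQL sketch of the batch side, assuming sensor readings land in a date-partitioned Hive table with hypothetical columns sensor_id, region, pm25, and reading_date:

-- Monthly average PM2.5 per region over a date-partitioned sensor table
SELECT region,
       date_format(reading_date, 'yyyy-MM') AS report_month,
       AVG(pm25) AS avg_pm25
FROM sensor_readings
WHERE reading_date >= '2024-01-01'   -- predicate on the partition column enables pruning
GROUP BY region, date_format(reading_date, 'yyyy-MM')
ORDER BY region, report_month;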

Algorithm selection:
Favor linear-time, single-pass algorithms for streaming, partitioning schemes that scale out,
and approximate answers where latency matters more than exactness (see the example below).
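
One concrete form of that trade-off uses Hive's built-in percentile_approx aggregate, which is typically much cheaper than an exact percentile on large data; the pm25 column reuses the illustrative schema above:

-- Approximate 95th-percentile pollution reading per region
SELECT region, percentile_approx(pm25, 0.95) AS p95_pm25
FROM sensor_readings
GROUP BY region;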
