
Big Data Assignment Revised

The document discusses analyzing an Indian election data set from 2014 and 2019 containing information on states, cities, candidates, political parties, votes. It describes using Apache Hive to perform SQL-like queries to draw patterns and conclusions from the large data, such as which states had maximum voting and comparing results to the 2014 election. A blueprint for the analysis includes comparing vote share by party, margin of winning, plots, and a correlation matrix.

Uploaded by

Harshit Sukhija

Indian General Election

What is a data set

A data set is a collection of related, discrete items of data that may be
accessed individually or in combination, or managed as a whole entity.
A data set is organized into some type of data structure. In a database, for
example, a data set might contain a collection of business data (names, salaries,
contact information, sales figures, and so forth). The database itself can be
considered a data set, as can bodies of data within it related to a particular type
of information, such as sales data for a particular corporate department.
The term data set originated with IBM, where its meaning was similar to that
of file. In an IBM mainframe operating system, a data set is a named collection
of data that contains individual data units organized (formatted) in a specific,
IBM-prescribed way and accessed by a specific access method based on the
data set organization. Types of data set organization include sequential, relative
sequential, indexed sequential, and partitioned. Access methods include the
Virtual Storage Access Method (VSAM) and the Indexed Sequential Access
Method (ISAM).

Types of data set

1 - Big data
2 - Structured, unstructured, semi-structured data
3 - Time-stamped data
4 - Machine data
5 - Spatiotemporal data
6 - Open data
7 - Dark data
8 - Real time data
9 - Genomics data
10 - Operational data
11 - High-dimensional data
12 - Unverified outdated data
13 - Translytic Data

Description of the data set

The data set that we have been provided to work on consists of data from the
Indian general elections of 2014 & 2019.

Parameters provided are:-

State, City, Candidate name, Political Party name, EVM votes, total votes and
percentage of votes from that particular city.

STATES
Uttar Pradesh 16%
Maharashtra 11%
Tamil Nadu 11%
Bihar 8%

Party
Independent 37%
BSP 6%
INC 5%
Other 57%

State = 34 unique values
Rank = 1 to 43
PC (city) = 508 unique values

Why analyse this data set:-

A large amount of data is generated daily during Indian elections. It is very
important to analyse the data because it helps in gaining insights and
supports a better decision-making process. We will be able to define the
problems in the data set and find the most appropriate solutions to them.

Business understanding

The problem in this data set is to analyze and find the voting pattern or
behavior of people from different states as they vote for a political party.
To find this out, we can use various tools such as Hive, Sqoop, Pig and an
RDBMS.

The basic objective is to draw patterns from the data and to reach conclusions
such as which state had the maximum voting.
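
The objective above can be sketched in a few lines of Python. This is only a minimal illustration: the row layout, the party and constituency labels, and the vote figures are all hypothetical and are not taken from the actual data set.

```python
from collections import defaultdict

# Hypothetical rows: (state, constituency, party, total_votes).
# The figures are invented for illustration only.
rows = [
    ("Uttar Pradesh", "PC-A", "Party X", 500_000),
    ("Uttar Pradesh", "PC-A", "Party Y", 300_000),
    ("Bihar", "PC-B", "Party X", 200_000),
    ("Bihar", "PC-B", "Party Y", 250_000),
    ("Tamil Nadu", "PC-C", "Party Z", 400_000),
]


def votes_by_state(rows):
    """Sum the total votes cast in each state."""
    totals = defaultdict(int)
    for state, _pc, _party, votes in rows:
        totals[state] += votes
    return dict(totals)


def state_with_max_votes(rows):
    """Return the state with the largest vote total."""
    totals = votes_by_state(rows)
    return max(totals, key=totals.get)


top_state = state_with_max_votes(rows)
print(top_state, votes_by_state(rows)[top_state])
```

On the real data the same grouping would run over every row of the 2014 and 2019 files rather than a hand-written list.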

Analytical approach

Hive provides a SQL-like declarative language, called HiveQL, which is used for
expressing queries. Using HiveQL, users familiar with SQL are able to perform
data analysis very easily.

Apache Hive is the tool which we are going to use for the analysis. Apache
Hive helps with querying and managing large data sets. It is often used as an
ETL tool in the Hadoop ecosystem. It is a data warehouse framework for querying
and analysis of data that is stored in HDFS. Hive is open-source software that
lets programmers analyze large data sets on Hadoop.
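
To give a feel for the SQL-like style of query, here is a minimal sketch that counts seats won per party. It uses Python's built-in sqlite3 module purely as a local stand-in for Hive, so it is not the assignment's actual setup. The table name `results`, the column names and the rows are hypothetical. The SELECT statement uses only standard SQL clauses and would be expected to carry over to HiveQL with little or no change, but it has not been run against Hive here.

```python
import sqlite3

# SQLite is only a local stand-in for Hive; the table name, column names
# and rows below are hypothetical and invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (state TEXT, pc TEXT, party TEXT, "
    "candidate_rank INTEGER, total_votes INTEGER)"
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
    [
        ("State A", "PC-1", "Party X", 1, 500000),
        ("State A", "PC-1", "Party Y", 2, 300000),
        ("State A", "PC-2", "Party Y", 1, 260000),
        ("State A", "PC-2", "Party X", 2, 240000),
        ("State B", "PC-3", "Party X", 1, 410000),
        ("State B", "PC-3", "Party Z", 2, 150000),
    ],
)

# Seats won per party: count the rank-1 candidate of each constituency.
QUERY = """
SELECT party, COUNT(*) AS seats
FROM results
WHERE candidate_rank = 1
GROUP BY party
ORDER BY seats DESC, party
"""
seats = conn.execute(QUERY).fetchall()
print(seats)
```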

The size of data sets being collected and analyzed in the industry for business
intelligence is growing, which makes traditional data warehousing solutions
more expensive.

The blueprint is as below:

• Comparison of vote share by party.
• Comparison of margin of winning for different parties.
• Scatter plot and density plot.
• Comparison of results with the 2014 election results.
• Plot of the correlation matrix.
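
Two of the blueprint items, vote share by party and margin of winning, reduce to simple arithmetic, which the following sketch illustrates. It assumes rows of the form (constituency, party, total votes). The labels and figures are hypothetical, and margin is taken here to mean the winner's votes minus the runner-up's votes, which is an assumption about the intended definition.

```python
from collections import defaultdict

# Hypothetical rows: (constituency, party, total_votes); figures are invented.
rows = [
    ("PC-1", "Party X", 500_000),
    ("PC-1", "Party Y", 300_000),
    ("PC-1", "Party Z", 200_000),
    ("PC-2", "Party X", 240_000),
    ("PC-2", "Party Y", 260_000),
]


def vote_share_by_party(rows):
    """Return each party's share of all votes cast, as a percentage."""
    totals = defaultdict(int)
    for _pc, party, votes in rows:
        totals[party] += votes
    grand_total = sum(totals.values())
    return {party: 100 * v / grand_total for party, v in totals.items()}


def winning_margins(rows):
    """Return {constituency: (winning party, margin over the runner-up)}."""
    by_pc = defaultdict(list)
    for pc, party, votes in rows:
        by_pc[pc].append((votes, party))
    margins = {}
    for pc, results in by_pc.items():
        results.sort(reverse=True)  # highest vote count first
        winner_votes, winner = results[0]
        # Assumed definition: an unopposed winner's margin is all their votes.
        runner_up_votes = results[1][0] if len(results) > 1 else 0
        margins[pc] = (winner, winner_votes - runner_up_votes)
    return margins


shares = vote_share_by_party(rows)
margins = winning_margins(rows)
print(shares)
print(margins)
```

Running the same two functions separately on the 2014 and 2019 rows would give the figures needed for the year-on-year comparison.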

Submitted by:-

Harshit Sukhija:- 18MBA7118

Prerna Sharma:- 18MBA7106

Palak Sharma:- 18MBA7108

Lakshay Sharma:- 18MBA7081
