Big Data Assignment Revised
A data set is a collection of related, discrete items of data that may be
accessed individually, in combination, or managed as a whole entity.
A data set is organized into some type of data structure. In a database, for
example, a data set might contain a collection of business data (names, salaries,
contact information, sales figures, and so forth). The database itself can be
considered a data set, as can bodies of data within it related to a particular type
of information, such as sales data for a particular corporate department.
The term data set originated with IBM, where its meaning was similar to that
of a file. In an IBM mainframe operating system, a data set is a named collection
of data that contains individual data units organized (formatted) in a specific,
IBM-prescribed way and accessed by a specific access method based on the
data set organization. Types of data set organization include sequential, relative
sequential, indexed sequential, and partitioned. Access methods include the
Virtual Storage Access Method (VSAM) and the Indexed Sequential Access
Method (ISAM).
1 - Big data
2 - Structured, unstructured, semi-structured data
3 - Time-stamped data
4 - Machine data
5 - Spatiotemporal data
6 - Open data
7 - Dark data
8 - Real-time data
9 - Genomics data
10 - Operational data
11 - High-dimensional data
12 - Unverified outdated data
13 - Translytic data
The data set that we are provided to work on consists of data from the Indian
elections of 2014 and 2019. Its fields are the state, city, candidate name,
political party name, EVM (electronic voting machine) votes, total votes, and
the percentage of votes from that particular city.
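The fields listed above could be represented as one typed record per row. The
following is only a minimal sketch: the field names, the parse_row helper, and
the sample values are all hypothetical placeholders, not the actual column
names or contents of the provided file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElectionRecord:
    """One row of the election data set (field names are hypothetical)."""
    state: str
    city: str
    candidate: str
    party: str
    evm_votes: int
    total_votes: int
    vote_share_pct: float


def parse_row(row: dict) -> ElectionRecord:
    """Convert a dict of raw strings (e.g. from csv.DictReader) into a typed record."""
    return ElectionRecord(
        state=row["state"].strip(),
        city=row["city"].strip(),
        candidate=row["candidate"].strip(),
        party=row["party"].strip(),
        evm_votes=int(row["evm_votes"]),
        total_votes=int(row["total_votes"]),
        vote_share_pct=float(row["vote_share_pct"]),
    )


# Placeholder values for illustration only, not taken from the real data.
sample = parse_row({
    "state": " Example State ",
    "city": "Example City",
    "candidate": "Candidate A",
    "party": "Party X",
    "evm_votes": "1200",
    "total_votes": "1250",
    "vote_share_pct": "41.5",
})
print(sample)
```

Converting the vote counts to integers at load time means later totals and
comparisons work on numbers rather than on text.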
STATES
Uttar Pradesh 16%
Maharashtra 11%
Tamil Nadu 11%
Bihar 8%
Party
Independent 37%
BSP 6%
INC 5%
Other 57%
The problem posed by this data set is to analyze the voting patterns or behavior
of people from different states when they vote for a political party. To find
this out we can use tools such as Hive, Sqoop, Pig, and an RDBMS.
The basic objective is to draw patterns from the data and to reach conclusions
about which state has the maximum number of votes.
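The objective of finding the state with the maximum number of votes amounts to
grouping the rows by state and summing the votes. A minimal sketch follows,
assuming each row has already been reduced to a (state, votes) pair; the
function names and the sample rows are hypothetical placeholders.

```python
from collections import defaultdict


def votes_by_state(rows):
    """Sum the vote counts for each state from (state, votes) pairs."""
    totals = defaultdict(int)
    for state, votes in rows:
        totals[state] += votes
    return dict(totals)


def state_with_most_votes(rows):
    """Return the state whose summed votes are highest."""
    totals = votes_by_state(rows)
    return max(totals, key=totals.get)


# Placeholder rows for illustration only, not real election figures.
rows = [("State A", 100), ("State B", 250), ("State A", 200)]

totals = votes_by_state(rows)
top_state = state_with_most_votes(rows)
print(totals)
print(top_state)
```

The same grouping step can be repeated with the party name as the key to
compare parties instead of states.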
Analytical approach
Hive provides an SQL-like declarative language, called HiveQL, which is used for
expressing queries. Using HiveQL, users familiar with SQL are able to perform
data analysis very easily.
Apache Hive is the tool which we are going to use for the analysis. Apache
Hive helps with querying and managing large data sets quickly. It serves as an
ETL tool for the Hadoop ecosystem. It is a data warehouse framework for querying
and analysis of data that is stored in HDFS. Hive is open-source software that
lets programmers analyze large data sets on Hadoop.
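To give a feel for the SQL-like declarative style, the sketch below runs a
GROUP BY query from Python. It uses SQLite (through Python's built-in sqlite3
module) purely as a stand-in, since Hive itself needs a Hadoop cluster; a
query of this shape is the kind of statement one would typically write in
HiveQL. The table name, column names, and rows are hypothetical placeholders.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE election (state TEXT, party TEXT, total_votes INTEGER)"
)

# Placeholder rows for illustration only, not real election figures.
conn.executemany(
    "INSERT INTO election VALUES (?, ?, ?)",
    [
        ("State A", "Party X", 120),
        ("State A", "Party Y", 300),
        ("State B", "Party X", 50),
    ],
)

# Declarative query: total votes per party, highest first.
query = """
SELECT party, SUM(total_votes) AS votes
FROM election
GROUP BY party
ORDER BY votes DESC
"""
result = conn.execute(query).fetchall()
print(result)
conn.close()
```

The query states what result is wanted (votes per party, sorted) and leaves
the execution plan to the engine, which is the main appeal of a declarative
language for analysts who already know SQL.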
The size of the data sets being collected and analyzed in the industry for
business intelligence is growing, which is making traditional data warehousing
solutions more expensive.
Submitted by:-