Lahore School of Economics Data Analysis and Statistical Methods Winter 2020
Assignment 2
Q1: Explain different sources of Big Data.
Transactional data: The main purpose of a Transaction Processing System is to capture
information and update the data used for operational decisions in an organization. There
are two ways to process transactions: batch processing, which collects transactions over a
period of time and processes them as a single unit, and real-time processing, where data
are processed immediately.
Social media data: People in almost every part of the world share information through
social media, which helps customers make purchasing decisions by looking at the feedback,
customer complaints and miscellaneous services associated with a product. Consumer
sentiment is also expressed on social media, which helps companies make production
decisions.
Internet applications: There are numerous online e-commerce websites (such as Amazon,
Flipkart, Alibaba, eBay, Paytm, bookmyshow.com etc.), search engines (Google, Yahoo,
Bing, etc.) and online banking applications that millions of users log in to and use daily.
Their searches and transactions generate clickstreams and logs that can be of value.
Data from electronic instruments: Numerous electronic devices such as smartphones,
RFID tags, GPS sensors, machines connected to networks, scanners and cameras
generate high volumes of data. These are further sources of big data.
Business analytics can be classified into 3 categories based on the purpose of use – descriptive,
predictive and prescriptive.
Descriptive analytics explains a phenomenon from past data through reports and dashboards,
which helps in understanding what has happened.
Predictive analytics helps us to understand what can happen. It supports predictions based
on past data, correlations between variables and patterns.
Prescriptive analytics helps to understand the outcomes under different scenarios. It
uses tools such as optimization, simulation and what-if analysis, in which the set of
input parameters is changed.
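The three categories can be contrasted with a small Python sketch. The sales figures, price, unit cost and candidate production levels are all assumptions made up for illustration. The least-squares line and the simple profit rule are only one possible choice of technique for each category.

```python
# Hypothetical monthly sales figures, invented for illustration only.
months = [1, 2, 3, 4, 5]
sales = [10, 12, 14, 16, 18]

# Descriptive: summarize what has happened.
average_sales = sum(sales) / len(sales)

# Predictive: fit a least-squares line to past data and extrapolate.
mean_x = sum(months) / len(months)
slope = (
    sum((x - mean_x) * (y - average_sales) for x, y in zip(months, sales))
    / sum((x - mean_x) ** 2 for x in months)
)
intercept = average_sales - slope * mean_x
forecast = intercept + slope * 6  # expected sales in month 6

# Prescriptive: what-if analysis over candidate production levels.
PRICE, UNIT_COST = 5, 3  # assumed values, not taken from any real data


def profit(production, demand):
    """Revenue on units actually sold minus the cost of all units produced."""
    return PRICE * min(production, demand) - UNIT_COST * production


scenarios = {level: profit(level, forecast) for level in (15, 20, 25)}
best_level = max(scenarios, key=scenarios.get)
```

Here the descriptive step reports an average, the predictive step produces a forecast for the next month, and the prescriptive step compares scenarios to recommend a production level.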
Q4: Explore the “Predict Pakistan Elections 2018” dataset retrieved from
(https://www.kaggle.com/zusmani/predict-pakistan-elections-2018/kernels). Explain the
context, datatype, time regime, variable information, metadata etc. Discuss a few questions
that are already answered (Hint: Kernel activity “Voter Behavior and Voting Reasons”
and “2002/2008/2013 Elections Visualizations”) and what can be explored further from it.
We predict a historic voter turnout of 57-61% in this election. Historically, the average
turnout since 1977 has been 45% (lowest 35% in 1997, highest 55% in 1977, and 53% in the last
elections). Pakistan ranked 164th out of 169 nations in voter turnout, with Australia first
at 94.5%.
Voter participation in the country is very diverse: historically, Musakhel and Kohlu have
yielded less than 25%, whereas Layyah and Khanewal have yielded more than 60%, with everything
else in between. Punjab has the highest and Balochistan the lowest voter turnout.
The contest will bring 3,675 candidates for 272 National Assembly seats, that is, about 13.5
candidates per seat on average. PTI has fielded 244 candidates (the highest number of any
political party). Islamabad will see 76 candidates contesting just 3 seats, fighting to rule
the capital, which guarantees a psychological edge.
There are quite a few interesting facts about these elections; for example, we will see the
highest number of Lotas (candidates who often change their party affiliation) ever. PTI
believes it will win the election no matter what, while the survey pundits predict a PML(N)
lead of at least 13% over PTI.
The history of elections goes hand in hand with charges of corruption, voter fraud, ghost
votes, interference by the deep state, and violence. There is (almost) no country in the world
without fear or accusations of such incidents in its elections.
We are releasing the complete National Assembly election results dataset for the 2002, 2008
and 2013 elections in CSV files for the public, and calling on all data scientists,
international observers and journalists out there to help us achieve our inspirations.
Time Regime: The data collected are in a panel format and hold information for the period
2013 to 2018. The dataset covers election results for the National Assembly of Pakistan for
2002, 2008 and 2013.
Variables: The file contains Seat, Constituency, Candidates Name, Party Affiliation, Votes,
Total Valid Votes, Total Rejected Votes, Total Votes, Total Registered Voters and Turnout
variables for each seat.
Metadata: This data covers different aspects of Pakistan’s election schedule. Canada, the
United States, Pakistan and India are contributors to this data.
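As a rough illustration of how the variables listed above relate to each other, the following Python sketch computes turnout as Total Votes divided by Total Registered Voters and finds the candidate with the most votes in each seat. The column headers are assumed to match the variable names given above, which may differ from the actual CSV files, and the rows (seat codes, candidate names, party names and vote counts) are placeholders invented for illustration, not real election results.

```python
import csv
import io

# Placeholder rows, NOT real election data. Column names are assumptions
# based on the variable list in the text.
SAMPLE_CSV = """\
Seat,Candidates Name,Party Affiliation,Votes,Total Votes,Total Registered Voters
NA-X1,Candidate A,Party P,60000,100000,200000
NA-X1,Candidate B,Party Q,40000,100000,200000
NA-X2,Candidate C,Party Q,30000,50000,125000
NA-X2,Candidate D,Party P,20000,50000,125000
"""


def summarize(csv_text):
    """Return turnout percentage and leading candidate for each seat."""
    seats = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        votes = int(row["Votes"])
        # The first row seen for a seat initializes its summary.
        info = seats.setdefault(row["Seat"], {
            "turnout_pct": 100 * int(row["Total Votes"])
                           / int(row["Total Registered Voters"]),
            "winner": row["Candidates Name"],
            "winner_votes": votes,
        })
        # Later rows replace the leader if they have more votes.
        if votes > info["winner_votes"]:
            info["winner"] = row["Candidates Name"]
            info["winner_votes"] = votes
    return seats


summary = summarize(SAMPLE_CSV)
```

With the real files, the same idea could be used to check the reported Turnout column against the ratio of total votes to registered voters.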
This dataset enables people to explore local businesses in Pakistan. It might help the local
community gather information about local businesses. It also contributes to the local economic
development of Pakistan by bridging traders and manufacturers.
Geography: Pakistan
Dataset: The dataset contains information on approximately 67,000 businesses in Pakistan
(~5,000 in each CSV file)
Features: The dataset has a total of 7 columns
• Business Name
• Contact Name
• Telephone
• Website
• Address
• City
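Because the businesses are spread over several CSV files, a first exploration step would be to combine the files and count businesses per city. The Python sketch below assumes the files share the same header. The two in-memory "files" and their rows are hypothetical placeholders, and only two of the listed columns are used to keep the example short.

```python
import csv
import io
from collections import Counter

# Hypothetical contents standing in for two of the CSV files.
FILE_1 = """\
Business Name,City
Example Traders,Lahore
Sample Mills,Karachi
"""
FILE_2 = """\
Business Name,City
Demo Textiles,Lahore
"""


def load_all(csv_texts):
    """Read every CSV text and return all rows as one list of dicts."""
    rows = []
    for text in csv_texts:
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows


businesses = load_all([FILE_1, FILE_2])
# Count how many businesses appear in each city.
per_city = Counter(row["City"] for row in businesses)
```

With the actual files on disk, each `io.StringIO(text)` would be replaced by an opened file handle.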