Big Data Report

Big data analysis involves summarizing large datasets. Tools like Python, Pandas, Matplotlib and PowerBI are used. Key issues include null/missing values in datasets which can bias results. Visualizations show sales by location, product type, country and employee type to help understand business metrics like inventory levels, marketing needs and staffing. Addressing data quality issues and generating insightful visualizations allows effective big data analysis.

Uploaded by

shahab qureshi

BIG DATA ANALYSIS

Introduction:
Big Data opens up new and advanced opportunities for organizations that can combine and analyze industry data. Such data sets hold unique and innovative information about specific customers and the offers linked to them. This report first discusses the background of big data, then describes the tools and techniques we used to perform the data analysis, and finally presents the visualizations for the scenario. We have tried to keep the report readable for a newcomer to the subject [1].

Background Information:

History of Big Data:

Most of the data that can be accessed easily today was created in recent years, yet the term Big Data was coined some time ago. Big Data is now in use almost anywhere data is available. In 2005 Roger Mougalas of O'Reilly Media coined the term Big Data, only a year after the term Web 2.0 was created. It refers to a large set of data that is almost impossible to manage and process using traditional business intelligence tools.

2005 was also the year Hadoop was created at Yahoo, building on Google's MapReduce. Its main aim was to index the entire World Wide Web, and today the open-source Hadoop is used by many organizations to crunch through immense amounts of data.

As more social networks appear and Web 2.0 takes off, more and more data is created. Innovative startups are gradually starting to dig into this tremendous amount of data, and governments are starting Big Data projects as well. One such government project stores its data in the largest biometric database in the world [1].

One widely quoted remark compared the amount of information created by the entire world between the dawn of civilization and 2003 with the present day, when the same amount is created at regular, short intervals.

A report on the next frontier for innovation, competition, and productivity states that within a few years the USA alone will face a shortage of 120,000 to 170,000 data analysts, as well as on the order of a million data managers. In the past few years there has been a huge increase in Big Data startups, all trying to manage Big Data and to help organizations understand it, and an ever-growing number of companies are steadily adopting and moving towards Big Data. Nevertheless, although Big Data seems to have been around for a long time already, it is in fact only about as far along as the web was many years ago. The large Big Data revolution is still ahead of us, so a lot will change in the coming years [2]. Let the Big Data era begin!

Packages used:

Python for programming:
Python is a general-purpose, interpreted, interactive, object-oriented, high-level programming language. Python source code is freely available under an open-source license that is compatible with the GNU General Public License (GPL).

Pandas:
Pandas is a widely used open-source library for Python that offers diverse data structures that are easy to arrange and use. Together, Python and Pandas are used in a wide variety of fields, from simple to complex business settings [3].

Matplotlib:
matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure, for example by drawing several lines in a plot.

Matplotlib is also a large library, and getting a plot to look just right is often achieved through trial and error. Using one-liners to create basic plots in matplotlib is fairly simple, but skillfully commanding the remaining 98% of the library can be daunting. A beginner-to-intermediate approach to matplotlib therefore mixes theory with examples. While learning by example can be tremendously insightful, it also helps to have at least a surface-level understanding of the library's inner workings and layout.

PowerBI:
Power BI is a collection of software services, apps, and connectors that let you create, share, and consume business insights in the way that serves you and your business most effectively. We use this tool for visualization purposes.

Issues:
NaN/Missing Values in Datasets:
Missing data (or missing values) is defined as a data value that is not stored for a variable in the observation of interest. The problem of missing data is relatively common in almost all research and can have a significant effect on the conclusions that can be drawn from the data [1]. Accordingly, some studies have focused on handling missing data, the problems caused by missing data, and the methods to avoid or minimize it in medical research.

Missing data present various problems. First, the absence of data reduces statistical power, which refers to the probability that a test will reject the null hypothesis when it is false. Second, lost data can cause bias in the estimation of parameters. Third, it can reduce the representativeness of the samples. Fourth, it may complicate the analysis of the study. Each of these distortions may threaten the validity of the trials and can lead to invalid conclusions. NaN values in a dataset usually indicate ambiguities in that dataset, and the screenshots below show such ambiguities. When processing this kind of data, the null values should be filled: for instance, a null in a numeric column should become zero, and a null in a string column should become some default word [4].

Discussion & Analytics

This visualization shows the difference between the actual product count and the expected count. For a business to run successfully, the difference between these two values should not be less than 70%. If a customer wants to purchase an item and that item is not available, it causes trouble and leaves a bad impression on the client.

Some of the data was also best presented as tables.

Figure 1: Null data count for each sheet
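The filling rule described under Issues (zero for numeric nulls, a default word for text nulls) can be sketched with pandas roughly as follows. This is a minimal illustration, not the code used for the report: the helper name fill_missing, the default word "Unknown", and the sample columns are all assumptions.

```python
import pandas as pd


def fill_missing(df: pd.DataFrame, default_word: str = "Unknown") -> pd.DataFrame:
    """Return a copy with numeric nulls set to 0 and all other nulls set to default_word."""
    filled = df.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            # Numeric column: a missing number becomes zero
            filled[col] = filled[col].fillna(0)
        else:
            # Non-numeric column (treated as text here): use the default word
            filled[col] = filled[col].fillna(default_word)
    return filled


# Hypothetical sheet with one numeric and one text column
sheet = pd.DataFrame({"quantity": [5, None, 3], "city": ["Sao Paulo", None, "Lima"]})

nulls_before = int(sheet.isnull().sum().sum())
cleaned = fill_missing(sheet)
nulls_after = int(cleaned.isnull().sum().sum())
```

Filling with zero is only one possible choice. It keeps totals computable, but it can also bias averages, which is the kind of distortion the Issues section warns about.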


This visualization shows the ratio of customers by area. It helps the organization manage its stock and decide where advertising would be appropriate.

Location visualization
This visualization shows the proportion of products sold in each location. It shows that 36.52% of products are sold in Sao Paulo, and the shares of the other locations can be read from the same visualization.

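A percentage share per location, like the Sao Paulo figure quoted in the location visualization, could be computed along these lines. The column names location and quantity and the sample rows are hypothetical, since the workbook's actual layout is not shown in the report.

```python
import pandas as pd

# Hypothetical sales records; the real workbook's columns may be named differently
sales = pd.DataFrame(
    {
        "location": ["Sao Paulo", "Rio", "Sao Paulo", "Lima"],
        "quantity": [30, 20, 20, 30],
    }
)

# Total units sold per location
units_by_location = sales.groupby("location")["quantity"].sum()

# Each location's share of all units sold, as a percentage
share_pct = (units_by_location / units_by_location.sum() * 100).round(2)

# Location with the largest share
top_location = share_pct.idxmax()
```

The resulting share_pct series is the kind of data a pie chart or stacked bar would be drawn from.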
This visualization shows the ratio between the total quantities of each product type. It helps the organization analyze the types of products it is selling.

This visualization shows the number of employees by type. It makes the recruitment process for new staff easier for the organization to plan. Salaried employees cost more than hourly employees, because bonus policies and similar benefits apply only to salaried employees.

This visualization shows total sales by country. It tells us that China has the most buyers of the products.

This graph helps the organization see how many products were sold on each day. To increase sales, the organization could offer promotions or coupons to raise sales on the other days.

This visualization shows the year in which most of the employees were recruited. It helps the organization track the share of employees recruited each year, which gives it an advantage when estimating the total amount of bonuses and similar costs. This is a line graph, the best type of graph for showing a trend in data. It shows that the highest number of employees was recruited between 2014 and 2015.
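The recruitment-by-year line graph described in this section could be produced with matplotlib roughly as shown here. This is a sketch under assumptions: the hire_date column and the dates are made up, and the report's own charts may have been built in Power BI instead.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical hire dates; the column name is an assumption
employees = pd.DataFrame(
    {
        "hire_date": pd.to_datetime(
            [
                "2013-05-01",
                "2014-02-10",
                "2014-09-30",
                "2015-01-15",
                "2015-07-04",
                "2016-03-20",
            ]
        )
    }
)

# Count employees per hiring year, ordered chronologically
hires_per_year = employees["hire_date"].dt.year.value_counts().sort_index()

# Draw the trend as a line graph with one marker per year
fig, ax = plt.subplots()
ax.plot(hires_per_year.index.tolist(), hires_per_year.tolist(), marker="o")
ax.set_xticks(hires_per_year.index.tolist())
ax.set_xlabel("Year hired")
ax.set_ylabel("Employees recruited")
ax.set_title("Recruitment by year")
```

Calling fig.savefig with a file name would write the chart to disk, and plt.show() would display it in an interactive session.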

This graph shows the names of the products that are finished. As we can see, these are large in quantity, so the graph helps the organization track the finished products that need to be bought from the vendors.

Conclusion

With the help of the Python code, information was obtained for identifying areas for improvement in finance and operations. The code helps to address data issues and to find errors. For the case study, the data presented in Fictional Project was used to achieve the objective of this report.

References
[1] Chen, M., Mao, S., & Liu, Y. Big data: A survey. Mobile Networks and Applications, 19(2), 171-209, 2014.

[2] Zikopoulos, P., & Eaton, C. Understanding big data: Analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media, 2011.

[3] van Rossum, G., & de Boer, J. Interactively testing remote servers using the Python programming language. CWI Quarterly, 4(4), 283-303, 1991.

[4] Sanner, M. F. Python: a programming language for software integration and development. J Mol Graph Model, 17(1), 57-61, 1999.

Appendix

Python Code:
The code was written only to check which values are missing and to inspect the data in the sheets.

In the code, we take each sheet's values, apply the null-counting formula to get the desired result, and repeat this separately for each sheet.

import pandas as pd
import numpy as np

# Open the workbook once so that every sheet is read from the same file
xls = pd.ExcelFile('2208.xlsx')

# Load each sheet into its own DataFrame
df1 = pd.read_excel(xls, 'Client')
df2 = pd.read_excel(xls, 'Worker')
df3 = pd.read_excel(xls, 'Completed Goods On-Hand')
df4 = pd.read_excel(xls, 'Recipes')
df5 = pd.read_excel(xls, 'Stock Count')
df6 = pd.read_excel(xls, 'Stock Adjustments')
df7 = pd.read_excel(xls, 'Receipt')
df8 = pd.read_excel(xls, 'Area')
df9 = pd.read_excel(xls, 'Clinical Plan')
df10 = pd.read_excel(xls, 'Installment')
df11 = pd.read_excel(xls, 'Finance')
df12 = pd.read_excel(xls, 'Buy Order')

# Print the total number of null cells in each sheet
print(df1.isnull().sum().sum())
print(df2.isnull().sum().sum())
print(df3.isnull().sum().sum())
print(df4.isnull().sum().sum())
print(df5.isnull().sum().sum())
print(df6.isnull().sum().sum())
print(df7.isnull().sum().sum())
print(df8.isnull().sum().sum())
print(df9.isnull().sum().sum())
print(df10.isnull().sum().sum())
print(df11.isnull().sum().sum())
print(df12.isnull().sum().sum())
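The same per-sheet null count can be written more compactly as a loop over a dictionary of sheets. The sketch below is an alternative under assumptions, not the report's code: the helper null_counts is hypothetical, and small made-up DataFrames stand in for the workbook so that the example runs without the Excel file.

```python
import pandas as pd


def null_counts(sheets: dict) -> dict:
    """Map each sheet name to its total number of null cells."""
    return {name: int(df.isnull().sum().sum()) for name, df in sheets.items()}


# Stand-in for pd.read_excel('2208.xlsx', sheet_name=None), which returns
# a dict of DataFrames keyed by sheet name. The data below is made up.
sheets = {
    "Client": pd.DataFrame({"name": ["Ana", None], "age": [31, None]}),
    "Worker": pd.DataFrame({"name": ["Bo", "Cy"], "type": ["Hourly", None]}),
}

counts = null_counts(sheets)
for name, n in counts.items():
    print(f"{name}: {n}")
```

Keeping the counts in a dictionary keyed by sheet name makes it straightforward to build a table or bar chart like Figure 1 from them.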
