
A Complete Tutorial To Learn Data Science With Python From Scratch PDF

This document provides an introduction and table of contents for a tutorial on learning data science with Python from scratch. The introduction discusses the author's motivation for creating Python resources due to a lack of available guides. The table of contents outlines 5 sections that will be covered: 1) Basics of Python for data analysis, 2) Python libraries and data structures, 3) Exploratory analysis in Python using Pandas, 4) Data munging in Python using Pandas, and 5) Building predictive models in Python. Section 1 discusses why Python is useful for data science and how to install Python and run basic programs.

3/6/2016

A Complete Tutorial to Learn Data Science with Python from Scratch

Introduction
It happened a few years back. After working on SAS for more than 5 years, I decided to move out of my comfort zone. Being a data scientist, my hunt for other useful tools was ON! Fortunately, it didn't take me long to decide: Python was my appetizer.
I always had an inclination towards coding. This was the time to do what I really loved. Code. Turned out, coding was so easy!
I learned the basics of Python within a week. And, since then, I've not only explored this language in depth, but have also helped many others to learn it. Python was originally a general-purpose language. But, over the years, with strong community support, this language got dedicated libraries for data analysis and predictive modeling.
Due to a lack of resources on Python for data science, I decided to create this tutorial to help many others learn Python faster. In this tutorial, we will take bite-sized information about how to use Python for data analysis, chew it till we are comfortable and practice it at our own end.

Table of Contents

http://www.analyticsvidhya.com/blog/2016/01/completetutoriallearndatasciencepythonscratch2/

1/29


1. Basics of Python for Data Analysis
    Why learn Python for data analysis?
    Python 2.7 v/s 3.4
    How to install Python?
    Running a few simple programs in Python
2. Python libraries and data structures
    Python Data Structures
    Python Iteration and Conditional Constructs
    Python Libraries
3. Exploratory analysis in Python using Pandas
    Introduction to series and dataframes
    Analytics Vidhya dataset: Loan Prediction Problem
4. Data Munging in Python using Pandas
5. Building a Predictive Model in Python
    Logistic Regression
    Decision Tree
    Random Forest

Let's get started!

1. Basics of Python for Data Analysis
Why learn Python for data analysis?
Python has gathered a lot of interest recently as a choice of language for data analysis. I had compared it against SAS & R some time back. Here are some reasons which go in favour of learning Python:


Open source, free to install
Awesome online community
Very easy to learn
Can become a common language for data science and production of web-based analytics products.

Needless to say, it still has a few drawbacks too:
It is an interpreted language rather than a compiled language, hence it might take up more CPU time. However, given the savings in programmer time (due to ease of learning), it might still be a good choice.

Python 2.7 v/s 3.4
This is one of the most debated topics in Python. You will invariably cross paths with it, especially if you are a beginner. There is no right/wrong choice here. It totally depends on the situation and your needs. I will try to give you some pointers to help you make an informed choice.

Why Python 2.7?
1. Awesome community support! This is something you'd need in your early days. Python 2 was released in late 2000 and has been in use for more than 15 years.
2. Plethora of third-party libraries! Though many libraries have provided 3.x support, a large number of modules still work only on 2.x versions. If you plan to use Python for specific applications like web development with high reliance on external modules, you might be better off with 2.7.
3. Some of the features of 3.x versions are backward compatible and can work with version 2.7.

Why Python 3.4?
1. Cleaner and faster! Python developers have fixed some inherent glitches and minor drawbacks in order to set a stronger foundation for the future. These might not be very relevant initially, but will matter eventually.
2. It is the future! 2.7 is the last release for the 2.x family and eventually everyone has to shift to 3.x versions. Python 3 has released stable versions for the past 5 years and will continue to do so.

There is no clear winner, but I suppose the bottom line is that you should focus on learning Python as a language. Shifting between versions should just be a matter of time. Stay tuned for a dedicated article on Python 2.x vs 3.x in the near future!


How to install Python?
There are 2 approaches to install Python:
You can download Python directly from its project site and install the individual components and libraries you want.
Alternately, you can download and install a package, which comes with pre-installed libraries. I would recommend downloading Anaconda. Another option could be Enthought Canopy Express.

The second method provides a hassle-free installation and hence I'll recommend that to beginners. The limitation of this approach is that you have to wait for the entire package to be upgraded, even if you are interested in the latest version of a single library. It should not matter unless you are doing cutting-edge statistical research.

Choosing a development environment
Once you have installed Python, there are various options for choosing an environment. Here are the 3 most common options:
Terminal / Shell based
IDLE (default environment)
iPython notebook: similar to markdown in R


IDLE editor for Python
While the right environment depends on your need, I personally prefer iPython Notebooks a lot. It provides a lot of good features for documenting while writing the code itself, and you can choose to run the code in blocks (rather than line-by-line execution).
We will use the iPython environment for this complete tutorial.

Warming up: Running your first Python program
You can use Python as a simple calculator to start with:
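
As a minimal sketch of the calculator idea (the numbers are arbitrary examples, not taken from the original session), each line below could be typed into a notebook cell:

```python
# Python as a simple calculator (the operands are arbitrary examples).
total = 2 + 3          # addition
product = 4 * 5        # multiplication
power = 2 ** 10        # exponentiation
quotient = 7 // 2      # floor division
remainder = 7 % 2      # remainder after division

print(total)
print(product)
print(power)
print(quotient)
print(remainder)
```

Plain division with / is left out on purpose: for two integers it behaves differently in Python 2.7 (7 / 2 gives 3) and Python 3.x (7 / 2 gives 3.5), whereas // and % behave the same in both.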


Few things to note
You can start iPython notebook by writing "ipython notebook" on your terminal / cmd, depending on the OS you are working on.
You can name an iPython notebook by simply clicking on the name "Untitled0" in the above screenshot.
The interface shows In [*] for inputs and Out[*] for output.
You can execute code by pressing "Shift + Enter", or "ALT + Enter" if you want to insert an additional row after.

Before we deep dive into problem solving, let's take a step back and understand the basics of Python. As we know, data structures and iteration and conditional constructs form the crux of any language. In Python, these include lists, strings, tuples, dictionaries, for-loop, while-loop, if-else, etc. Let's take a look at some of these.

2. Python libraries and Data Structures
Python Data Structures
Following are some data structures which are used in Python. You should be familiar with them in order to use them as appropriate.


Lists: Lists are one of the most versatile data structures in Python. A list can simply be defined by writing a list of comma-separated values in square brackets. Lists might contain items of different types, but usually the items all have the same type. Python lists are mutable and individual elements of a list can be changed.

Here is a quick example to define a list and then access it:
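
A minimal sketch of defining, accessing and changing a list; the variable names and values are made up for illustration:

```python
# Define a list of comma-separated values in square brackets.
squares = [0, 1, 4, 9, 16, 25]   # hypothetical example data

first = squares[0]        # indexing starts at 0
last = squares[-1]        # negative indices count from the end
middle = squares[2:4]     # slicing returns a new list

# Lists are mutable: an individual element can be changed in place.
squares[5] = 36
squares.append(49)

print(first)
print(last)
print(middle)
print(squares)
```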

Strings: Strings can simply be defined by use of single ( ' ), double ( " ) or triple ( ''' ) inverted commas. Strings enclosed in triple quotes ( ''' ) can span over multiple lines and are used frequently in docstrings (Python's way of documenting functions). \ is used as an escape character. Please note that Python strings are immutable, so you cannot change part of a string.
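
The string rules above can be sketched as follows; all names and values are illustrative assumptions:

```python
greeting = 'Hello'                 # single quotes
name = "World"                     # double quotes
multi = '''Line one
Line two'''                        # triple quotes can span multiple lines
escaped = 'It\'s easy'             # the backslash escapes the inner quote

first_char = greeting[0]           # strings can be indexed like lists

# Strings are immutable, so item assignment raises a TypeError.
try:
    greeting[0] = 'J'
    changed = True
except TypeError:
    changed = False

print(first_char)
print(escaped)
print(changed)
```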


Tuples: A tuple is represented by a number of values separated by commas. Tuples are immutable and the output is surrounded by parentheses so that nested tuples are processed correctly. Additionally, even though tuples are immutable, they can hold mutable data if needed.

Since tuples are immutable and cannot change, they are generally faster to process than lists. Hence, if your list is unlikely to change, you should use tuples instead of lists.
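
A short sketch of these tuple properties, with made-up values:

```python
point = 3, 4                 # values separated by commas form a tuple
nested = (point, (5, 6))     # tuples can be nested

# Tuples are immutable: item assignment raises a TypeError.
try:
    point[0] = 10
    changed = True
except TypeError:
    changed = False

# A tuple can still hold a mutable object, such as a list.
holder = ([1, 2], 'fixed')
holder[0].append(3)          # the list inside the tuple is modified in place

print(point)
print(nested)
print(changed)
print(holder)
```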


Dictionary: A dictionary is an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary). A pair of braces creates an empty dictionary: {}.
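
A minimal sketch of creating and using a dictionary; the keys and values are hypothetical:

```python
extensions = {}                    # a pair of braces creates an empty dictionary
extensions['alice'] = 9073         # add key: value pairs
extensions['bob'] = 9128
extensions['alice'] = 9074         # keys are unique, so this overwrites the old value

names = sorted(extensions.keys())  # sorted, since the order of keys should not be relied on
alice_ext = extensions['alice']    # look up a value by its key
has_carol = 'carol' in extensions  # membership test on keys

print(names)
print(alice_ext)
print(has_carol)
```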

Python Iteration and Conditional Constructs
Like most languages, Python also has a FOR-loop, which is the most widely used method for iteration. It has a simple syntax:


for i in [Python Iterable]:
    expression(i)

Here "Python Iterable" can be a list, tuple or other advanced data structures which we will explore in later sections. Let's take a look at a simple example, determining the factorial of a number.

N = 5  # example value; any non-negative integer works
fact = 1
for i in range(1, N + 1):
    fact *= i

Coming to conditional statements, these are used to execute code fragments based on a condition. The most commonly used construct is if-else, with the following syntax:

if [condition]:
    __execution if true__
else:
    __execution if false__

For instance, if we want to print whether the number N is even or odd:

# print() with parentheses runs on both Python 2.7 and 3.x
if N % 2 == 0:
    print('Even')
else:
    print('Odd')

Now that you are familiar with Python fundamentals, let's take a step further. What if you have to perform the following tasks:
1. Multiply 2 matrices
2. Find the roots of a quadratic equation
3. Plot bar charts and histograms
4. Make statistical models
5. Access web pages

If you try to write code from scratch, it's going to be a nightmare and you won't stay on Python for more than 2 days! But let's not worry about that. Thankfully, there are many libraries with predefined functions which we can directly import into our code and make our life easy.
For example, consider the factorial example we just saw. We can do that in a single step as:

math.factorial(N)

Of course, we need to import the math library for that. Let's explore the various libraries next.

Python Libraries
Let's take one step ahead in our journey to learn Python by getting acquainted with some useful libraries. The first step is obviously to learn to import them into our environment. There are several ways of doing so in Python:

import math as m

from math import *

In the first manner, we have defined an alias m for the library math. We can now use various functions from the math library (e.g. factorial) by referencing them through the alias: m.factorial().
In the second manner, you have imported the entire namespace of math, i.e. you can directly use factorial() without referring to math.
Tip: Google recommends that you use the first style of importing libraries, as you will know where the functions have come from.
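
A small sketch comparing the import styles on the factorial example. It uses `from math import factorial`, a narrower variant of `from math import *` that brings in a single name; the value of N is an arbitrary example:

```python
import math as m              # style 1: the alias keeps the origin of each function visible
from math import factorial    # narrower variant of style 2: imports one name directly

N = 5                         # example value
via_alias = m.factorial(N)
via_direct = factorial(N)

# The hand-written loop from the previous section gives the same result.
fact = 1
for i in range(1, N + 1):
    fact *= i

print(via_alias)
print(via_direct)
print(fact)
```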


Following is a list of libraries you will need for any scientific computations and data analysis:
NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low-level languages like Fortran, C and C++.
SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high-level science and engineering modules like discrete Fourier transform, linear algebra, optimization and sparse matrices.
Matplotlib for plotting a vast variety of graphs, from histograms to line plots to heat plots. You can use the Pylab feature in ipython notebook (ipython notebook --pylab=inline) to use these plotting features inline. If you ignore the inline option, then pylab converts the ipython environment to an environment very similar to Matlab. You can also use LaTeX commands to add math to your plot.
Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to Python and has been instrumental in boosting Python's usage in the data scientist community.
Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction.
Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator.
Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central part of exploring and understanding data.
Bokeh for creating interactive plots, dashboards and data applications on modern web browsers. It empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the capability of high-performance interactivity over very large or streaming datasets.
Blaze for extending the capability of NumPy and Pandas to distributed and streaming datasets. It can be used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective visualizations and dashboards on huge chunks of data.
Scrapy for web crawling. It is a very useful framework for getting specific patterns of data. It has the capability to start at a website home url and then dig through web pages within the website to gather information.
SymPy for symbolic computation. It has wide-ranging capabilities from basic symbolic arithmetic to calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability of formatting the result of the computations as LaTeX code.
Requests for accessing the web. It works similarly to the standard Python library urllib2 but is much easier to code. You will find subtle differences with urllib2, but for beginners Requests might be more convenient.

Additional libraries you might need:
os for operating system and file operations
networkx and igraph for graph-based data manipulations
regular expressions for finding patterns in text data
BeautifulSoup for scraping the web. It is inferior to Scrapy, as it will extract information from just a single web page in a run.
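
To make the library lists above more concrete, here is a rough sketch of how NumPy handles the first two tasks mentioned earlier (multiplying 2 matrices and finding the roots of a quadratic equation). The matrices and coefficients are arbitrary examples, and this is only one of several ways to do it:

```python
import numpy as np

# Task 1: multiply 2 matrices (example values).
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
product = np.dot(a, b)            # matrix product, not element-wise multiplication

# Task 2: find the roots of x^2 - 3x + 2 = 0.
# np.roots takes the polynomial coefficients, highest power first.
roots = np.roots([1, -3, 2])

print(product)
print(sorted(roots))
```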

Now that we are familiar with Python fundamentals and additional libraries, let's take a deep dive into problem solving through Python. Yes, I mean making a predictive model! In the process, we use some powerful libraries and also come across the next level of data structures. We will take you through the 3 key phases:
1. Data Exploration: finding out more about the data we have
2. Data Munging: cleaning the data and playing with it to make it better suit statistical modeling
3. Predictive Modeling: running the actual algorithms and having fun

3. Exploratory analysis in Python using Pandas
In order to explore our data further, let me introduce you to another animal (as if Python was not enough!): Pandas


Image Source: Wikipedia
Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird, but hang on!). It has been instrumental in increasing the use of Python in the data science community. We will now use Pandas to read a data set from an Analytics Vidhya competition, perform exploratory analysis and build our first basic categorization algorithm for solving this problem.
Before loading the data, let's understand the 2 key data structures in Pandas: Series and DataFrames

Introduction to Series and Dataframes
A Series can be understood as a 1-dimensional labelled / indexed array. You can access individual elements of this series through these labels.
A dataframe is similar to an Excel workbook: you have column names referring to columns and you have rows, which can be accessed with the use of row numbers. The essential difference is that column names and row numbers are known as the column and row index in the case of dataframes.
Series and dataframes form the core data model for Pandas in Python. The data sets are first read into these dataframes and then various operations (e.g. group by, aggregation etc.) can be applied very easily to their columns.
More: 10 Minutes to Pandas
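
A minimal sketch of the two structures, assuming pandas is installed. The labels, column names and values are invented for illustration and are not taken from the loan data set:

```python
import pandas as pd

# A Series: a 1-dimensional labelled array (labels and values are made up).
incomes = pd.Series([5000, 3000, 4000], index=['a', 'b', 'c'])
income_b = incomes['b']                 # access an element through its label

# A DataFrame: named columns plus a row index (0, 1, 2 by default).
frame = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'ApplicantIncome': [5000, 3000, 4000],
})
second_row = frame.loc[1]               # access a row through the row index
mean_by_gender = frame.groupby('Gender')['ApplicantIncome'].mean()  # group by + aggregation

print(income_b)
print(second_row['Gender'])
print(mean_by_gender)
```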


Practice data set: Loan Prediction Problem
You can download the dataset from here. Here is the description of variables:

VARIABLE DESCRIPTIONS:
Variable             Description
Loan_ID              Unique Loan ID
Gender               Male / Female
Married              Applicant married (Y/N)
Dependents           Number of dependents
Education            Applicant Education (Graduate / Under Graduate)
Self_Employed        Self employed (Y/N)
ApplicantIncome      Applicant income
CoapplicantIncome    Coapplicant income
LoanAmount           Loan amount in thousands
Loan_Amount_Term     Term of loan in months
Credit_History       Credit history meets guidelines
Property_Area        Urban / Semi Urban / Rural
Loan_Status          Loan approved (Y/N)

Let's begin with exploration
To begin, start the iPython interface in Inline Pylab mode by typing the following on your terminal / Windows command prompt:

ipython notebook --pylab=inline


This opens up iPython notebook in the pylab environment, which has a few useful libraries already imported. Also, you will be able to plot your data inline, which makes this a really good environment for interactive data analysis. You can check whether the environment has loaded correctly by typing the following command (and getting the output as seen in the figure below):

plot(arange(5))

I am currently working in Linux, and have stored the dataset in the following location:
/home/kunal/Downloads/Loan_Prediction/train.csv

Importing libraries and the data set:
Following are the libraries we will use during this tutorial:
numpy
matplotlib
pandas


Please note that you do not need to import matplotlib and numpy because of the Pylab environment. I have still kept them in the code, in case you use the code in a different environment.
After importing the libraries, you read the dataset using the function read_csv(). This is how the code looks at this stage:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # the pyplot submodule provides the plotting functions

# Reading the dataset into a dataframe using Pandas
df = pd.read_csv("/home/kunal/Downloads/Loan_Prediction/train.csv")

Quick Data Exploration
Once you have read the dataset, you can have a look at a few top rows by using the function head():

df.head(10)


This should print 10 rows. Alternately, you can also look at more rows by printing the dataset.
Next, you can look at a summary of the numerical fields by using the describe() function:

df.describe()


The describe() function provides count, mean, standard deviation (std), min, quartiles and max in its output (read this article to refresh basic statistics to understand population distribution).
Here are a few inferences you can draw by looking at the output of the describe() function:
1. LoanAmount has (614 - 592) 22 missing values.
2. Loan_Amount_Term has (614 - 600) 14 missing values.
3. Credit_History has (614 - 564) 50 missing values.
4. We can also see that about 84% of applicants have a credit history. How? The mean of the Credit_History field is 0.84 (remember, Credit_History has value 1 for those who have a credit history and 0 otherwise).
5. The ApplicantIncome distribution seems to be in line with expectation. Same with CoapplicantIncome.

Please note that we can get an idea of a possible skew in the data by comparing the mean to the median, i.e. the 50% figure.
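
As a rough sketch of the mean-versus-median check, assuming pandas is installed. The income values below are invented so that one extreme value dominates; they are not from the loan data set:

```python
import pandas as pd

# Made-up incomes with one extreme value pulling the mean upward.
income = pd.Series([2500, 3000, 3500, 4000, 81000])

mean_income = income.mean()
median_income = income.median()          # the same number describe() reports as 50%
summary = income.describe()

print(mean_income)
print(median_income)
print(mean_income > median_income)       # a mean well above the median hints at right skew
```

A mean far above the median suggests a long right tail, which is the pattern to look for in the describe() output of the income fields.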
For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can look at the frequency distribution to understand whether they make sense or not. The frequency table can be printed by the following command:

df['Property_Area'].value_counts()

Similarly, we can look at the unique values of credit history. Note that dfname['column_name'] is a basic indexing technique to access a particular column of the dataframe. It can be a list of columns as well. For more information, refer to the "10 Minutes to Pandas" resource shared above.
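
The column indexing described above can be sketched on a small stand-in dataframe. The rows below are invented for illustration, and only the column names follow the loan data set:

```python
import pandas as pd

# Stand-in for the loan data; the rows are invented for illustration.
df = pd.DataFrame({
    'Property_Area': ['Urban', 'Rural', 'Urban', 'Semiurban'],
    'Credit_History': [1.0, 0.0, 1.0, 1.0],
    'LoanAmount': [120, 66, 141, 95],
})

area_counts = df['Property_Area'].value_counts()    # frequency table of one column
credit_values = df['Credit_History'].unique()       # distinct values in a column
subset = df[['Property_Area', 'LoanAmount']]        # a list of columns returns a dataframe

print(area_counts)
print(credit_values)
print(subset)
```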

Distribution analysis
Now that we are familiar with basic data characteristics, let us study the distribution of various variables. Let us start with the numeric variables, namely ApplicantIncome and LoanAmount.
Let's start by plotting the histogram of ApplicantIncome using the following command:


df['ApplicantIncome'].hist(bins=50)

Here we observe that there are a few extreme values. This is also the reason why 50 bins are required to depict the distribution clearly.
Next, we look at box plots to understand the distributions. The box plot for ApplicantIncome can be plotted by:

df.boxplot(column='ApplicantIncome')


This confirms the presence of a lot of outliers / extreme values. This can be attributed to the income disparity in society. Part of this can be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:

df.boxplot(column='ApplicantIncome', by='Education')

We can see that there is no substantial difference between the median income of graduates and non-graduates. But there are a higher number of graduates with very high incomes, which appear to be the outliers.
Now, let's look at the histogram and box plot of LoanAmount using the following commands:

df['LoanAmount'].hist(bins=50)

df.boxplot(column='LoanAmount')


Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some amount of data munging. LoanAmount has missing as well as extreme values, while ApplicantIncome has a few extreme values, which demand deeper understanding. We will take this up in the coming sections.

Categorical variable analysis
Now that we understand the distributions of ApplicantIncome and LoanAmount, let us understand the categorical variables in more detail. We will use Excel-style pivot tables and cross-tabulation. For instance, let us look at the chances of getting a loan based on credit history. This can be achieved in MS Excel using a pivot table as:


Note: here loan status has been coded as 1 for Yes and 0 for No. So the mean represents the probability of getting a loan.
Now we will look at the steps required to generate a similar insight using Python. Please refer to this article for getting a hang of the different data manipulation techniques in Pandas.

# Frequency of each Credit_History value
temp1 = df['Credit_History'].value_counts(ascending=True)
# Share of approved loans per Credit_History value (Y coded as 1, N as 0)
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'],
                       aggfunc=lambda x: x.map({'Y': 1, 'N': 0}).mean())
print('Frequency Table for Credit History:')
print(temp1)

print('\nProbability of getting loan for each Credit History class:')
print(temp2)


Now we can observe that we get a pivot table similar to the MS Excel one. This can be plotted as a bar chart using the matplotlib library with the following code:

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8, 4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")
temp1.plot(kind='bar', ax=ax1)  # draw on the left subplot explicitly

ax2 = fig.add_subplot(122)
temp2.plot(kind='bar', ax=ax2)  # draw on the right subplot explicitly
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")


This shows that the chances of getting a loan are eight-fold if the applicant has a valid credit history. You can plot similar graphs by Married, Self_Employed, Property_Area, etc.
Alternately, these two plots can also be visualized by combining them in a stacked chart:

temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)


You can also add gender into the mix (similar to the pivot table in Excel):
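
One possible way to do this, sketched on invented rows (only the column names follow the loan data set, and the name temp4 is an assumption that continues the temp1 to temp3 naming), is to pass two row variables to pd.crosstab:

```python
import pandas as pd

# Invented rows standing in for the loan data set.
df = pd.DataFrame({
    'Credit_History': [1.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male'],
    'Loan_Status': ['Y', 'Y', 'N', 'N', 'Y', 'N'],
})

# Rows: (Credit_History, Gender) pairs; columns: Loan_Status counts.
temp4 = pd.crosstab([df['Credit_History'], df['Gender']], df['Loan_Status'])
print(temp4)

# With matplotlib available, the stacked bar chart call used for temp3 applies here too:
# temp4.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)
```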

If you have not realized already, we have just created two basic classification algorithms here, one based on credit history, the other on 2 categorical variables (including gender). You can quickly code this to create your first submission on AV Datahacks.
We just saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) would have increased by now, given the amount of help the library can provide you in analyzing datasets.
Next, let's explore the ApplicantIncome and LoanStatus variables further, perform data munging and create a dataset for applying various modeling techniques. I would strongly urge that you take another dataset and problem and go through an independent example before reading further.

4. Data Munging in Python: Using Pandas
For those who have been following, here are your must-wear shoes to start running.

Data munging: recap of the need
During our exploration of the data, we found a few problems in the data set, which need to be solved before the data is ready for a good model. This exercise is typically referred to as "Data Munging". Here are the problems we are already aware of:
1. There are missing values in some variables. We should estimate those values wisely depending on the amount of missing values and the expected importance of the variables.
2. While looking at the distributions, we saw that ApplicantIncome and LoanAmount seemed to contain extreme values at either end. Though they might make intuitive sense, they should be treated appropriately.

In addition to these problems with numerical fields, we should also look at the non-numerical fields, i.e. Gender, Property_Area, Married, Education and Dependents, to see if they contain any useful information.
If you are new to Pandas, I would recommend reading this article before moving on. It details some useful techniques of data manipulation.

Check missing values in the dataset
Let us look at missing values in all the variables, because most of the models don't work with missing data and even if they do, imputing them helps more often than not. So, let us check the number of nulls / NaNs in the dataset:

df.apply(lambda x: sum(x.isnull()), axis=0)

This command should tell us the number of missing values in each column, as isnull() returns True (which counts as 1 when summed) if the value is null.
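
A small sketch of this count on invented rows, assuming pandas and NumPy are installed. It also shows df.isnull().sum(), a shorter pandas idiom that should give the same per-column counts:

```python
import numpy as np
import pandas as pd

# Invented rows with a few gaps, standing in for the loan data set.
df = pd.DataFrame({
    'LoanAmount': [120.0, np.nan, 141.0, np.nan],
    'Gender': ['Male', None, 'Female', 'Male'],
    'ApplicantIncome': [5000, 3000, 4000, 6000],
})

missing_apply = df.apply(lambda x: sum(x.isnull()), axis=0)   # the form used above
missing_short = df.isnull().sum()                             # shorter equivalent

print(missing_apply)
print(missing_short)
```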

Though the missing values are not very high in number, many variables have them and each one of these should be estimated and added to the data. Get a detailed view on different imputation techniques through this article.
Note: Remember that missing values may not always be NaNs. For instance, if the Loan_Amount_Term is 0, does it make sense or would you consider that missing? I suppose your answer is missing, and you're right. So we should check for values which are impractical.
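
A hedged sketch of such a check on invented values. Treating a term of 0 months as missing is the assumption taken from the note above, and recoding it as NaN is just one possible way to handle it:

```python
import numpy as np
import pandas as pd

# Invented values; a term of 0 months is treated as impractical here.
df = pd.DataFrame({'Loan_Amount_Term': [360.0, 0.0, 180.0, np.nan, 0.0]})

zero_terms = (df['Loan_Amount_Term'] == 0).sum()      # count of impractical zeros

# Recode zeros as NaN so that they are counted with the other missing values.
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].replace(0, np.nan)
missing_terms = df['Loan_Amount_Term'].isnull().sum()

print(zero_terms)
print(missing_terms)
```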

How to fill missing values in LoanAmount?

