A Complete Tutorial To Learn Data Science With Python From Scratch PDF

This document provides an introduction and table of contents for a tutorial on learning data science with Python from scratch. The introduction discusses the author's motivation for creating Python resources due to a lack of available guides. The table of contents outlines 5 sections that will be covered: 1) Basics of Python for data analysis, 2) Python libraries and data structures, 3) Exploratory analysis in Python using Pandas, 4) Data munging in Python using Pandas, and 5) Building predictive models in Python. Section 1 discusses why Python is useful for data science and how to install Python and run basic programs.

Uploaded by

Teodor von Burg


3/6/2016

A Complete Tutorial to Learn Data Science with Python from Scratch

Introduction
It happened a few years back. After working with SAS for more than 5 years, I decided to move out of my comfort zone. Being a data scientist, my hunt for other useful tools was ON! Fortunately, it didn't take me long to decide: Python was my appetizer.
I always had an inclination towards coding. This was the time to do what I really loved: code. As it turned out, coding was quite easy!
I learned the basics of Python within a week. Since then, I've not only explored the language in depth, but have also helped many others learn it. Python was originally a general-purpose language. But, over the years, with strong community support, it has gained dedicated libraries for data analysis and predictive modeling.
Due to the lack of resources on Python for data science, I decided to create this tutorial to help many others learn Python faster. In this tutorial, we will take bite-sized pieces of information about how to use Python for data analysis, chew on them till we are comfortable, and practice them at our own end.

Table of Contents

http://www.analyticsvidhya.com/blog/2016/01/completetutoriallearndatasciencepythonscratch2/


1. Basics of Python for Data Analysis
    Why learn Python for data analysis?
    Python 2.7 v/s 3.4
    How to install Python?
    Running a few simple programs in Python
2. Python libraries and data structures
    Python Data Structures
    Python Iteration and Conditional Constructs
    Python Libraries
3. Exploratory analysis in Python using Pandas
    Introduction to series and dataframes
    Analytics Vidhya dataset: Loan Prediction Problem
4. Data Munging in Python using Pandas
5. Building a Predictive Model in Python
    Logistic Regression
    Decision Tree
    Random Forest

Let's get started!

1. Basics of Python for Data Analysis
Why learn Python for data analysis?
Python has gathered a lot of interest recently as a language of choice for data analysis. I compared it against SAS and R some time back. Here are some reasons that go in favour of learning Python:


Open source and free to install
Awesome online community
Very easy to learn
Can become a common language for data science and for production of web-based analytics products

Needless to say, it has a few drawbacks too:
It is an interpreted language rather than a compiled language, and hence might take up more CPU time. However, given the savings in programmer time (due to ease of learning), it might still be a good choice.

Python 2.7 v/s 3.4
This is one of the most debated topics in Python. You will invariably cross paths with it, especially if you are a beginner. There is no right or wrong choice here. It depends entirely on the situation and your needs. I will try to give you some pointers to help you make an informed choice.

Why Python 2.7?
1. Awesome community support! This is something you'd need in your early days. Python 2 was released in late 2000 and has been in use for more than 15 years.
2. A plethora of third-party libraries! Though many libraries have added 3.x support, a large number of modules still work only on 2.x versions. If you plan to use Python for specific applications like web development with high reliance on external modules, you might be better off with 2.7.
3. Some features of the 3.x versions are backward compatible and can work with version 2.7.

Why Python 3.4?
1. Cleaner and faster! Python developers have fixed some inherent glitches and minor drawbacks in order to set a stronger foundation for the future. These might not be very relevant initially, but will matter eventually.
2. It is the future! 2.7 is the last release of the 2.x family, and eventually everyone has to shift to 3.x versions. Python 3 has released stable versions for the past 5 years and will continue to do so.

There is no clear winner, but I suppose the bottom line is that you should focus on learning Python as a language. Shifting between versions should just be a matter of time. Stay tuned for a dedicated article on Python 2.x vs 3.x in the near future!


How to install Python?
There are 2 approaches to installing Python:
You can download Python directly from its project site and install the individual components and libraries you want.
Alternatively, you can download and install a package which comes with pre-installed libraries. I would recommend downloading Anaconda. Another option could be Enthought Canopy Express.

The second method provides a hassle-free installation, and hence I'll recommend it to beginners. The limitation of this approach is that you have to wait for the entire package to be upgraded, even if you are interested in the latest version of a single library. It should not matter unless you are doing cutting-edge statistical research.

Choosing a development environment
Once you have installed Python, there are various options for choosing an environment. Here are the 3 most common options:
Terminal / Shell based
IDLE (default environment)
iPython notebook, similar to markdown in R


IDLE editor for Python
While the right environment depends on your needs, I personally prefer iPython Notebooks a lot. They provide a lot of good features for documenting while writing the code itself, and you can choose to run the code in blocks (rather than line-by-line execution).
We will use the iPython environment for this complete tutorial.

Warming up: Running your first Python program
You can use Python as a simple calculator to start with:
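The screenshot that showed this step is not reproduced here, so below is a minimal sketch of the kind of arithmetic you might type into a notebook cell. The numbers and variable names are illustrative assumptions, not the original example, and print() is written as a function so the cell also runs on Python 3.

```python
# Using Python as a calculator (all values are illustrative)
total = 2 + 3          # addition
product = 4 * 5        # multiplication
power = 2 ** 10        # exponentiation
quotient = 7 / 2.0     # division; the 2.0 gives a float result on Python 2 as well
remainder = 7 % 2      # modulo: what is left after integer division

print(total)
print(product)
print(power)
print(quotient)
print(remainder)
```

In a notebook you can also type an expression such as 2 + 3 on its own and the result appears next to Out[ ].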


A few things to note:
You can start the iPython notebook by typing ipython notebook on your terminal / cmd, depending on the OS you are working on.
You can name an iPython notebook by simply clicking on the name (Untitled0 in the above screenshot).
The interface shows In [*] for inputs and Out[*] for outputs.
You can execute code by pressing Shift + Enter, or ALT + Enter if you want to insert an additional row after.

Before we deep dive into problem solving, let's take a step back and understand the basics of Python. As we know, data structures and iteration and conditional constructs form the crux of any language. In Python, these include lists, strings, tuples, dictionaries, for loop, while loop, if-else, etc. Let's take a look at some of these.

2. Python libraries and Data Structures
Python Data Structures
Following are some data structures which are used in Python. You should be familiar with them in order to use them as appropriate.


Lists: Lists are one of the most versatile data structures in Python. A list can be defined simply by writing comma-separated values in square brackets. Lists might contain items of different types, but usually the items all have the same type. Python lists are mutable, and individual elements of a list can be changed.

Here is a quick example to define a list and then access it:
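The original screenshot is missing from this copy, so here is a small stand-in sketch. The list contents and variable names are my own assumptions, chosen only to show definition, indexing, slicing and in-place change.

```python
# Define a list of squares (illustrative values)
squares_list = [0, 1, 4, 9, 16, 25]

first = squares_list[0]        # indexing starts at 0
middle = squares_list[2:4]     # slicing returns a new list: items at index 2 and 3
last = squares_list[-1]        # negative indices count from the end

# Lists are mutable, so an element can be replaced in place
squares_list[1] = 100

print(first, middle, last)
print(squares_list)
```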

Strings: Strings can be defined simply by use of single ('), double (") or triple (''' or """) inverted commas. Strings enclosed in triple quotes can span multiple lines and are used frequently in docstrings (Python's way of documenting functions). \ is used as an escape character. Please note that Python strings are immutable, so you cannot change part of a string.
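A short sketch of these points follows. The strings and variable names are illustrative assumptions; the try/except is only there to show that changing a character in place is rejected.

```python
greeting = 'Hello'
name = "World"
multiline = """This string
spans two lines"""

message = greeting + ', ' + name      # concatenation builds a new string
first_char = greeting[0]              # indexing works as with lists
escaped = 'It\'s'                     # the backslash escapes the quote

# Strings are immutable: item assignment raises TypeError
try:
    greeting[0] = 'J'
    changed = True
except TypeError:
    changed = False

print(message, first_char, escaped, changed)
```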


Tuples: A tuple is represented by a number of values separated by commas. Tuples are immutable, and the output is surrounded by parentheses so that nested tuples are processed correctly. Additionally, even though tuples are immutable, they can hold mutable data if needed.

Since tuples are immutable and cannot change, they are faster to process than lists. Hence, if your list is unlikely to change, you should use tuples instead of lists.
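Here is a minimal sketch of the tuple behaviour described above. The values and names are assumptions for illustration, and the sketch does not try to measure the speed difference.

```python
point = 2, 3                 # the commas make the tuple; parentheses are optional here
nested = (point, (4, 5))     # tuples can be nested

# Tuples are immutable: item assignment raises TypeError
try:
    point[0] = 10
    changed = True
except TypeError:
    changed = False

# A tuple can still hold a mutable object, and that object can change
record = ('scores', [70, 85])
record[1].append(90)

print(point, nested, changed, record)
```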


Dictionary: A dictionary is an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary). A pair of braces creates an empty dictionary: {}.
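As a quick sketch of how a dictionary is used, assuming a made-up phone-extension lookup (the names and numbers are hypothetical):

```python
extensions = {'Kunal': 9073, 'Tavish': 9128}   # hypothetical names and numbers
extensions['Sunil'] = 9223                      # assigning to a new key adds a pair
extensions['Kunal'] = 9000                      # assigning to an existing key overwrites it

kunal_ext = extensions['Kunal']                 # lookup by key
has_tavish = 'Tavish' in extensions             # the membership test checks keys
all_keys = sorted(extensions.keys())            # sorted, since key order should not be relied on
empty = {}

print(kunal_ext, has_tavish, all_keys)
```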

Python Iteration and Conditional Constructs
Like most languages, Python also has a for-loop, which is the most widely used method for iteration. It has a simple syntax:


for i in [Python Iterable]:
    expression(i)

Here "Python Iterable" can be a list, tuple or other advanced data structure, which we will explore in later sections. Let's take a look at a simple example, determining the factorial of a number.

N = 5  # example value; any non-negative integer works
fact = 1
for i in range(1, N + 1):
    fact *= i

Coming to conditional statements, these are used to execute code fragments based on a condition. The most commonly used construct is if-else, with the following syntax:

if [condition]:
    __execution if true__
else:
    __execution if false__

For instance, if we want to print whether the number N is even or odd:

if N % 2 == 0:
    print('Even')
else:
    print('Odd')

Now that you are familiar with Python fundamentals, let's take a step further. What if you have to perform the following tasks:
1. Multiply 2 matrices
2. Find the roots of a quadratic equation
3. Plot bar charts and histograms
4. Make statistical models
5. Access web pages

If you try to write the code from scratch, it's going to be a nightmare and you won't stay on Python for more than 2 days! But let's not worry about that. Thankfully, there are many libraries with predefined functions which we can directly import into our code to make our life easy.
For example, consider the factorial example we just saw. We can do that in a single step as:

import math
math.factorial(N)

Of course, we need to import the math library for that. Let's explore the various libraries next.

Python Libraries
Let's take one step ahead in our journey to learn Python by getting acquainted with some useful libraries. The first step is obviously to learn to import them into our environment. There are several ways of doing so in Python:

import math as m

from math import *

In the first manner, we have defined an alias m for the library math. We can now use various functions from the math library (e.g. factorial) by referencing them through the alias: m.factorial().
In the second manner, you have imported the entire namespace of math, i.e. you can directly use factorial() without referring to math.
Tip: Google recommends that you use the first style of importing libraries, as you will know where the functions have come from.
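To see the difference side by side, here is a small sketch. Note one deliberate change: instead of the star import it uses from math import factorial, a narrower variant that pulls in a single name, so the example stays valid wherever it is pasted. The variable names are assumptions.

```python
import math as m            # style 1: alias; the origin of each function stays visible
from math import factorial  # narrower variant of style 2: import one name directly

via_alias = m.factorial(5)  # the prefix shows that factorial comes from math
direct = factorial(5)       # no prefix needed, but the origin is less obvious
root = m.sqrt(16)

print(via_alias, direct, root)
```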


Following is a list of libraries you will need for any scientific computation and data analysis:
NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low-level languages like Fortran, C and C++.
SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high-level science and engineering modules like discrete Fourier transform, linear algebra, optimization and sparse matrices.
Matplotlib for plotting a vast variety of graphs, from histograms to line plots to heat plots. You can use the Pylab feature in ipython notebook (ipython notebook --pylab=inline) to use these plotting features inline. If you ignore the inline option, then pylab converts the ipython environment to an environment very similar to Matlab. You can also use LaTeX commands to add math to your plot.
Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to Python and has been instrumental in boosting Python's usage in the data scientist community.
Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.
Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central part of exploring and understanding data.
Bokeh for creating interactive plots, dashboards and data applications on modern web browsers. It empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the capability of high-performance interactivity over very large or streaming datasets.
Blaze for extending the capability of NumPy and Pandas to distributed and streaming datasets. It can be used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective visualizations and dashboards on huge chunks of data.
Scrapy for web crawling. It is a very useful framework for getting specific patterns of data. It has the capability to start at a website home URL and then dig through web pages within the website to gather information.
SymPy for symbolic computation. It has wide-ranging capabilities, from basic symbolic arithmetic to calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability of formatting the result of the computations as LaTeX code.
Requests for accessing the web. It works similarly to the standard Python library urllib2 but is much easier to code. You will find subtle differences from urllib2, but for beginners, Requests might be more convenient.

Additional libraries you might need:
os for operating system and file operations
networkx and igraph for graph-based data manipulations
regular expressions for finding patterns in text data
BeautifulSoup for scraping the web. It is inferior to Scrapy, as it will extract information from just a single web page in a run.
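To connect this list back to the tasks mentioned earlier, here is a minimal NumPy sketch covering two of them: multiplying 2 matrices and finding the roots of a quadratic equation. The matrices and the equation are made-up examples, and other libraries from the list could do the same job.

```python
import numpy as np

# Two small matrices (illustrative values)
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
product = a.dot(b)                 # matrix product, not element-wise multiplication

# Roots of x^2 - 3x + 2 = 0; coefficients are given from the highest power down
roots = np.sort(np.roots([1, -3, 2]))

print(product)
print(roots)
```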

Now that we are familiar with Python fundamentals and additional libraries, let's take a deep dive into problem solving through Python. Yes, I mean making a predictive model! In the process, we will use some powerful libraries and also come across the next level of data structures. We will take you through the 3 key phases:
1. Data Exploration: finding out more about the data we have
2. Data Munging: cleaning the data and playing with it to make it better suit statistical modeling
3. Predictive Modeling: running the actual algorithms and having fun

3. Exploratory analysis in Python using Pandas
In order to explore our data further, let me introduce you to another animal (as if Python was not enough!): Pandas


Image Source: Wikipedia
Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird, but hang on!). It has been instrumental in increasing the use of Python in the data science community. We will now use Pandas to read a data set from an Analytics Vidhya competition, perform exploratory analysis and build our first basic categorization algorithm for solving this problem.
Before loading the data, let's understand the 2 key data structures in Pandas: Series and DataFrames.

Introduction to Series and Dataframes
A Series can be understood as a 1-dimensional labelled / indexed array. You can access individual elements of a series through these labels.
A dataframe is similar to an Excel workbook: you have column names referring to columns, and you have rows, which can be accessed with the use of row numbers. The essential difference is that column names and row numbers are known as the column and row index in the case of dataframes.
Series and dataframes form the core data model for Pandas in Python. The data sets are first read into dataframes, and then various operations (e.g. group by, aggregation etc.) can be applied very easily to their columns.
More: 10 Minutes to Pandas
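A minimal sketch of both structures may help before we load real data. The values below are made up and only borrow column names from the loan data set; df_demo and the other names are assumptions for illustration.

```python
import pandas as pd

# A Series: one-dimensional, with labels as its index
incomes = pd.Series([5849, 4583, 3000], index=['a', 'b', 'c'])
income_b = incomes['b']                      # access an element by its label

# A DataFrame: named columns plus a row index (values are made up)
df_demo = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female'],
    'ApplicantIncome': [5849, 4583, 3000, 2583],
})

# A group-by followed by an aggregation on one column
mean_by_gender = df_demo.groupby('Gender')['ApplicantIncome'].mean()

print(income_b)
print(mean_by_gender)
```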


Practice data set: Loan Prediction Problem
You can download the dataset from here. Here is the description of the variables:

VARIABLE DESCRIPTIONS:
Variable             Description
Loan_ID              Unique Loan ID
Gender               Male / Female
Married              Applicant married (Y/N)
Dependents           Number of dependents
Education            Applicant Education (Graduate / Under Graduate)
Self_Employed        Self employed (Y/N)
ApplicantIncome      Applicant income
CoapplicantIncome    Coapplicant income
LoanAmount           Loan amount in thousands
Loan_Amount_Term     Term of loan in months
Credit_History       Credit history meets guidelines
Property_Area        Urban / Semi Urban / Rural
Loan_Status          Loan approved (Y/N)

Let's begin with exploration
To begin, start the iPython interface in Inline Pylab mode by typing the following on your terminal / Windows command prompt:

ipython notebook --pylab=inline


This opens up the iPython notebook in the pylab environment, which has a few useful libraries already imported. Also, you will be able to plot your data inline, which makes this a really good environment for interactive data analysis. You can check whether the environment has loaded correctly by typing the following command (and getting the output as seen in the figure below):

plot(arange(5))

I am currently working in Linux, and have stored the dataset in the following location:
/home/kunal/Downloads/Loan_Prediction/train.csv

Importing libraries and the data set:
Following are the libraries we will use during this tutorial:
numpy
matplotlib
pandas


Please note that you do not need to import matplotlib and numpy because of the Pylab environment. I have still kept them in the code, in case you use the code in a different environment.
After importing the libraries, you read the dataset using the function read_csv(). This is how the code looks up to this stage:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reading the dataset into a dataframe using Pandas
df = pd.read_csv("/home/kunal/Downloads/Loan_Prediction/train.csv")

Quick Data Exploration
Once you have read the dataset, you can have a look at the top few rows by using the function head():

df.head(10)


This should print 10 rows. Alternatively, you can also look at more rows by printing the dataset.
Next, you can look at a summary of the numerical fields by using the describe() function:

df.describe()


The describe() function provides count, mean, standard deviation (std), min, quartiles and max in its output. (Read this article to refresh basic statistics to understand population distribution.)
Here are a few inferences you can draw by looking at the output of the describe() function:
1. LoanAmount has (614 - 592) 22 missing values.
2. Loan_Amount_Term has (614 - 600) 14 missing values.
3. Credit_History has (614 - 564) 50 missing values.
4. We can also see that about 84% of applicants have a credit history. How? The mean of the Credit_History field is 0.84. (Remember, Credit_History has value 1 for those who have a credit history and 0 otherwise.)
5. The ApplicantIncome distribution seems to be in line with expectations. Same with CoapplicantIncome.

Please note that we can get an idea of a possible skew in the data by comparing the mean to the median, i.e. the 50% figure.
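The arithmetic behind these inferences can be sketched on a tiny made-up frame. The frame toy and its values are assumptions standing in for the loan data; only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the loan data
toy = pd.DataFrame({
    'LoanAmount': [120.0, np.nan, 66.0, 700.0, 141.0],
    'Credit_History': [1.0, 1.0, 0.0, np.nan, 1.0],
})

summary = toy.describe()

# describe() counts only non-null entries, so rows minus count gives the missing values
missing = len(toy) - summary.loc['count']

# A mean well above the median (the 50% row) hints at a right skew
mean_loan = summary.loc['mean', 'LoanAmount']
median_loan = summary.loc['50%', 'LoanAmount']

print(missing)
print(mean_loan, median_loan)
```

On the real data set, the same subtraction with len(df) in place of len(toy) gives the figures quoted in the list above.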
For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can look at the frequency distribution to understand whether they make sense or not. The frequency table can be printed by the following command:

df['Property_Area'].value_counts()

Similarly, we can look at the unique values of Credit_History. Note that dfname['column_name'] is a basic indexing technique to access a particular column of the dataframe. It can be a list of columns as well. For more information, refer to the 10 Minutes to Pandas resource shared above.
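Here is a short sketch of both forms of indexing, again on a made-up frame (toy and its values are assumptions; the column names follow the loan data).

```python
import pandas as pd

toy = pd.DataFrame({
    'Property_Area': ['Urban', 'Rural', 'Urban', 'Semiurban'],
    'Credit_History': [1.0, 0.0, 1.0, 1.0],
    'LoanAmount': [128, 66, 120, 141],
})

one_column = toy['Property_Area']                      # a single name gives a Series
two_columns = toy[['Property_Area', 'LoanAmount']]     # a list of names gives a DataFrame
area_counts = toy['Property_Area'].value_counts()      # frequency table
credit_values = sorted(toy['Credit_History'].unique()) # distinct values in the column

print(area_counts)
print(credit_values)
```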

Distribution analysis
Now that we are familiar with basic data characteristics, let us study the distribution of various variables. Let us start with the numeric variables, namely ApplicantIncome and LoanAmount.
Let's start by plotting the histogram of ApplicantIncome using the following command:


df['ApplicantIncome'].hist(bins=50)

Here we observe that there are a few extreme values. This is also the reason why 50 bins are required to depict the distribution clearly.
Next, we look at box plots to understand the distributions. A box plot for ApplicantIncome can be plotted by:

df.boxplot(column='ApplicantIncome')


This confirms the presence of a lot of outliers / extreme values. This can be attributed to the income disparity in society. Part of this can be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:

df.boxplot(column='ApplicantIncome', by='Education')

We can see that there is no substantial difference between the median income of graduates and non-graduates (the line inside each box marks the median). But there are a higher number of graduates with very high incomes, which appear to be the outliers.
Now, let's look at the histogram and box plot of LoanAmount using the following commands:

df['LoanAmount'].hist(bins=50)

df.boxplot(column='LoanAmount')


Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some amount of data munging. LoanAmount has missing as well as extreme values, while ApplicantIncome has a few extreme values, which demand deeper understanding. We will take this up in the coming sections.

Categorical variable analysis
Now that we understand the distributions of ApplicantIncome and LoanAmount, let us understand the categorical variables in more detail. We will use an Excel-style pivot table and cross-tabulation. For instance, let us look at the chances of getting a loan based on credit history. This can be achieved in MS Excel using a pivot table as:


Note: here loan status has been coded as 1 for Yes and 0 for No. So the mean represents the probability of getting a loan.
Now we will look at the steps required to generate a similar insight using Python. Please refer to this article for getting the hang of the different data manipulation techniques in Pandas.

temp1 = df['Credit_History'].value_counts(ascending=True)
# Map Y/N to 1/0 so that the mean is the share of approved loans
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'],
                       aggfunc=lambda x: x.map({'Y': 1, 'N': 0}).mean())
print('Frequency Table for Credit History:')
print(temp1)

print('\nProbability of getting loan for each Credit History class:')
print(temp2)


Now we can observe that we get a pivot table similar to the MS Excel one. This can be plotted as a bar chart using the matplotlib library with the following code:

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8, 4))

# Left panel: number of applicants in each Credit_History class
ax1 = fig.add_subplot(121)
temp1.plot(kind='bar', ax=ax1)
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")

# Right panel: share of approved loans in each class
ax2 = fig.add_subplot(122)
temp2.plot(kind='bar', ax=ax2)
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")


This shows that the chances of getting a loan are eight-fold if the applicant has a valid credit history. You can plot similar graphs by Married, Self-Employed, Property_Area, etc.
Alternatively, these two plots can also be visualized by combining them in a stacked chart:

temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)


You can also add gender into the mix (similar to the pivot table in Excel):
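The code and chart for this step are missing from this copy. One way it could be done, as a sketch under my own assumptions, is to pass two row keys to pd.crosstab. The frame toy and the name temp4 are made up for illustration; on the real data you would use df instead.

```python
import pandas as pd

# Made-up rows standing in for the loan data
toy = pd.DataFrame({
    'Credit_History': [1.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male'],
    'Loan_Status': ['Y', 'Y', 'N', 'N', 'Y', 'N'],
})

# Rows are (Credit_History, Gender) pairs; columns are the loan outcomes
temp4 = pd.crosstab([toy['Credit_History'], toy['Gender']], toy['Loan_Status'])

print(temp4)
```

Calling temp4.plot(kind='bar', stacked=True) in the same way as for temp3 should then draw one stacked bar per pair.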

If you have not realized already, we have just created two basic classification algorithms here: one based on credit history, and the other on 2 categorical variables (including gender). You can quickly code this to create your first submission on AV Datahacks.
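As a rough sketch of what "coding this" could look like for the first rule, assuming we simply predict approval whenever the credit history meets the guidelines (the frame toy, the Predicted column and the accuracy check are my own illustration, not the competition's submission format):

```python
import pandas as pd

# Made-up rows standing in for the loan data
toy = pd.DataFrame({
    'Credit_History': [1.0, 0.0, 1.0, 0.0, 1.0],
    'Loan_Status': ['Y', 'N', 'Y', 'Y', 'N'],
})

# Rule: predict approval exactly when the credit history meets the guidelines
toy['Predicted'] = toy['Credit_History'].map({1.0: 'Y', 0.0: 'N'})

# Share of rows where the rule agrees with the actual outcome
accuracy = (toy['Predicted'] == toy['Loan_Status']).mean()

print(toy)
print(accuracy)
```

Rows with a missing Credit_History would get no prediction from this rule, which is one more reason to handle missing values first.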
We just saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) has increased by now, given the amount of help the library can provide you in analyzing datasets.
Next, let's explore the ApplicantIncome and LoanStatus variables further, perform data munging and create a dataset for applying various modeling techniques. I would strongly urge that you take another dataset and problem and go through an independent example before reading further.

4. Data Munging in Python: Using Pandas
For those who have been following, here are your must-wear shoes to start running.

Data munging: recap of the need
During our exploration of the data, we found a few problems in the data set which need to be solved before the data is ready for a good model. This exercise is typically referred to as "Data Munging". Here are the problems we are already aware of:
1. There are missing values in some variables. We should estimate those values wisely, depending on the amount of missing values and the expected importance of the variables.
2. While looking at the distributions, we saw that ApplicantIncome and LoanAmount seemed to contain extreme values at either end. Though they might make intuitive sense, they should be treated appropriately.

In addition to these problems with numerical fields, we should also look at the non-numerical fields, i.e. Gender, Property_Area, Married, Education and Dependents, to see if they contain any useful information.
If you are new to Pandas, I would recommend reading this article before moving on. It details some useful techniques of data manipulation.

Check missing values in the dataset
Let us look at missing values in all the variables, because most models don't work with missing data, and even if they do, imputing them helps more often than not. So, let us check the number of nulls / NaNs in the dataset:

df.apply(lambda x: sum(x.isnull()), axis=0)

This command should tell us the number of missing values in each column, as isnull() returns True (which counts as 1 when summed) if the value is null.

Though the missing values are not very high in number, many variables have them, and each one of these should be estimated and added to the data. Get a detailed view of different imputation techniques through this article.
Note: Remember that missing values may not always be NaNs. For instance, if the Loan_Amount_Term is 0, does it make sense, or would you consider that missing? I suppose your answer is "missing", and you're right. So we should check for values which are impractical.
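A quick sketch of such a check, on a made-up frame (toy and its values are assumptions). Recoding the zeros as NaN is shown as one possible option, not as the treatment this tutorial prescribes.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the loan data
toy = pd.DataFrame({
    'LoanAmount': [120.0, np.nan, 66.0, 141.0],
    'Loan_Amount_Term': [360.0, 0.0, np.nan, 180.0],
})

nan_counts = toy.isnull().sum()                    # NaNs per column
zero_terms = (toy['Loan_Amount_Term'] == 0).sum()  # zeros: not NaN, but not plausible either

# One option (an assumption on my part): recode such zeros as NaN so they are imputed later
toy['Loan_Amount_Term'] = toy['Loan_Amount_Term'].replace(0, np.nan)
nan_after = toy['Loan_Amount_Term'].isnull().sum()

print(nan_counts)
print(zero_terms, nan_after)
```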

How to fill missing values in LoanAmount?


Common questions

Exploring and preparing a dataset in Python involves several key steps. First, examine data structures and statistics using functions like df.head() and df.describe() to obtain summaries and detect missing values or outliers in the data . Data munging is applied to handle missing values and outliers, ensuring data readiness for statistical modeling. Analysis of categorical variables using pivot tables and cross-tabulation provides insights into possible classifications and data patterns . These insights lead to more robust model development and analysis .

You can start an iPython Notebook by typing 'ipython notebook' in your terminal or command prompt, depending on the operating system . iPython Notebooks allow you to execute code in blocks rather than individual lines, which is beneficial for debugging and testing specific portions of code. Additionally, it supports rich text and mathematical notation, making it ideal for documenting along with coding .

Analyzing categorical variables can provide insights into classification patterns, such as the likelihood of loan approval based on credit history or other demographic factors . Python facilitates this analysis using pivot tables and cross-tabulation to calculate frequencies and probabilities, visualized through plots like bar charts. These tools help in identifying trends and correlations within the categorical data, ultimately influencing decision-making and prediction models .

Data munging, applied through techniques in Python, involves cleaning and transforming raw data to make it suitable for analysis. It addresses challenges such as handling missing values, removing outliers, and rectifying data inconsistencies or duplications . This process ensures data quality and accuracy, which are crucial for building reliable statistical and predictive models. Effective munging includes assessing the distribution, checking for null values, and using strategies like imputation or exclusion to prepare the dataset adequately .

Tuples are immutable: once created, their contents cannot be changed. This can make them slightly faster to process and lighter in memory than lists, although the difference is usually small. Prefer a tuple over a list when a collection of items should not be modified, because immutability guarantees that the data stays constant throughout execution and guards against accidental changes.
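A short sketch of tuple behavior, using made-up values. The dictionary-key usage at the end is an extra illustration that is not discussed in the text above; it works because tuples of hashable items are themselves hashable.

```python
point = (3, 4)

# Tuples support unpacking and indexing just like lists.
x, y = point

# Attempting to modify a tuple raises TypeError.
error = None
try:
    point[0] = 10
except TypeError as exc:
    error = str(exc)

# Because a tuple cannot change, it can serve as a dictionary key.
labels = {(0, 0): "origin", point: "target"}

print(x, y, error, labels[(3, 4)])
```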

Web scraping with Python libraries like Scrapy improves data acquisition by automating the extraction and storage of information from web pages. Scrapy lets users gather large amounts of data from many online sources, which can then be processed and analyzed to find patterns and trends relevant to a particular research or business objective.

Libraries are essential for complex computational tasks in Python because they provide ready-made functions and tools, such as n-dimensional arrays and numerical routines (NumPy), scientific computing (SciPy), and data analysis (Pandas). A library can be imported in two ways. 'import math as m' imports the module under the alias m, so its functions are called with the prefix, for example m.sqrt(). 'from math import *' copies all public names from math into the current namespace, so sqrt() can be called directly without a prefix. This modularity keeps code organized and saves you from reimplementing common functionality.
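The two import styles can be compared in a small sketch. It uses 'from math import sqrt' rather than the star form so that only one name is pulled in; the star form behaves the same way for every public name in the module.

```python
# Style 1: import the module under an alias and call functions via the prefix.
import math as m

area = m.pi * 2 ** 2   # area of a circle with radius 2

# Style 2: import a name directly into the current namespace.
# `from math import *` would do this for every public name in math.
from math import sqrt

root = sqrt(16)

print(area, root)
```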

A developer may prefer the alias method of importing libraries because it keeps each library's names in their own namespace, so it is always clear where a function comes from. This practice, recommended by Google, makes code easier to read, debug, and maintain, and it prevents conflicts between identically named functions from different libraries. The alias also gives a short, consistent prefix for frequently used libraries.

Dictionaries in Python are defined using curly braces {} and consist of key-value pairs with unique keys. They can be modified after creation and allow fast lookups by key rather than by position, which distinguishes them from lists and tuples. Since Python 3.7 they also preserve insertion order. Dictionaries are particularly useful when you need to map each item in one set of values to exactly one item in another.
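A brief sketch of defining and manipulating a dictionary. The names and numbers are hypothetical placeholders.

```python
# Hypothetical mapping from a person's name to a phone extension.
extensions = {"alice": 9073, "bob": 9128}

extensions["carol"] = 9223   # add a new key-value pair
extensions["bob"] = 9000     # update the value of an existing key
del extensions["alice"]      # remove a pair by key

# .get() returns a default instead of raising KeyError for a missing key.
lookup = extensions.get("dave", "unknown")

keys = list(extensions)      # keys in insertion order

print(extensions, lookup, keys)
```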

Plots and visualizations turn raw statistics into actionable insights by showing trends, outliers, and patterns that are hard to see in the numbers alone. For instance, histograms and box plots reveal a variable's distribution and skewness, while bar charts display frequencies and probabilities in categorical data, which supports deeper analysis and better-informed decisions.
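As a minimal sketch, the histogram and box plot mentioned above could be drawn with matplotlib as follows. The income values are invented, and the non-interactive "Agg" backend and in-memory buffer are assumptions made so the example runs without a display; in a notebook you would typically show the figure instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

# Hypothetical incomes with one extreme value.
incomes = [2500, 3000, 3200, 4000, 4100, 4500, 5200, 6000, 20000]

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: how many values fall into each of five equal-width bins.
counts, bin_edges, _ = ax_hist.hist(incomes, bins=5)
ax_hist.set_title("Histogram")

# Box plot: median, quartiles, and points flagged as outliers.
ax_box.boxplot(incomes)
ax_box.set_title("Box plot")

# Render the figure to an in-memory PNG.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

The single large value stretches the histogram to the right and appears as an isolated point in the box plot, which is the kind of skew the summary statistics only hint at.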



Tuples: A tuple is represented by a number of values separated by commas. Tuples are immutable

Common questions

Describe the process of exploring and preparing a dataset for statistical modeling using Python, highlighting any key steps involved.

How do you initiate an iPython Notebook on your system and what advantages does it offer for code execution and documentation?

What statistical insights might be drawn from analyzing categorical variables in a dataset and how can Python facilitate this analysis?

How can data munging be effectively applied to clean and prepare data for analysis, and what challenges might it address?

What are the advantages of using tuples over lists in Python, and in what scenarios should tuples be preferred?

In what ways can web scraping with Python libraries like Scrapy enhance data acquisition for analysis?

What is the significance of using libraries in Python for complex computational tasks, and how can these be imported into your workspace?

Why might a developer prefer using the alias method of importing libraries in Python as recommended by Google, and what are its benefits?

How can dictionaries in Python be defined and manipulated, and what are their unique features compared to other data structures?

How does the use of plots and visualizations in data analysis transform raw statistical data into actionable insights?
