Machine Learning with
Python
The Complete Course
TELCOMA
Copyright © TELCOMA. All Rights Reserved
Module 2
Data Scientist’s Toolbox
Content:
1. Python - Quick recap
2. Python 2.7.x or 3.x ?
3. Installation and setup
4. Data types, functions and important packages
5. Data manipulation & Data Engineering
6. Data Visualization
Quick recap
Python gained a lot of traction in data science with the launch of pandas, SciPy and scikit-learn
• Highly popular, open-source, general-purpose programming language
• Extensive production support
• Preferred choice for deep learning
• Expected to soon overtake R as the preferred language for data scientists
Popular packages
pandas, NumPy, Matplotlib, SciPy, statsmodels, scikit-learn
Python 2.7.x or 3.x?
Python 2.x is legacy, Python 3.x is the present and future of the language
Few Differences
Python 2.7.x
• Has more libraries
• Extensive 3rd party module support
• No new major releases
Python 3.x
• Cutting edge: all new features will be added to 3.x
• Limited 3rd party module support
• Is under active development
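Two well-known language-level differences can be checked directly in a Python 3 interpreter. The sketch below is illustrative and not taken from the slides; the helper name describe_division is hypothetical.

```python
import sys


def describe_division(a, b):
    """Return the true-division and floor-division results of a and b."""
    # In Python 3, / is always true division and // is floor division.
    # In Python 2, 7 / 2 on two integers returned 3.
    return a / b, a // b


true_div, floor_div = describe_division(7, 2)

# print is a function in Python 3 (it was a statement in Python 2).
print("Running Python", sys.version_info.major)
print("7 / 2 =", true_div)    # 3.5
print("7 // 2 =", floor_div)  # 3
```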
Installation & setup
Recommended distribution: Anaconda Python 3.5.x
What platforms are supported?
- Windows, Linux and Mac (32 bit and 64 bit versions)
What tools do we use?
- IDE: Spyder / Jupyter Notebooks
Windows Installation
• Download the installer (32 or 64 bit)
• Run the downloaded .exe file and follow the installation wizard
OSX Installation
Graphical Installer
• Download the graphical installer
• Double-click the downloaded .pkg file and follow the installation wizard
Command Line Installer
• Download the command-line installer
• In your terminal window, run the following and follow the instructions: bash Anaconda2-x.x.x-MacOSX-x86_64.sh
Linux Installation
• Download the installer (32 or 64 bit)
• In your terminal window, run the following and follow the instructions: bash Anaconda2-x.x.x-Linux-x86_xx.sh
Note: the Anaconda2 installers ship Python 2.7; for Python 3.x use the corresponding Anaconda3 installer.
https://www.anaconda.com/download/
https://repo.continuum.io/archive/
Data types
Basic data types
Boolean      True, False
Integer      -1, 0, 1 (unlimited precision in Python 3)
Long         1234 (Python 2 only; merged into Integer in Python 3)
Float        3.21456, 6.3
Complex      2+9j (numbers with a real and an imaginary part)
String       'This is a string'
Dictionary   {'A': 'item1', 'B': 'item2'}
File         f = open('path/filename', 'rb')
List         [1, 2, 4, 5]
Set          {1, 'ML', 2}
Tuple        (1, 2, 3, 5)
List vs Tuple vs Set vs Dictionary?
List: use when you need an ordered sequence of homogeneous or heterogeneous items whose values can be changed later in the program
Tuple: use when you need an ordered sequence of heterogeneous items whose values will not be changed later in the program
Set: ideal when you do not need to store duplicates and are not concerned about the order of the items
Dictionary: ideal whenever you need to reference values by key
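The four container types can be compared side by side. This is a minimal sketch with made-up values; none of the variable names come from the slides.

```python
# list: ordered, mutable, allows duplicates
scores = [70, 85, 85, 90]
scores.append(100)  # lists can be changed in place

# tuple: ordered, immutable
point = (3, 4)
tuple_is_immutable = False
try:
    point[0] = 10  # item assignment is not supported on tuples
except TypeError:
    tuple_is_immutable = True

# set: duplicates are dropped and order is not guaranteed
unique_scores = set(scores)

# dictionary: values are referenced by key
lookup = {"A": "item1", "B": "item2"}

print(len(scores), len(unique_scores))  # 5 4
print(tuple_is_immutable)               # True
print(lookup["A"])                      # item1
```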
Demo
Data types continued..
NumPy
NumPy's main object is the homogeneous multidimensional array. It comes with fast math functions that operate on it, and it also provides simple routines for linear algebra and FFT and sophisticated random-number generation.
A sample NumPy array with 11 rows and 3 columns
        0     1     2
 0     12    13    35
 1     14    16    56
 2     15    19    77
 3     17    22    98
 4     18    25   119
 5     20    28   140
 6     21    31   161
 7     23    34   182
 8     24    37   203
 9     26    40   224
10     27    43   245
E.g.
import numpy as np
a = np.arange(15).reshape(3, 5)
a
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
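The capabilities listed for NumPy (fast math on arrays, linear algebra, random-number generation) can be sketched on the same 3 x 5 array. The matrix m and the seed value below are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Integers 0..14 laid out as 3 rows x 5 columns
a = np.arange(15).reshape(3, 5)
print(a.shape)  # (3, 5)

# Vectorised math: operations apply to every element without a Python loop
doubled = a * 2
col_sums = a.sum(axis=0)    # one sum per column
row_means = a.mean(axis=1)  # one mean per row

# Linear algebra: invert a small diagonal matrix (illustrative values)
m = np.array([[2.0, 0.0], [0.0, 4.0]])
inverse = np.linalg.inv(m)

# Random-number generation with a fixed seed for reproducibility
rng = np.random.default_rng(0)
sample = rng.normal(size=3)

print(col_sums)
print(row_means)
print(inverse)
```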
Data types continued..
Pandas
Series
A 1-dimensional labelled array capable of holding any data type.
DataFrame
A 2-dimensional labelled data structure with columns of potentially different types. Analogous to a spreadsheet or SQL table.
Note
The Series is the data structure for a single column of a DataFrame. A DataFrame can be thought of as a collection of Series that share the same index.
A sample pandas DataFrame with 8 rows and 3 columns
Index   Column1   Column2   Column3
1       100       Andy      1/1/1990
2       120       Jake      5/6/1985
3       140       Bill      17/05/2000
4       160       Smith     9/8/1980
5       180       Jane      1/12/1976
6       200       Melvin    17/05/2001
7       220       Roger     5/17/1971
8       240       Rahul     9/19/1966
A sample pandas Series with 8 rows
Index
1       45
2       46
3       47
4       48
5       49
6       50
7       51
8       52
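A Series and a DataFrame like the samples can be built by hand. The sketch below reuses the first three rows of each sample; the variable names are illustrative.

```python
import pandas as pd

# A Series: one labelled column of values
s = pd.Series([45, 46, 47], index=[1, 2, 3])

# A DataFrame: several labelled columns of potentially different types
df = pd.DataFrame(
    {
        "Column1": [100, 120, 140],
        "Column2": ["Andy", "Jake", "Bill"],
        "Column3": ["1/1/1990", "5/6/1985", "17/05/2000"],
    },
    index=[1, 2, 3],
)

# Selecting a single column of a DataFrame returns a Series
col = df["Column1"]
print(type(col).__name__)  # Series
print(df.shape)            # (3, 3)
print(df.dtypes)           # one dtype per column
```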
Download data from - https://www.kaggle.com/ludobenistant/hr-analytics/data
Sample
Functions
Function
A function is a block of organized, reusable code
that is used to perform a single, related action.
Defining a function
E.g.
def functionname(parameters):
    "function_docstring"
    function_suite
    return [expression]
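A concrete function following this template might look like the sketch below. The name average and its ndigits parameter are hypothetical, chosen only for illustration.

```python
def average(values, ndigits=2):
    """Return the arithmetic mean of values, rounded to ndigits decimals."""
    total = sum(values)
    return round(total / len(values), ndigits)


# Call with the default rounding, then override it
print(average([1, 2, 4]))     # 2.33
print(average([1, 2, 4], 1))  # 2.3
```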
Demo
Data Manipulation
and Engineering
Most elementary data manipulation exercises
- Reading CSV data from local system
- Exploring length and breadth of data
head/tail/shape/columns
- CRUD
- CR : add new columns/ create new dataframes
- U : update columns/filter
- D : delete columns
Data Engineering : Transform data
String manipulation
Data rollup (groupby)
Merge, Join, Pivot, concat
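The operations listed above can be sketched with pandas on a tiny table. The data, column names and the locations table are made up for illustration, and an in-memory string stands in for a local CSV file; with a real file, pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Stand-in for a CSV file on the local system (hypothetical data)
csv_text = """employee,department,salary
Andy,sales,100
Jake,sales,120
Bill,hr,140
Jane,hr,180
"""
df = pd.read_csv(io.StringIO(csv_text))

# Explore the length and breadth of the data
print(df.head(2))
print(df.shape)
print(list(df.columns))

# Create: add a new column
df["bonus"] = df["salary"] * 0.1
# Update: change values in the rows that match a condition
df.loc[df["department"] == "hr", "salary"] += 10
# Filter: keep only the rows that match a condition
high = df[df["salary"] > 110]
# Delete: drop a column
df = df.drop(columns=["bonus"])

# String manipulation on a text column
df["employee"] = df["employee"].str.upper()

# Data rollup: total salary per department
totals = df.groupby("department")["salary"].sum()

# Merge with a second table, stack two tables, and pivot
locations = pd.DataFrame({"department": ["sales", "hr"], "city": ["Pune", "Delhi"]})
merged = df.merge(locations, on="department", how="left")
stacked = pd.concat([df, df], ignore_index=True)
pivot = merged.pivot_table(index="department", values="salary", aggfunc="mean")
print(totals)
print(pivot)
```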
Data Visualization
Chart types shown: scatter plot, bar chart, line chart, histogram
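The four chart types can be drawn with Matplotlib in a single figure. This is a minimal sketch with made-up data; the output file name chart_types.png is an assumption, and the Agg backend is chosen only so the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter or Spyder
import matplotlib.pyplot as plt

# Illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

# One panel per chart type
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].scatter(x, y)
axes[0, 0].set_title("Scatter plot")
axes[0, 1].bar(x, y)
axes[0, 1].set_title("Bar chart")
axes[1, 0].plot(x, y)
axes[1, 0].set_title("Line chart")
axes[1, 1].hist(y, bins=3)
axes[1, 1].set_title("Histogram")

fig.tight_layout()
fig.savefig("chart_types.png")
```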
Demo
Next Module :
Exploratory Data Analysis,
Feature Engineering &
Hypothesis Testing