Fast and Scalable Python

Travis E. Oliphant, PhD @teoliphant
Python Enthusiast since 1997
Making NumPy- and Pandas-style code faster
and run in parallel.

A few libraries: Python for Data Science
Domains: Machine Learning, Big Data, Visualization, BI / ETL, Scientific computing, CS / Programming
Libraries: Numba, Blaze, Bokeh, Dask

Zen of NumPy / Pandas
• Strided is better than scattered
• Contiguous is better than strided
• Descriptive is better than imperative (use data-types)
• Array-oriented and data-oriented is often better than object-oriented
• Broadcasting is a great idea – use where possible
• Split-apply-combine is a great idea – use where possible
• Vectorized is better than an explicit loop
• Write more ufuncs and generalized ufuncs (numba can help)
• Unless it’s complicated — then use numba
• Think in higher dimensions
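
Two of the points above, broadcasting and vectorization, can be illustrated with a short NumPy sketch. The array shape and variable names are illustrative assumptions, not taken from the talk.

```python
import numpy as np

# Hypothetical data: 4 observations of 3 features.
data = np.arange(12, dtype=np.float64).reshape(4, 3)

# Explicit loop: subtract each column's mean, element by element.
centered_loop = np.empty_like(data)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        centered_loop[i, j] = data[i, j] - data[:, j].mean()

# Vectorized with broadcasting: the (3,) vector of column means is
# applied to all 4 rows without an explicit Python loop.
centered = data - data.mean(axis=0)

assert np.allclose(centered, centered_loop)
```

Both versions produce the same result; the broadcast version pushes the loop into compiled code.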

At the heart is array-oriented
• Array-oriented, column-oriented, and data-oriented programming
means using data-structures that are optimized for the computing
needed.
• Object oriented typically scatters objects throughout system
memory, or separates alike elements in system memory
• Array-Oriented organizes data together where it can stream through
modern multi-core hardware making better use of cache.
• Array-Oriented code can also be parallelized more easily — making
it possible to automatically parallelize code written in this style.
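
A minimal sketch of the contrast described above, assuming a hypothetical `Particle` record with three attributes. It compares one Python object per record with one contiguous NumPy array per attribute.

```python
import numpy as np

class Particle:
    """Object-oriented layout: one Python object per record (hypothetical)."""
    def __init__(self, x, y, mass):
        self.x, self.y, self.mass = x, y, mass

n = 1000
particles = [Particle(float(i), 2.0 * i, 1.5) for i in range(n)]

# Object-oriented: every attribute access follows a pointer to a separate object.
total_oo = sum(p.mass * p.x for p in particles)

# Array-oriented: one contiguous array ("column") per attribute.
x = np.arange(n, dtype=np.float64)
y = 2.0 * x
mass = np.full(n, 1.5)
total_ao = float(np.dot(mass, x))

assert x.flags["C_CONTIGUOUS"]
assert np.isclose(total_oo, total_ao)
```

The column layout lets a single compiled loop stream through memory, which is the property the bullets above rely on.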

[Figure: memory layout. Memory using object-oriented: six separate objects, each holding its own Attr1, Attr2, and Attr3. Memory using array-oriented: one column each for Attr1, Attr2, and Attr3, with rows Object1 through Object6.]

Array-oriented maps to modern chips and is the key to scale
Knights Landing
Up to 72 cores
16GB on package
With PyData projects
we can take full advantage of
this target as well as GPUs
and many machines together.

Scale Up vs Scale Out
[Figure: two axes, Scale Up (bigger nodes) and Scale Out (more nodes). Scale up: big memory & many cores / GPU box. Scale out: many commodity nodes in a cluster. Best of both: e.g. GPU cluster.]

Scale Up vs Scale Out
[Figure: Scale Up (bigger nodes: big memory & many cores / GPU box) versus Scale Out (more nodes: many commodity nodes in a cluster), with best of both being e.g. a GPU cluster. The diagram is annotated with Numba, Dask, and Blaze.]

I spent the first 15 years of my “Python life”
working to build the foundation of the Python
Scientific Ecosystem as a practitioner/developer
PEP 357
PEP 3118
SciPy

I plan to spend the next 15 years ensuring this
ecosystem can scale both up and out as an
entrepreneur, evangelist, and director of
innovation.
CONDA
NUMBA
DASK
DYND
BLAZE
DATA-FABRIC

Rally a community
“Apache” for Open
Data Science
Community-led and directed Conference Series
with sponsorship proceeds going directly to
NumFocus

Organize a company
Purpose: Empower people to solve the world’s
greatest challenges.
Mission: We help people discover, analyze, and collaborate by
connecting their curiosity and experience with any data.

Start working on key technology
Blaze — blaze, odo, datashape, dynd, dask
Bokeh — interactive visualization in web including Jupyter
Numba — Array-oriented Python subset compiler
Conda — Cross-platform package manager
with environments
Bootstrap and seed funding through 2014
VC funded in 2015

is….
the Open Data Science Platform
powered by Python…
the fastest growing open data science language
• Accelerate Time-to-Value
• Connect Data, Analytics & Compute
• Empower Data Science Teams
Create a Platform!
Free and Open Source!

[Figure: the layers of the Open Data Science Platform]
DATA SCIENCE TEAM: Business Analyst, Data Scientist, Developer, Data Engineer, DevOps
APPLICATIONS: Interactive Presentations, Notebooks, Reports & Apps; Solution Templates; Visual Data Exploration; Advanced Spreadsheets; APIs
ANALYTICS: Data Exploration; Querying; Data Prep; Data Connectors; Visual Programming; Notebooks; Analytics Development; Stats; Data Mining; Deep Learning; Machine Learning; Simulation & Optimization; Geospatial; Text & NLP; Graph & Network; Image Analysis; Advanced Analytics
SOFTWARE DEVELOPMENT: IDEs; CI/CD; Package, Dependency, Environment Management; Web & Desktop App Dev
HIGH PERFORMANCE: Distributed Computing; Parallelism & Multi-threading; Compiled Assets; GPUs & Multi-core; Compiler
DATA SCIENCE LANGUAGES: Python, R, Spark | Scala, JavaScript, Java, Fortran, C/C++
DATA: Flat Files (CSV, XLS…), SQL DB, NoSQL, Hadoop, Streaming
HARDWARE: Workstation, Server, Cluster; Cloud, On-premises
OPERATIONS: Governance, Provenance, Security

Free Anaconda now with MKL as default
• Intel MKL (Math Kernel Library) provides optimized
implementations of basic math functions.
• Using MKL gives high performance for basic BLAS,
LAPACK, FFT, and math functions.
• Since version 2.5, the free download of Anaconda ships with MKL
as the default (you can also distribute
binaries linked against these MKL-enhanced tools).

Numba + Dask
Look at all of the data with Bokeh’s datashader.
Decouple the data-processing from the visualization.
Visualize arbitrarily large data.
• E.g. Open Street Map data:
• About 3 billion GPS coordinates
• https://blog.openstreetmap.org/2012/04/01/bulk-gps-point-data/
• This image was rendered in one
minute on a standard MacBook
with 16 GB RAM
• Renders in milliseconds on
several 128 GB Amazon EC2
instances

Categorical data: 2010 US Census
• One point per
person
• 300 million total
• Categorized by
race
• Interactive
rendering with
Numba+Dask
• No pre-tiling

[Figure: two ways to work with data in HDFS]
Spark-based stack: PySpark & SparkR, Ibis, and Impala on the JVM and YARN; batch processing and interactive processing.
Bottom Line
• Leverage Python & R with Spark
Python-native stack: Python & R ecosystem (NumPy, Pandas, … 720+ packages) and MPI, with native read & write; high-performance, interactive, and batch processing.
Bottom Line
5-100X faster overall performance
• Interact with data in HDFS and Amazon S3 natively from Python
• Distributed computations without the JVM & Python/Java serialization
• Framework for easy, flexible parallelism using directed acyclic graphs (DAGs)
• Interactive, distributed computing with in-memory persistence/caching

Overview of Dask as a Parallel
Processing Framework with Distributed

Precursors to Parallelism
• Consider the following approaches first:
1. Use better algorithms
2. Try Numba or C/Cython
3. Store data in efficient formats
4. Subsample your data
• If you have to parallelize:
1. Start with your laptop (4 cores, 16 GB RAM, 1 TB disk)
2. Then a large workstation (24 cores, 1 TB RAM)
3. Finally, scale out to a cluster

Moving from small data to big data
[Figure: a client machine (small data) next to clusters made of a head node and several compute nodes (big data). The diagram is labelled with Dask and Numba.]

Dask Dataframes
Dask
>>> import pandas as pd
>>> df = pd.read_csv('iris.csv')
>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
>>> # The species column holds 'Iris-setosa', so filter on that label
>>> max_sepal_length_setosa = df[df.species == 'Iris-setosa'].sepal_length.max()
>>> max_sepal_length_setosa
5.7999999999999998
>>> import dask.dataframe as dd
>>> ddf = dd.read_csv('*.csv')   # lazily reads every matching CSV file
>>> ddf.head()
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
…
>>> d_max_sepal_length_setosa = ddf[ddf.species == 'Iris-setosa'].sepal_length.max()
>>> d_max_sepal_length_setosa.compute()   # triggers the actual computation
5.7999999999999998

Dask Arrays
>>> import numpy as np
>>> np_ones = np.ones((5000, 1000))
>>> np_ones
array([[ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.],
       ...,
       [ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.]])
>>> np_y = np.log(np_ones + 1)[:5].sum(axis=1)
>>> np_y
array([ 693.14718056, 693.14718056, 693.14718056,
        693.14718056, 693.14718056])
>>> import dask.array as da
>>> da_ones = da.ones((5000000, 1000000),
...                   chunks=(1000, 1000))
>>> # compute() materializes the whole array, so it is only
>>> # feasible when the full array fits in memory
>>> da_ones.compute()
array([[ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.],
       ...,
       [ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.],
       [ 1., 1., 1., ..., 1., 1., 1.]])
>>> da_y = da.log(da_ones + 1)[:5].sum(axis=1)
>>> # If the result fits in memory, convert it to a NumPy array
>>> np_da_y = np.array(da_y)
>>> np_da_y
array([ 693.14718056, 693.14718056, 693.14718056,
        693.14718056, …, 693.14718056])
>>> # If the result does not fit in memory, write it to disk instead
>>> da_y.to_hdf5('myfile.hdf5', 'result')
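
The `chunks` argument above splits a large array into blocks that can be processed independently. The following NumPy-only sketch illustrates the idea of a blocked computation on a small array. It is not how Dask is implemented, and the helper name `blocked_log_sum` is hypothetical.

```python
import numpy as np

def blocked_log_sum(n_rows, n_cols, chunk_cols):
    """Compute log(ones + 1).sum(axis=1) one column block at a time.

    Only one (n_rows, chunk_cols) block is held in memory at once.
    """
    total = np.zeros(n_rows)
    for start in range(0, n_cols, chunk_cols):
        width = min(chunk_cols, n_cols - start)
        block = np.ones((n_rows, width))   # stand-in for loading one chunk
        total += np.log(block + 1).sum(axis=1)
    return total

blocked = blocked_log_sum(n_rows=5, n_cols=1000, chunk_cols=250)
direct = np.log(np.ones((5, 1000)) + 1).sum(axis=1)
assert np.allclose(blocked, direct)
```

Because each block contributes an independent partial sum, the blocks could also be handed to separate threads or machines.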

Overview of Dask
Dask is a Python parallel computing library that is:
• Familiar: Implements parallel NumPy and Pandas objects
• Fast: Optimized for demanding numerical applications
• Flexible: for sophisticated and messy algorithms
• Scales out: Runs resiliently on clusters of 100s of machines
• Scales down: Pragmatic in a single process on a laptop
• Interactive: Responsive and fast for interactive data science
Dask complements the rest of Anaconda. It was developed with 
NumPy, Pandas, and scikit-learn developers.

Spectrum of Parallelization
[Figure: a spectrum from explicit control (fast but hard) to implicit control (restrictive but easy)]
Explicit control: Threads, Processes, MPI, ZeroMQ
In between: Dask
Implicit control: Hadoop, Spark, SQL (Hive, Pig, Impala)

Dask: From User Interaction to Execution

Dask Collections: Familiar Expressions and API
Dask array (mimics NumPy): x.T - x.mean(axis=0)
Dask dataframe (mimics Pandas): df.groupby(df.index).value.mean()
Dask bag (collection of data): b.map(json.loads).foldby(...)
Dask imperative (wraps custom code): user-defined functions such as load(filename), clean(data), analyze(result)

Dask Graphs: Example Machine Learning Pipeline

Dask Graphs: Example Machine Learning Pipeline + Grid Search

• simple: easy to use API
• flexible: perform many actions with a minimal amount of code
• fast: dispatching to run-time engines & cython
• database like: familiar ops
• interop: integration with the PyData Stack
(((A + 1) * 2) ** 3)
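
An expression like the one above can be written as a task graph: a dictionary whose values are either plain data or tasks of the form `(callable, *arguments)`, which is the style of graph Dask uses. The evaluator below is a toy sketch using only the standard library. `toy_get` and the key names are hypothetical, `A` is a scalar here for simplicity, and this is not Dask's scheduler.

```python
from operator import add, mul, pow

# Keys map to plain values or to tasks: (callable, *arguments).
graph = {
    "A": 1,
    "add": (add, "A", 1),     # A + 1
    "mul": (mul, "add", 2),   # (A + 1) * 2
    "pow": (pow, "mul", 3),   # ((A + 1) * 2) ** 3
}

def toy_get(graph, key):
    """Evaluate `key` by recursively evaluating the tasks it depends on."""
    value = graph[key]
    if isinstance(value, tuple) and callable(value[0]):
        func, *args = value
        resolved = [
            toy_get(graph, arg) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        return func(*resolved)
    return value

result = toy_get(graph, "pow")
print(result)
```

A real scheduler walks the same kind of structure but runs independent tasks in parallel and caches shared intermediate results.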

(B - B.mean(axis=0)) + (B.T / B.std())

Dask Schedulers: Example - Distributed Scheduler
[Figure: a client on the user machine (laptop) connects to a scheduler, which coordinates several workers on the same network.]

Cluster Architecture Diagram
[Figure: a head node and several compute nodes.]

Using Anaconda and Dask on your Cluster
• Single machine with multiple threads or processes
• On a cluster with SSH (dcluster)
• Resource management: YARN (knit), SGE, Slurm
• On the cloud with Amazon EC2 (dec2)
• On a cluster with Anaconda for cluster management
• Manage multiple conda environments and packages
on bare-metal or cloud-based clusters

Scheduler Visualization with Bokeh

Examples
1. Analyzing NYC Taxi CSV data using distributed Dask DataFrames
• Demonstrate Pandas at scale
• Observe responsive user interface
2. Distributed language processing with text data using Dask Bags
• Explore data using a distributed memory cluster
• Interactively query data using libraries from Anaconda
3. Analyzing global temperature data using Dask Arrays
• Visualize complex algorithms
• Learn about dask collections and tasks
4. Handle custom code and workflows using Dask Imperative
• Deal with messy situations
• Learn about scheduling

Example 1: Using Dask DataFrames on a cluster with CSV data
• Built from Pandas DataFrames
• Match Pandas interface
• Access data from HDFS, S3, local, etc.
• Fast, low latency
• Responsive user interface
[Figure: one Pandas DataFrame per month (January, February, March, April, May 2016); together they form one Dask DataFrame.]

Example 2: Using Dask Bags on a cluster with text data
• Distributed natural language processing
with text data stored in HDFS
• Handles standard computations
• Looks like other parallel frameworks
(Spark, Hive, etc.)
• Access data from HDFS, S3, local, etc.
• Handles the common case
[Figure: a function is applied to each block of data, and the partial results are merged into a single result.]

Example 3: Using Dask Arrays with global temperature data
• Built from NumPy
n-dimensional arrays
• Matches NumPy interface
(subset)
• Solve medium-large
problems
• Complex algorithms
[Figure: many NumPy arrays that together form one Dask array.]

Example 4: Using Dask Delayed to handle custom workflows
• Manually handle functions to support messy situations
• Life saver when collections aren't flexible enough
• Combine futures with collections for best of both worlds
• Scheduler provides resilient and elastic execution

What happened to Blaze?
Still going strong — just at another place in the stack
(and sometimes a description of an ecosystem).

Expressions: + - / * ^ [], join, groupby, filter, map, sort, take, where, topk
Metadata: datashape, dtype, shape, stride; hdf5, json, csv, xls, protobuf, avro, ...
Runtime: NumPy, Pandas, R, Julia, K, SQL, Spark, Mongo, Cassandra, ...

APIs, syntax, language
[Figure: how the projects fit together]
Expressions: blaze
Data (metadata, storage/containers): datashape, odo
Runtime (compute): dask (parallelize), numba (optimize, JIT)

Blaze
Interface to query data on different storage systems http://blaze.pydata.org/en/latest/
from blaze import Data
iris = Data('iris.csv')                          # CSV
iris = Data('sqlite:///flowers.db::iris')        # SQL
iris = Data('mongodb://localhost/mydb::iris')    # MongoDB
iris = Data('iris.json')                         # JSON
iris = Data('s3://blaze-data/iris.csv')          # S3
…
Current focus is SQL and the pydata stack for run-time (dask, dynd, numpy, pandas, xray, etc.) + customer-driven needs (e.g. kdb, mongo, PostgreSQL).

Blaze
Select columns: iris[['sepal_length', 'species']]
Operate: log(iris.sepal_length * 10)
Reduce: iris.sepal_length.mean()
Split-apply-combine: by(iris.species, shortest=iris.petal_length.min(),
                        longest=iris.petal_length.max(),
                        average=iris.petal_length.mean())
Add new columns: transform(iris, sepal_ratio=iris.sepal_length / iris.sepal_width,
                           petal_ratio=iris.petal_length / iris.petal_width)
Text matching: iris.like(species='*versicolor')
Relabel columns: iris.relabel(petal_length='PETAL-LENGTH',
                              petal_width='PETAL-WIDTH')
Filter: iris[(iris.species == 'Iris-setosa') & (iris.sepal_length > 5.0)]

Blaze (like DyND) uses datashape as its type system
>>> iris = Data('iris.json')
>>> iris.dshape
dshape("""var * {
petal_length: float64,
petal_width: float64,
sepal_length: float64,
sepal_width: float64,
species: string
}""")

Datashape
A structured data description language for all kinds of data
http://datashape.pydata.org/
Unit types:
  dimension: var, 3, 4
  dtype: string, int32, float64
A datashape joins dimensions and a dtype with "*".
Record (ordered struct dtype, a collection of types keyed by labels):
  { x : int32, y : string, z : float64 }
Tabular datashape:
  var * { x : int32, y : string, z : float64 }

Datashape
{
flowersdb: {
iris: var * {
species: string
}
},
iriscsv: var * {
sepal_length: ?float64,
sepal_width: ?float64,
petal_length: ?float64,
petal_width: ?float64,
species: ?string
},
irisjson: var * {
species: string
},
irismongo: 150 * {
species: string
}
}
# Arrays of Structures
100 * {
name: string,
birthday: date,
address: {
street: string,
city: string,
postalcode: string,
country: string
}
}
# Structure of Arrays
{
x: 100 * 100 * float32,
y: 100 * 100 * float32,
u: 100 * 100 * float32,
v: 100 * 100 * float32,
}
# Function prototype
(3 * int32, float64) -> 3 * float64
# Function prototype with broadcasting dimensions
(A... * int32, A... * int32) -> A... * int32
# Arrays
3 * 4 * int32
3 * 4 * int32
10 * var * float64
3 * complex[float64]
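
For readers more familiar with NumPy, the record and array-of-structures / structure-of-arrays datashapes above have rough NumPy analogues. The sketch below is only an analogy under stated assumptions: NumPy needs a fixed string width (the `U16` here is an arbitrary choice), and it has no direct counterpart to `var` dimensions.

```python
import numpy as np

# Record dtype, roughly analogous to { x : int32, y : string, z : float64 }.
# The fixed string width 'U16' is an assumption; datashape strings are variable.
record = np.dtype([("x", np.int32), ("y", "U16"), ("z", np.float64)])

# Array of structures, roughly: 4 * { x : int32, y : string, z : float64 }
aos = np.zeros(4, dtype=record)
aos["x"] = [1, 2, 3, 4]
aos["y"] = ["a", "b", "c", "d"]
aos["z"] = [0.5, 1.5, 2.5, 3.5]

# Structure of arrays, roughly: { x : 4 * int32, y : 4 * string, z : 4 * float64 }
soa = {name: np.ascontiguousarray(aos[name]) for name in record.names}

print(aos.dtype)
print({name: column.dtype for name, column in soa.items()})
```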

Blaze Server: Lights up your Dark Data
Builds off of the Blaze uniform interface to
host data remotely through a JSON web
API.
server.yaml:
iriscsv:
  source: iris.csv
irisdb:
  source: sqlite:///flowers.db::iris
irisjson:
  source: iris.json
  dshape: "var * {name: string, amount: float64}"
irismongo:
  source: mongodb://localhost/mydb::iris
$ blaze-server server.yaml -e
localhost:6363/compute.json

Blaze Client
>>> from blaze import Data
>>> t = Data('blaze://localhost:6363')
>>> t.fields
[u'iriscsv', u'irisdb', u'irisjson', u'irismongo']
>>> t.iriscsv
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
>>> t.irisdb
   petal_length  petal_width  sepal_length  sepal_width      species
0           1.4          0.2           5.1          3.5  Iris-setosa
1           1.4          0.2           4.9          3.0  Iris-setosa
2           1.3          0.2           4.7          3.2  Iris-setosa

Compute recipes work with existing libraries and have multiple
backends — write once and run anywhere. Layer over existing data!
• Python list
• Numpy arrays
• Dynd
• Pandas DataFrame
• Spark, Impala, Ibis
• Mongo
• Dask
• Hint: Use odo to copy from
one format and/or engine to
another!

Classic Example
from numba import jit

@jit
def mandel(x, y, max_iters):
    c = complex(x, y)
    z = 0j
    for i in range(max_iters):
        z = z*z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters
    return 255
Mandelbrot
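
A usage sketch for the kernel above, evaluating it over a small grid. The `create_fractal` helper, the grid bounds, and the image size are illustrative assumptions. The `try`/`except` fallback is also an addition so that the sketch still runs (slowly) where Numba is not installed.

```python
import numpy as np

try:
    from numba import jit
except ImportError:          # fall back to plain Python if Numba is absent
    def jit(func):
        return func

@jit
def mandel(x, y, max_iters):
    """Return a 0-255 colour value for the point x + yj."""
    c = complex(x, y)
    z = 0j
    for i in range(max_iters):
        z = z * z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters
    return 255

def create_fractal(min_x, max_x, min_y, max_y, width, height, max_iters):
    """Evaluate mandel() on a width x height grid (hypothetical helper)."""
    image = np.zeros((height, width), dtype=np.uint8)
    xs = np.linspace(min_x, max_x, width)
    ys = np.linspace(min_y, max_y, height)
    for row in range(height):
        for col in range(width):
            image[row, col] = mandel(xs[col], ys[row], max_iters)
    return image

image = create_fractal(-2.0, 1.0, -1.0, 1.0, width=60, height=40, max_iters=20)
print(image.shape)
```

The outer loops could be compiled as well; they are left in plain Python here to keep the sketch close to the kernel shown on the slide.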

The Basics
CPython 1x
Numpy array-wide operations 13x
Numba (CPU) 120x
Numba (NVidia Tesla K20c) 2100x
Mandelbrot

How Numba works
@jit
def do_math(a,b):
    …
>>> do_math(x, y)
[Figure: Numba compilation pipeline. Python Function and Function Arguments → Bytecode Analysis → Type Inference → Numba IR → Rewrite IR → Lowering → LLVM IR → LLVM JIT → Machine Code → Cache → Execute!]

Numba Features
Does not replace the standard Python interpreter 
(all of your existing Python libraries are still available)

How to Use Numba
1. Create a realistic benchmark test case. 
(Do not use your unit tests as a benchmark!)
2. Run a profiler on your benchmark. 
(cProfile is a good choice)
3. Identify hotspots that could potentially be compiled by Numba with a little
refactoring. 
(see online documentation)
4. Apply @numba.jit, @numba.vectorize, and @numba.guvectorize as needed to
critical functions. (Small rewrites may be needed to work around Numba
limitations.)
5. Re-run benchmark to check if there was a performance improvement.
6. Use target='parallel' with @vectorize or @guvectorize to get access to multiple cores (or target='cuda' if you have a GPU)
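
Steps 1 and 2 above can be sketched with the standard library alone. The `slow_kernel` and `benchmark` functions below are hypothetical stand-ins for a real workload; the point is only to show how cProfile exposes a hotspot that might then be a candidate for `@numba.jit`.

```python
import cProfile
import io
import pstats

def slow_kernel(n):
    """Hypothetical hotspot: a pure-Python numeric loop."""
    total = 0.0
    for i in range(n):
        total += (i % 7) * 0.5
    return total

def benchmark():
    """Hypothetical benchmark that exercises the hotspot repeatedly."""
    return sum(slow_kernel(20_000) for _ in range(20))

profiler = cProfile.Profile()
profiler.enable()
benchmark_result = benchmark()
profiler.disable()

# Sort by cumulative time so the most expensive call paths appear first.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

In the report, a function that dominates cumulative time and consists mostly of numeric loops is the kind of hotspot step 3 refers to.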

Example: Filter an array
[Figure: code listing with the following annotations]
• Numba decorator (nopython=True not required)
• Array allocation
• Looping over ndarray x as an iterator
• Using numpy math functions
• Returning a slice of the array
2.7x speedup over NumPy!
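
The code listing itself is not reproduced here. The following is a sketch of a filter that is consistent with the annotations above (array allocation, iterating over `x`, a NumPy math function, returning a slice). The function name `nan_compact` and its body are assumptions, not necessarily the code from the slide, and the fallback decorator is added so the sketch runs without Numba.

```python
import numpy as np

try:
    from numba import jit
except ImportError:          # fall back to plain Python if Numba is absent
    def jit(func):
        return func

@jit
def nan_compact(x):
    """Return the non-NaN elements of the 1-D array x, in order."""
    out = np.empty_like(x)            # array allocation
    out_index = 0
    for element in x:                 # loop over the ndarray as an iterator
        if not np.isnan(element):     # NumPy math function
            out[out_index] = element
            out_index += 1
    return out[:out_index]            # return a slice of the array

values = np.array([1.0, np.nan, 2.0, np.nan, 3.0])
compacted = nan_compact(values)
print(compacted)
```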

NumPy UFuncs and GUFuncs
NumPy ufuncs (and gufuncs) are functions that operate “element-
wise” (or “sub-dimension-wise”) across an array without an explicit loop.
This implicit loop (which is in machine code) is at the core of why NumPy
is fast. Dispatch is done internally to a particular code-segment based
on the type of the array. It is a very powerful abstraction in the PyData
stack.
Making new fast ufuncs used to be only possible in C — painful!
With numba.vectorize and numba.guvectorize it is now easy!
The inner secrets of NumPy are now at your finger-tips for you to make
your own magic!

Simple Ufunc
import numpy as np
from numba import vectorize

@vectorize
def dot2(a, b, x, y):
    return a*x + b*y
>>> a, b, x, y = np.random.randn(4,1000)
>>> z = a * x + b * y
>>> z2 = dot2(a, b, x, y) # faster
Faster (especially) as N grows because it does not create temporaries.
NumPy creates temporary arrays for intermediate results.
Numba creates a fast machine-code kernel from the Python template
and calls it for every element in the arrays.

Generalized Ufunc
import numpy as np
from numba import guvectorize

@guvectorize(['void(f8[:], f8[:], f8[:])'], '(n),(n)->()')
def dot2(a, b, c):
    c[0] = a[0]*b[0] + a[1]*b[1]
>>> a, b = np.random.randn(10000,2), np.random.randn(10000,2)
>>> z1 = np.einsum('ij,ij->i', a, b)
>>> z2 = dot2(a, b) # each kernel call consumes the last dimension
This can create quite a bit of computation with very little code.
Numba creates a fast machine-code kernel from the Python template
and calls it for every sub-array along the last dimension.
3.8x faster

Example: Making a windowed compute filter
Perform a computation
on a finite window of the
input.
For a linear system, this is a
FIR filter, which np.convolve
or scipy.signal.lfilter can compute.
But what if you want some
arbitrary computation, like a
windowed median filter?
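
A minimal sketch of such a windowed median, written as an explicit loop that Numba can compile. The function name, the "valid" windowing convention (no padding at the edges), and the fallback decorator are assumptions made for illustration.

```python
import numpy as np

try:
    from numba import jit
except ImportError:          # fall back to plain Python if Numba is absent
    def jit(func):
        return func

@jit
def windowed_median(x, width):
    """Median of each length-`width` window of the 1-D array x (no padding)."""
    n_out = x.shape[0] - width + 1
    out = np.empty(n_out, dtype=np.float64)
    for i in range(n_out):
        out[i] = np.median(x[i:i + width])
    return out

signal = np.array([1.0, 9.0, 2.0, 3.0, 8.0, 4.0])
filtered = windowed_median(signal, 3)
print(filtered)
```

The same loop structure works for any per-window computation: replace `np.median` with the kernel of interest.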

Hand-coded implementation
Build a ufunc for the kernel
which is faster for large arrays!
This can now run easily on GPU
with ‘target=cuda’ and on many
cores with ‘target=parallel’
Array-oriented!


Other interesting things
• CUDA Simulator to debug your code in Python interpreter
• Generalized ufuncs (@guvectorize) including GPU support and
multi-core (threaded) support
• Call ctypes and cffi functions directly and pass them as arguments
• Support for types that understand the buffer protocol
• Pickle Numba functions to run on remote execution engines
• “numba annotate” to dump HTML annotated version of compiled
code
• See: http://numba.pydata.org/numba-doc

Recently Added Numba Features
• Support for named tuples in nopython mode
• Support for sets in nopython mode
• Limited support for lists in nopython mode
• On-disk caching of compiled functions (opt-in)
• JIT classes (zero-cost abstraction)
• Support of np.dot (and ‘@‘ operator on Python 3.5)
• Support for some of np.linalg
• generated_jit (jit the functions that are the return values of the decorated function —
meta-programming)
• SmartArrays which can exist on host and GPU (transparent data access).
• Ahead of Time Compilation (you can ship code pre-compiled with Numba)
• Disk-caching of pre-compiled code so you don’t compile next time you run on
machine with access to disk.
• @cfunc decorator which makes machine-code call-backs for things like
scipy.integrate and other extension modules which expect call-backs.

Data Fabric (pre-alpha, demo-ware)
• New initiative just starting (if you can fund it, please tell me).
• Generalization of Python buffer-protocol to all languages
• A library to build a catalog of shared-memory chunks across the
cluster holding data that can be described by data-shape
• A helper type library in DyND that any language can use to
select appropriate analytics for
• Foundation for a cross-language “super-set” of PyData stack
https://github.com/blaze/datafabric
