End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017

End-to-end Big Data
Projects with Python
Donald Miner
StampedeCon Big Data Conference
July 25th, 2017

dminer@minerkasch.com

Real “Big Data”
isn’t just about
platforms anymore
Streaming
Infrastructure
Cloud
Applications
Mobile

Real “Big Data”
isn’t just about
data processing
anymore
Machine Learning
Data Science
NLP
Deep Learning
Visualization

2017
BIG DATA =
platforms & integrated and user facing
data processing & advanced analytics

Big Data in 2009 was heavily Java-oriented.
It was easier to use Java for everything or
use a collection of random languages.

Python seemed to have everything we
wanted, except for Big Data
Some brave souls tried:
Hadoopy, mrjob, Pig+Python

PySpark
PySpark was the missing piece of the Big Data Python picture
The first major Big Data platform with first-class Python support
Thanks to PySpark, Python is now a viable and competitive option
for end-to-end systems that utilize Big Data

What’s the big deal?
Python has best-in-class functionality for all the other things we want
to do with Big Data:
data manipulation, machine learning, text, applications, visualization
In 2017,
we can build end-to-end Big Data systems entirely in Python:
from ingest to user experience and everything between

The case for Python
Succinct code that’s easy to read

The case for Python
A language people know

The case for Python
Interpreted, not compiled

The Python Big Data Architecture

Distributed Computing
from operator import add
# Read data as lines from a source
lines = spark.read.text(inpath).rdd.map(lambda r: r[0])
# Count the words (the parentheses let the chain span several lines)
counts = (lines.flatMap(lambda x: x.split(' '))
               .map(lambda x: (x, 1))
               .reduceByKey(add))
# Bring it locally
output = counts.collect()
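
For readers without a cluster at hand, the same three stages can be mimicked on a plain Python list. This is only a local sketch of what flatMap, map, and reduceByKey compute, not PySpark itself; the function name local_word_count and the sample lines are made up for illustration.

```python
from collections import defaultdict
from itertools import chain
from operator import add


def local_word_count(lines):
    """Mimic flatMap -> map -> reduceByKey on an in-memory list."""
    # flatMap: split every line and flatten into one stream of words
    words = chain.from_iterable(line.split(' ') for line in lines)
    # map: pair each word with a count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: combine the values of identical keys with add
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] = add(counts[word], n)
    return dict(counts)


sample = ["big data with python", "python for big data"]
print(local_word_count(sample))
```

In Spark, the reduceByKey step is where data moves between workers; here it is just a dictionary update, which is why the local version only makes sense for small inputs.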

Machine Learning
import sklearn.ensemble
# Initialize Random Forest classifier
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=250)
# Train Random Forest classifier
clf = clf.fit(feature_vectors, labels)
Why sklearn over MLlib?

Deep Learning
from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
# Load the ImageNet pretrained network
model = VGG16(weights="imagenet")
# Run the model on an image
preds = model.predict(preprocess_input(image))
# Hotdog or not hotdog?
print(decode_predictions(preds))
& Keras
Also excited about PyTorch!

Visualization
Lots of visualization options in Python
• Seaborn
• Matplotlib
• Bokeh
• ggplot
seaborn.swarmplot(x="measurement", y="value", hue="species", data=iris)

Integration
# http://56.120.177.55/hello?name=Don
import json
from flask import Flask, request
app = Flask(__name__)

@app.route('/hello')
def say_hello():
    name = request.args.get('name')
    return json.dumps({'query': name,
                       'message': 'HELLO ' + name})
# returns {"query": "Don", "message": "HELLO Don"}

Workflows
# Run this every day at 3:45AM
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path
mdag = DAG('DRSpark', description='DailyRun', schedule_interval='45 3 * * *')
sp1 = PythonOperator(task_id='sp1', python_callable=runspark1, dag=mdag)
sp2 = PythonOperator(task_id='sp2', python_callable=runspark2, dag=mdag)
ou = PythonOperator(task_id='clean', python_callable=cleanupresults, dag=mdag)
sp1 >> ou # sp1 happens before ou
sp2 >> ou # sp2 happens before ou, but doesn't depend on sp1
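
The ordering that the two `>>` lines express, and the meaning of the schedule string, can be checked with nothing but the standard library. This is a sketch of the idea only and does not use Airflow; the `upstream` dictionary is a hand-written stand-in for the DAG above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each task to the set of tasks that must finish before it starts.
# "sp1 >> ou" and "sp2 >> ou" both make ou wait; sp1 and sp2 are unrelated.
upstream = {"sp1": set(), "sp2": set(), "ou": {"sp1", "sp2"}}

order = list(TopologicalSorter(upstream).static_order())
print(order)  # ou is always last; sp1 and sp2 may come in either order

# The five cron fields of the schedule string, left to right
names = ["minute", "hour", "day_of_month", "month", "day_of_week"]
fields = dict(zip(names, "45 3 * * *".split()))
print(fields)
```

Because sp1 and sp2 have no edge between them, a scheduler is free to run them in parallel; only the cleanup task has to wait for both.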

# Spark job to build feature vectors
import pickle
import sklearn.ensemble
rows = myrdd.map(lambda r: r[0].split(','))
out = (rows.map(lambda row: (row[0], row))
           .groupByKey().map(build_feature_vector)) # outputs [(FV, label)]
# Bring data down locally and prepare it
localout = out.collect()
X = [ row[0] for row in localout ] # feature is set of 40 aggregate properties
y = [ row[1] for row in localout ] # potential labels are types of devices
# Train a RF classifier on it
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=250)
clf = clf.fit(X, y)
# save the model (maybe to s3 instead?); pickle needs a binary-mode file
with open('/models/behavior.sklearn', 'wb') as f:
    pickle.dump(clf, f)
--- training data sample of netflow ---
SOURCE IP, DEST IP, DATE, STIME, ETIME, DATAIN, DATAOUT
123.41.12.31, 123.41.155.32, 2017-02-01, 09:00, 09:59, 103KB, 959KB
123.41.59.99, 123.41.155.32, 2017-02-01, 09:00, 09:59, 44KB, 884KB
123.41.12.31, 123.41.155.32, 2017-02-01, 10:00, 10:59, 3KB, 9KB
123.41.59.99, 123.41.155.32, 2017-02-01, 10:00, 10:59, 4KB, 15KB
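
The slides never show build_feature_vector, so the following is a hypothetical, much smaller stand-in: it groups the sample rows by source IP and computes four aggregates instead of the 40 mentioned on the slide. The names kb and aggregate_by_source are invented for this sketch, and it assumes every DATAIN/DATAOUT value ends in "KB".

```python
import csv
from collections import defaultdict

SAMPLE = """\
123.41.12.31, 123.41.155.32, 2017-02-01, 09:00, 09:59, 103KB, 959KB
123.41.59.99, 123.41.155.32, 2017-02-01, 09:00, 09:59, 44KB, 884KB
123.41.12.31, 123.41.155.32, 2017-02-01, 10:00, 10:59, 3KB, 9KB
123.41.59.99, 123.41.155.32, 2017-02-01, 10:00, 10:59, 4KB, 15KB
"""


def kb(text):
    """Turn a size such as '103KB' into 103.0 (assumes a 'KB' suffix)."""
    return float(text.strip()[:-2])


def aggregate_by_source(csv_text):
    """Group rows by source IP and compute a few aggregate features."""
    totals = defaultdict(lambda: {"rows": 0, "data_in": 0.0, "data_out": 0.0})
    for row in csv.reader(csv_text.splitlines(), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        src, _dst, _date, _stime, _etime, data_in, data_out = row
        t = totals[src]
        t["rows"] += 1
        t["data_in"] += kb(data_in)
        t["data_out"] += kb(data_out)
    # Per source: [row count, total KB in, total KB out, mean KB out per row]
    return {src: [t["rows"], t["data_in"], t["data_out"], t["data_out"] / t["rows"]]
            for src, t in totals.items()}


features = aggregate_by_source(SAMPLE)
print(features["123.41.12.31"])  # [2, 106.0, 968.0, 484.0]
```

In the Spark job the grouping is done by groupByKey across the cluster, so the per-key function only has to handle the rows of a single source IP.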

# http://56.120.177.55/predictip?ip=159.31.120.44
# http://56.120.177.55/predictfv -- for POST of feature vector
import json, pickle
from flask import Flask, request
app = Flask(__name__)
# Load the model written by the training job (pickles are binary, so 'rb')
_MODEL = pickle.load(open('/models/behavior.sklearn', 'rb'))

@app.route('/predictrepair', methods=['POST'])
def predicttypefrombehavior():
    netflowlog = request.form['logcsv']
    fv = build_feature_vector(netflowlog)  # assumed to return a list of numbers
    pr = _MODEL.predict([fv])[0]           # predict() expects a 2-D input
    return json.dumps({'query': fv,
                       'prediction': str(pr)})
# returns {"query": [9, 4, 123.1, …], "prediction": "HTTP PROXY"}

PYTHON!
Viable option for Big Data Analytics with PySpark
Tie it all together and integrate into the enterprise with the
same language
Leverage the benefits of Python for data analysis
Get projects done faster
