Rapid Productionalization of Predictive Models 
In-database Modeling with Revolution Analytics on Teradata 
Skylar Lyon 
Accenture Analytics
Introduction 
Skylar Lyon 
Accenture Analytics 
• 7 years of experience with a focus on big data 
and predictive analytics, using discrete choice 
modeling, random forest classification, 
ensemble modeling, and clustering 
• Technology experience includes: Hadoop, 
Accumulo, PostgreSQL, qGIS, JBoss, Tomcat, 
R, GeoMesa, and more 
• Worked from Army installations across the 
nation and twice traveled to Baghdad to deploy 
solutions downrange 
Copyright © 2014 Accenture. All rights reserved. 2
How we got here 
Project background and my involvement 
• New Customer Analytics team for Silicon Valley Internet eCommerce 
giant 
• Data scientists developing predictive models 
• Deferred focus on productionalization 
• Joined as Big Data Infrastructure and Analytics Lead 
Colleague's CRAN R model 
Binomial logistic regression 
• 50+ independent variables, including categorical variables encoded as 
indicators 
• Training from a small sample (many thousands of rows) – not a problem 
in and of itself 
• Scoring across the entire corpus (many hundreds of millions of rows) – 
slightly more challenging 
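For reference, the CRAN R pattern described above (fit a binomial logistic regression on a small sample, then score probabilities with type = 'response') looks roughly like this. The data frame and column names below are invented purely for illustration; the actual model had 50+ variables.

```r
# Hypothetical training data: binary outcome y, one numeric and one
# categorical predictor (the factor stands in for the indicator variables)
set.seed(42)
train <- data.frame(
  x1  = rnorm(5000),
  cat = factor(sample(c("a", "b", "c"), 5000, replace = TRUE)),
  y   = rbinom(5000, 1, 0.3)
)

# glm() expands the factor into indicator (dummy) columns automatically
fit <- glm(y ~ x1 + cat, data = train, family = "binomial")

# Scoring: type = "response" returns predicted probabilities in [0, 1]
scores <- predict(fit, newdata = train, type = "response")
```

At small scale this runs in seconds; the pain point in the deck is applying the same `predict` step to hundreds of millions of rows outside the database.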
We optimized the current productionalization process 
We moved compute to data 
(Diagram: before vs. after data flow) 
Reduced 5+ hour process to 40 seconds 
Benchmarking our optimized process 
5+ hours to 40 seconds: we recommend this become the de facto 
productionalization process 
(Benchmark chart: rows scored vs. minutes elapsed) 
Optimization process 
Recode CRAN R to Rx R 
Before (CRAN R) 
trainit <- glm(as.formula(specs[[i]]), data = training.data, 
               family = 'binomial', maxit = iters) 
fits <- predict(trainit, newdata = test.data, type = 'response') 
After (Revolution R Enterprise) 
trainit <- rxGlm(as.formula(specs[[i]]), data = training.data, 
                 family = 'binomial', maxIterations = iters) 
fits <- rxPredict(trainit, newdata = test.data, type = 'response') 
Additional benefits to new process 
Technology is expanding the data science team's options and 
opportunities 
• Train in-database on a much larger set – reduces the need to sample 
• Nearly "native" R language – decreases deployment time 
• Hadoop support – score in multiple data warehouses 
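As a rough sketch of what "train in-database" can look like: RevoScaleR ships a Teradata compute context, so the same rxGlm call can run against the full table rather than a sampled extract. Everything below (connection string, share directories, table and column names) is a placeholder, and exact arguments vary by RevoScaleR version; treat it as a configuration outline, not a tested recipe.

```r
# Sketch only: requires Revolution R Enterprise with the Teradata connector.
# All connection details, paths, and table/column names are placeholders.
library(RevoScaleR)

connStr <- "DRIVER=Teradata;DBCNAME=tdappliance;UID=analyst;PWD=secret"

cc <- RxInTeradata(
  connectionString = connStr,
  shareDir         = "/tmp/revoShare",   # local scratch space
  remoteShareDir   = "/tmp/revoShare",   # scratch space on the appliance
  wait             = TRUE
)
rxSetComputeContext(cc)  # subsequent rx* calls now execute in-database

# Point at the full table instead of a sampled extract
customers <- RxTeradata(table = "customer_features",
                        connectionString = connStr)

# Train on the entire corpus; no extraction or sampling step required
model <- rxGlm(converted ~ spend + region, data = customers,
               family = 'binomial')
```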
Appendix 
Table of Contents 
• Technical Considerations 
Technical considerations 
Environment setup 
• Teradata environment – 4 node, 1700 series appliance server 
• Revolution R Enterprise – version 7.1, running R 3.0.2 

Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata


Editor's Notes

  • #4 Problem statement
  • #5 Gabi's binomial logistic regression model. Admittedly, it could be recoded to SQL, but that is not so easy with random forest and more powerful ensemble models
  • #6 Lots of data movement; 6+ hour process
  • #8 Show some CRAN R versus Rx R code