Rapid Productionalization of Predictive Models 
In-database Modeling with Revolution Analytics on Teradata 
Skylar Lyon 
Accenture Analytics
Introduction 
Skylar Lyon 
Accenture Analytics 
• 7 years of experience with a focus on big data 
and predictive analytics, using discrete choice 
modeling, random forest classification, 
ensemble modeling, and clustering 
• Technology experience includes: Hadoop, 
Accumulo, PostgreSQL, qGIS, JBoss, Tomcat, 
R, GeoMesa, and more 
• Worked from Army installations across the 
nation and twice traveled to Baghdad to deploy 
solutions downrange 
Copyright © 2014 Accenture. All rights reserved. 2
How we got here 
Project background and my involvement 
• New Customer Analytics team for Silicon Valley Internet eCommerce 
giant 
• Data scientists developing predictive models 
• Deferred focus on productionalization 
• Joined as Big Data Infrastructure and Analytics Lead 
Colleague's CRAN R model 
Binomial logistic regression 
• 50+ independent variables, including categorical variables encoded as 
indicators 
• Training from a small sample (many thousands of rows) – not a problem 
in and of itself 
• Scoring across the entire corpus (many hundreds of millions of rows) – 
slightly more challenging 
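For reference, the CRAN R pattern described above (fit a binomial logistic regression on a small sample, then score probabilities with type = 'response') looks roughly like this. The data frame and column names below are invented purely for illustration; the actual model had 50+ variables.

```r
# Hypothetical training data: binary outcome y, one numeric and one
# categorical predictor (the factor stands in for the indicator variables)
set.seed(42)
train <- data.frame(
  x1  = rnorm(5000),
  cat = factor(sample(c("a", "b", "c"), 5000, replace = TRUE)),
  y   = rbinom(5000, 1, 0.3)
)

# glm() expands the factor into indicator (dummy) columns automatically
fit <- glm(y ~ x1 + cat, data = train, family = "binomial")

# Scoring: type = "response" returns predicted probabilities in [0, 1]
scores <- predict(fit, newdata = train, type = "response")
```

At small scale this runs in seconds; the pain point in the deck is applying the same `predict` step to hundreds of millions of rows outside the database.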
We optimized the current productionalization process 
We moved compute to data 
(Diagram: before vs. after data flow) 
Reduced 5+ hour process to 40 seconds 
Benchmarking our optimized process 
5+ hours to 40 seconds: we recommend this become the de facto 
productionalization process 
(Benchmark chart: rows scored vs. minutes elapsed) 
Optimization process 
Recode CRAN R to Rx R 
Before (CRAN R) 
trainit <- glm(as.formula(specs[[i]]), data = training.data, 
               family = 'binomial', maxit = iters) 
fits <- predict(trainit, newdata = test.data, type = 'response') 
After (Revolution R Enterprise) 
trainit <- rxGlm(as.formula(specs[[i]]), data = training.data, 
                 family = 'binomial', maxIterations = iters) 
fits <- rxPredict(trainit, newdata = test.data, type = 'response') 
Additional benefits to new process 
Technology is expanding the data science team's options and 
opportunities 
• Train in-database on a much larger set – reduces the need to sample 
• Nearly "native" R language – decreases deployment time 
• Hadoop support – score in multiple data warehouses 
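As a rough sketch of what "train in-database" can look like: RevoScaleR ships a Teradata compute context, so the same rxGlm call can run against the full table rather than a sampled extract. Everything below (connection string, share directories, table and column names) is a placeholder, and exact arguments vary by RevoScaleR version; treat it as a configuration outline, not a tested recipe.

```r
# Sketch only: requires Revolution R Enterprise with the Teradata connector.
# All connection details, paths, and table/column names are placeholders.
library(RevoScaleR)

connStr <- "DRIVER=Teradata;DBCNAME=tdappliance;UID=analyst;PWD=secret"

cc <- RxInTeradata(
  connectionString = connStr,
  shareDir         = "/tmp/revoShare",   # local scratch space
  remoteShareDir   = "/tmp/revoShare",   # scratch space on the appliance
  wait             = TRUE
)
rxSetComputeContext(cc)  # subsequent rx* calls now execute in-database

# Point at the full table instead of a sampled extract
customers <- RxTeradata(table = "customer_features",
                        connectionString = connStr)

# Train on the entire corpus; no extraction or sampling step required
model <- rxGlm(converted ~ spend + region, data = customers,
               family = 'binomial')
```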
Appendix 
Table of Contents 
• Technical Considerations 
Technical considerations 
Environment setup 
• Teradata environment – 4 node, 1700 series appliance server 
• Revolution R Enterprise – version 7.1, running R 3.0.2 

Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata


Editor's Notes

  • #4 Problem statement
  • #5 Gabi's binomial logistic regression model. Admittedly, it could be recoded to SQL, but that is not so easy with random forest and more powerful ensemble models
  • #6 Lots of data movement; 6+ hour process
  • #8 Show some CRAN R versus Rx R code