Girish Nathan
Misha Bilenko
Microsoft Azure Machine Learning
How to Work with Large Datasets to Build Predictive Models
Agenda
1. How to Work with Large Datasets
• Sample Dataset: NYC Taxi
• HDInsight (Hadoop on Azure)
• IPython notebook and HDInsight
2. Building Predictive Models
• Azure ML Studio
• Learning with Counts
3. Putting it all together: Learning with Counts and HDInsight
Sample Data: NYC Taxi
• One year log of NYC taxi rides
• 60 GB, publicly available at http://www.andresmh.com/nyctaxitrips/
• Trip (driver id, times, locations) and fare (fare, tip, tolls)
• Rest of tutorial: data wrangling and tip prediction
• Tools: AzCopy, HDInsight, IPython, Azure ML Studio
• 100% Apache Hadoop as an Azure service
• Can deploy on Windows or Linux
• Provides Map-Reduce capability over big data in Azure blobs
• Head node: job and cluster monitoring
• Hive: SQL-like queries as an alternative to writing code
SELECT Col1, COUNT(*) AS Count_Col1 FROM Your_Table
GROUP BY Col1 ORDER BY Count_Col1 DESC LIMIT 10;
HDInsight: Hadoop on Azure
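For readers more at home in Python than SQL, the Hive query above returns the ten most frequent values of a column. A rough in-memory equivalent is sketched below; the `rows` list and its values are made up for illustration and merely stand in for `Your_Table`.

```python
from collections import Counter

# Hypothetical sample rows standing in for Your_Table; only Col1 matters here.
rows = [
    {"Col1": "a"}, {"Col1": "b"}, {"Col1": "a"},
    {"Col1": "b"}, {"Col1": "a"}, {"Col1": "c"},
]

def top_values(rows, column, limit=10):
    """Mimic SELECT column, COUNT(*) ... GROUP BY column ORDER BY count DESC LIMIT limit."""
    counts = Counter(row[column] for row in rows)   # GROUP BY + COUNT(*)
    return counts.most_common(limit)                # ORDER BY ... DESC LIMIT

top = top_values(rows, "Col1")
print(top)
```

This only works when the data fits in memory; the point of Hive on HDInsight is that the same aggregation runs as a Map-Reduce job over data in blob storage.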
• Web-based Python REPL environment
• Combines authoring, execution, visualization
• Can author and execute HDInsight Hive queries
• Sample query (python code snippet)
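    import json
    import urllib.request  # urllib2 in Python 2

    def submit_hive_query(self):
        # POST the Hive parameters to the cluster URL; the reply is JSON.
        response = urllib.request.urlopen(self.url, self.hiveParams)
        data = json.load(response)
        self.hiveJobID = data['id']  # job id returned by the service

    def query(self, queryString):
        # The slide does not show how queryString is placed into self.hiveParams.
        self.submit_hive_query()
Example query string: SELECT * FROM sample_table LIMIT 10;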
IPython Notebook
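The slide does not show how the URL and the Hive parameters are put together. One plausible way, assuming the cluster exposes the WebHCat (Templeton) REST interface, is sketched below. The endpoint path, the form field names (`user.name`, `execute`, `statusdir`), the cluster name, and the helper `build_hive_request` are all assumptions for illustration, not taken from the talk; the request is only built here, not sent.

```python
import urllib.parse
import urllib.request

def build_hive_request(cluster_url, user, query_string, status_dir):
    """Build (but do not send) a POST request carrying a Hive query.

    Field names follow the WebHCat convention as an assumption.
    """
    params = urllib.parse.urlencode({
        "user.name": user,          # account submitting the job
        "execute": query_string,    # the Hive query text
        "statusdir": status_dir,    # where job status/output would be written
    }).encode("utf-8")              # urlopen expects bytes for POST data
    return urllib.request.Request(cluster_url, data=params, method="POST")

# Hypothetical cluster name and paths.
req = build_hive_request(
    "https://example-cluster.azurehdinsight.net/templeton/v1/hive",
    "admin",
    "SELECT * FROM sample_table LIMIT 10;",
    "/example/status",
)
```

Passing `req` to `urllib.request.urlopen` (with credentials configured) would submit the job; the JSON reply is where a job id such as `data['id']` would come from.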
• Fully managed cloud service
• Browser-based authoring of dataflow
• Best-in-class machine learning algorithms
• Support for R/Python/SQL
• Collaborative data science
• Quickly deploy models as web services/REST APIs
• Publish to a gallery for collaboration with the community
What is Azure ML Studio?
(Distributed Robust Algorithm for CoUnt-based LeArning)
Misha Bilenko
Microsoft Azure Machine Learning
Microsoft Research
Learning with Counts, a.k.a. Dracula
adid = 1010054353
adText = K2 ski sale!
adURL= www.k2.com/sale
Userid = 0xb49129827048dd9b
IP = 131.107.65.14
Query = powder skis
QCategories = {skiing, outdoor gear}
$\#\text{users} \sim 10^{9}$, $\#\text{queries} \sim 10^{9+}$, $\#\text{ads} \sim 10^{7}$, $\#(\text{ad} \times \text{query}) \sim 10^{10+}$
• Information retrieval
• Advertising, recommending, search: item, page/query, user
• Transaction classification
• Payment fraud: transaction, product, user
• Email spam: message, sender, recipient
• Intrusion detection: session, system, user
• IoT: device, location
Large-scale learning in multi-entity domains
adid: 1010054353
adText: Fall ski sale!
adURL: www.k2.com/sale
userid: 0xb49129827048dd9b
IP: 131.107.65.14
query: powder skis
qCategories: {skiing, outdoor gear}
• Problem: representing high-cardinality attributes as features
• Scalable: to billions of attribute values
• Efficient: predictions/sec
• Flexible: for a variety of downstream learners
• Adaptive: to distribution change
• Standard approaches: binary features, hashing, projections
• What everyone uses in industry: learning with counts
• This talk: formalization and generalization
Large-scale learning in multi-entity domains
• Features are transforms of conditional statistics (per-label counts): $[\,N^{+},\ N^{-},\ \log N^{+} - \log N^{-},\ \text{IsBackoff}\,]$
• $\log N^{+} - \log N^{-}$: log-odds / Naïve Bayes estimate
• $N^{+}$, $N^{-}$: indicators of confidence of the naïve estimate
• IsBackoff: indicator of back-off vs. “real count”
[Figure: count features looked up for IP 131.107.65.14, ad URL k2.com, query “powder skis”, and the pair (powder skis, k2.com)]
IP               N+        N-
173.194.33.9     46964     993424
87.250.251.11    31        843
131.107.65.14    12        430
…                …         …
REST             745623    13964931
Learning with Counts
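The feature lookup on this slide can be sketched in a few lines of Python. The table values come from the slide; reading the two numeric columns as (N+, N-) is an assumption, and the function name `count_features` is made up for illustration.

```python
import math

# Count table from the slide: IP -> (N+, N-), assuming that column order.
# "REST" pools every IP too infrequent to get its own row.
ip_counts = {
    "173.194.33.9": (46964, 993424),
    "87.250.251.11": (31, 843),
    "131.107.65.14": (12, 430),
    "REST": (745623, 13964931),
}

def count_features(table, key):
    """Return [N+, N-, log(N+) - log(N-), IsBackoff] for one attribute value."""
    is_backoff = key not in table
    n_pos, n_neg = table["REST"] if is_backoff else table[key]
    # Assumes both counts are positive; zero counts would need smoothing.
    log_odds = math.log(n_pos) - math.log(n_neg)
    return [n_pos, n_neg, log_odds, int(is_backoff)]

seen = count_features(ip_counts, "131.107.65.14")   # has its own row
unseen = count_features(ip_counts, "10.0.0.1")      # falls back to REST
```

Whatever the cardinality of the attribute, each value is reduced to the same four numbers, which is what keeps the downstream feature space small.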
• Features are transforms of conditional counts: $[\,N^{+},\ N^{-},\ \log N^{+} - \log N^{-},\ \text{IsBackoff}\,]$
• Scalable: “head” in memory + tail in backoff; or: count-min sketch
• Efficient: low cost, low dimensionality
• Flexible: low dimensionality works well with non-linear learners
• Adaptive: new values easily added, back-off for infrequent values, temporal counts
Learning with Counts
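The slide names the count-min sketch as an alternative to keeping an exact "head" table in memory. A minimal sketch of that data structure follows; the class name, the default width and depth, and the use of salted BLAKE2 hashes are illustrative choices, not details from the talk. One sketch per label would be needed to hold N+ and N- separately.

```python
import hashlib

class CountMinSketch:
    """Approximate counter: estimates never undercount, but may overcount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One independent-looking hash per row, derived by salting with the row id.
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=str(row).encode("utf-8")
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.rows[row][self._index(key, row)] += count

    def estimate(self, key):
        # Collisions only inflate cells, so the minimum is the tightest bound.
        return min(self.rows[row][self._index(key, row)] for row in range(self.depth))

positives = CountMinSketch()
for _ in range(12):
    positives.add("131.107.65.14")
estimate = positives.estimate("131.107.65.14")
```

Memory is fixed at `width * depth` cells no matter how many distinct values arrive, at the price of occasional overestimates for rare values.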
Aggregate for different
• Standard MapReduce
• Bin function: any projection
• Backoff options: “tail bin”, hashing,
hierarchical (shrinkage)
IP               N+        N-
173.194.33.9     46964     993424
87.250.251.11    31        843
131.253.13.32    12        430
…                …         …
REST             745623    13964931

query            N+        N-
facebook         281912    7957321
dozen roses      32791     640964
…                …         …
REST             6321789   43477252

Query × AdId         N+        N-
facebook, ad1        54546     978964
facebook, ad2        232343    8431467
dozen roses, ad3     12973     430982
…                    …         …
REST                 4419312   52754683

[Figure: counting runs along the time axis up to T_now]

IP[2]            N+        N-
173.194.*.*      46964     993424
87.250.*.*       6341      91356
131.253.*.*      75126     430826
…                …         …
Learning with Counts: aggregation
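The aggregation step can be illustrated with a single-machine stand-in for the MapReduce job: map each record to a bin, sum per-label counts per bin, then fold rare bins into a "tail bin". The records, the `min_total` threshold, and the function names below are hypothetical; the two-octet IP projection mirrors the IP[2] table on the slide.

```python
from collections import defaultdict

def ip_prefix2(record):
    """Bin function: project an IP onto its first two octets, e.g. 173.194.*.*"""
    a, b, _, _ = record["ip"].split(".")
    return f"{a}.{b}.*.*"

def aggregate(records, bin_fn, min_total=2):
    """Count positives/negatives per bin, then fold rare bins into REST."""
    counts = defaultdict(lambda: [0, 0])            # bin -> [N+, N-]
    for record in records:                          # "map": emit (bin, label)
        counts[bin_fn(record)][0 if record["label"] else 1] += 1  # "reduce": sum
    table = {"REST": [0, 0]}
    for key, (n_pos, n_neg) in counts.items():
        # Tail-bin backoff: bins seen fewer than min_total times share one row.
        target = key if n_pos + n_neg >= min_total else "REST"
        entry = table.setdefault(target, [0, 0])
        entry[0] += n_pos
        entry[1] += n_neg
    return table

# Made-up click log: label 1 = positive, 0 = negative.
records = [
    {"ip": "173.194.33.9", "label": 1},
    {"ip": "173.194.40.1", "label": 0},
    {"ip": "173.194.33.9", "label": 0},
    {"ip": "87.250.251.11", "label": 0},
]
by_ip = aggregate(records, lambda r: r["ip"])
by_prefix = aggregate(records, ip_prefix2)
```

Swapping the bin function is all it takes to produce a different table, which is why any projection (raw value, prefix, cross of two attributes) fits the same job.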
[Figure: count tables (IP, query, Query × AdId), counted up to T_now, yield aggregated features ($N^{+}$, $N^{-}$, $\ln N^{+} - \ln N^{-}$, IsBackoff, …) that are combined with the original numeric features to train the predictor]
Train non-linear model on count-based features
• Counts, transforms, lookup properties
• Additional features can be injected
Learning with Counts: combiner training
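To make the combiner's input concrete, the sketch below assembles one training row by concatenating count features from several tables with the original numeric features. Table contents are taken from the slides (the Query × AdId REST row is read as 4419312 and 52754683, and columns as N+ then N-, both assumptions); `lookup`, `featurize`, and the example record are illustrative names and data.

```python
import math

def lookup(table, key):
    """[N+, N-, ln N+ - ln N-, IsBackoff], falling back to the REST row."""
    backoff = key not in table
    n_pos, n_neg = table["REST"] if backoff else table[key]
    return [n_pos, n_neg, math.log(n_pos) - math.log(n_neg), int(backoff)]

tables = {
    "query": {
        "facebook": (281912, 7957321),
        "dozen roses": (32791, 640964),
        "REST": (6321789, 43477252),
    },
    "query_x_adid": {
        ("facebook", "ad1"): (54546, 978964),
        ("dozen roses", "ad3"): (12973, 430982),
        "REST": (4419312, 52754683),
    },
}

def featurize(example, numeric):
    """Concatenate per-table count features, then append the numeric features."""
    keys = {
        "query": example["query"],
        "query_x_adid": (example["query"], example["adid"]),
    }
    row = []
    for name in ("query", "query_x_adid"):
        row.extend(lookup(tables[name], keys[name]))
    return row + list(numeric)

# Known query, unseen (query, ad) pair, one hypothetical numeric feature.
row = featurize({"query": "facebook", "adid": "ad9"}, numeric=[0.5])
```

Rows like this, four numbers per table plus the numeric features, are what a non-linear learner such as a boosted tree ensemble would then be trained on.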
[Figure: count tables (IP, query, URL × Country) keep counting up to T_now; aggregated features ($N^{+}$, $N^{-}$, $\ln N^{+} - \ln N^{-}$, IsBackoff, …) plus the original numeric features feed the predictor trained at T_train]
URL × Country    N+        N-
url1, US         54546     978964
url2, CA         232343    8431467
url3, FR         12973     430982
…                …         …
REST             4419312   52754683
• Counts are updated continuously
• Combiner re-training infrequent
Prediction with counts
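The split between continuously updated counts and an infrequently re-trained combiner can be sketched as follows. Everything here is illustrative: the `LiveCounts` class, the add-one prior, and the made-up logistic weights in `frozen_combiner` merely stand in for a real model trained at T_train.

```python
import math
from collections import defaultdict

class LiveCounts:
    """Per-key [N+, N-] counts that keep updating after the model is trained."""

    def __init__(self):
        # Start every key at [1, 1]: an add-one prior (an assumption) so the
        # log-odds is defined before any observation arrives.
        self.counts = defaultdict(lambda: [1, 1])

    def update(self, key, label):
        self.counts[key][0 if label else 1] += 1

    def log_odds(self, key):
        n_pos, n_neg = self.counts[key]
        return math.log(n_pos) - math.log(n_neg)

def frozen_combiner(log_odds):
    """Stand-in for a model trained at T_train; the weights are made up."""
    return 1.0 / (1.0 + math.exp(-(0.8 * log_odds - 0.1)))

live = LiveCounts()
before = frozen_combiner(live.log_odds("url1"))
for label in (1, 1, 1, 0):          # new traffic arriving after T_train
    live.update("url1", label)
after = frozen_combiner(live.log_odds("url1"))
```

The model itself never changes between the two calls; the prediction moves only because the counts feeding it did, which is how the approach tracks distribution change without constant re-training.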
• State-of-the-art accuracy
• Good fit for map-reduce
• Modular (vs. monolithic)
• Learner can be tuned/monitored/replaced in isolation
• Monitorable, debuggable (this is HUGE in practice!)
• Temporal changes easy to monitor
• Easy emergency recovery (remove bot attacks, etc.)
• Decomposable predictions
• Error debugging (which feature can we blame…)
What is great about learning with counts?
Learning with Counts: in Azure ML
• HDInsight: large data storage and map-reduce processing
• Azure ML: cloud ML and analytics accessible anywhere
• Learning with Counts: intuitive, flexible large-scale ML solution
Putting it all together
Thanks for your time
Useful links:
http://azure.microsoft.com/ml - Sign up for your free Azure ML trial
http://bit.ly/datasc_ebook - Free tutorial on how to use Azure ML
Need Azure ML for teaching in the classroom? Contact the speakers
Other questions? Contact the speakers
Speakers:
Misha Bilenko: mbilenko@Microsoft.com
Girish Nathan: ginathan@Microsoft.com
