Numerical Recipes in Hadoop
Jake Mannix
linkedin/in/jakemannix | twitter/pbrane
jake.mannix@gmail.com | jmannix@apache.org
Principal SDE, LinkedIn
Committer: Apache Mahout, Zoie, Bobo-Browse, Decomposer
Author, Lucene in Depth (Manning, MM/DD/2010)
A Mathematician’s Apology
What mathematical structure describes all of these?
- Full-text search: score documents matching “query string”
- Collaborative filtering recommendation: users who liked {those} also liked {these}
- (Social/web)-graph proximity: people/pages “close” to {this} are {these}
Matrix Multiplication!
Full-text Search
- Vector Space Model of IR
- Corpus as term-document matrix
- Query as bag-of-words vector
- Full-text search is just a matrix-vector product: score the query vector against every document row
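The slide above can be sketched in a few lines of numpy. This is a toy illustration (hypothetical 3-document, 4-term corpus; real engines use sparse matrices and tf-idf weights), not Mahout's implementation:

```python
import numpy as np

# Tiny term-document matrix A: rows = documents, columns = terms.
A = np.array([
    [1, 1, 0, 0],   # doc0: "hadoop mahout"
    [0, 1, 1, 0],   # doc1: "mahout lucene"
    [0, 0, 1, 1],   # doc2: "lucene solr"
], dtype=float)

# Query "mahout lucene" as a bag-of-words vector over the same 4 terms.
q = np.array([0, 1, 1, 0], dtype=float)

# Full-text scoring is just the matrix-vector product A @ q.
scores = A @ q
# doc1 matches both query terms, so it scores highest.
```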
Collaborative Filtering
- User preference matrix A (and item-item similarity matrix AᵀA)
- Input user as vector of preferences v (simple)
- Item-based CF recommendations are then: (AᵀA)·v
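A minimal numpy sketch of the item-based CF formula, with a made-up 3-user, 4-item preference matrix (the similarity matrix here is raw co-occurrence, un-normalized, for illustration only):

```python
import numpy as np

# Hypothetical user-item preference matrix A: rows = users, columns = items.
A = np.array([
    [5, 4, 0, 0],
    [4, 5, 3, 0],
    [0, 0, 4, 5],
], dtype=float)

# Item-item similarity (co-occurrence) matrix: S = Aᵀ A.
S = A.T @ A

# A new user who only rated item0:
v = np.array([5, 0, 0, 0], dtype=float)

# Item-based recommendations: r = (Aᵀ A) · v.
r = S @ v
# Items co-preferred with item0 (item1, then item2) score highest.
```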
Graph Proximity
- Adjacency matrix: A
- 2nd-degree adjacency matrix: A²
- Input all of a user’s “friends” or page links as a vector v
- A (weighted) distance measure of 1st–3rd degree connections is then a weighted sum of A·v, A²·v, and A³·v
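The same idea in numpy, on a hypothetical 4-person friendship graph. The decay weights (0.5, 0.25) are illustrative assumptions, not from the slides:

```python
import numpy as np

# Hypothetical undirected friendship graph on 4 people, as an adjacency matrix.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
], dtype=float)

# A² counts length-2 paths; A³ counts length-3 paths.
A2 = A @ A

# Weighted 1st-3rd degree proximity (decay weights are illustrative):
P = A + 0.5 * A2 + 0.25 * (A @ A2)

# Proximity of everyone to person 0:
v = np.array([1, 0, 0, 0], dtype=float)
scores = P @ v
```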
Dictionary: Applications ↔ Linear Algebra
How does this help?
In Search:
- Latent Semantic Indexing (LSI)
- probabilistic LSI
- Latent Dirichlet Allocation
In Recommenders:
- Singular Value Decomposition
- Layered Restricted Boltzmann Machines (Deep Belief Networks)
In Graphs:
- PageRank
- Spectral Decomposition / Spectral Clustering
Often use “Dimensional Reduction”
- To alleviate the sparse Big Data problem of “the curse of dimensionality”
- Used to improve recall and relevance in general: smooth the metric on your data set
New applications with Matrices
- If Search is finding doc-vectors by scoring them against a query
- and users query with their data represented as a matrix: Q
- giving implicit feedback based on click-through per session, collected as a matrix: C
… continued
- Then the product of C and Q has the form (docs-by-terms) needed for search!
- Approach has been used by Ted Dunning at Veoh (and probably others)
Linear Algebra performance tricks
Naïve item-based recommendations:
- Calculate the item similarity matrix: AᵀA
- Calculate item recs: (AᵀA)·v
Express in one step; in matrix notation: Aᵀ(A·v)
Re-writing: a_v is the vector of preferences for user “v”, a_i is the vector of preferences of item “i”; the result is the matrix sum of the outer (tensor) products of these vectors, scaled by the entry they intersect at.
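The trick above can be checked numerically: Aᵀ(A·v) gives the same answer as materializing AᵀA first, and AᵀA itself equals the sum of outer products of the user rows. A sketch on random data (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 20))   # hypothetical users-by-items preference matrix
v = rng.random(20)         # one user's preference vector

# Naïve: materialize the item-item matrix S = AᵀA, then multiply.
naive = (A.T @ A) @ v

# One-step: Aᵀ(A·v) -- two matrix-vector products, never builds AᵀA.
one_step = A.T @ (A @ v)

# AᵀA as a sum of outer (tensor) products of user preference rows.
S = sum(np.outer(row, row) for row in A)
outer_sum = S @ v
```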
Item Recommender via Hadoop
Apache Mahout
- Apache Mahout currently on release 0.3: http://lucene.apache.org/mahout
- Will be a “Top Level Project” soon (before 0.4): ( http://mahout.apache.org )
- “Scalable Machine Learning with commercially friendly licensing”
Mahout Features
- Recommenders: absorbed the Taste project
- Classification (Naïve Bayes, C-Bayes, more)
- Clustering (Canopy, fuzzy-k-means, Dirichlet, etc.)
- Fast non-distributed linear mathematics: absorbed the classic CERN Colt project
- Distributed matrices and decomposition: absorbed the Decomposer project
- mahout shell script, analogous to $HADOOP_HOME/bin/hadoop:
  $MAHOUT_HOME/bin/mahout kmeans -i "in" -o "out" -k 100
  $MAHOUT_HOME/bin/mahout svd -i "in" -o "out" -k 300
  etc.
- Taste web-app for real-time recommendations
DistributedRowMatrix
- Wrapper around a SequenceFile<IntWritable,VectorWritable>
- Distributed methods like:
  Matrix transpose();
  Matrix times(Matrix other);
  Vector times(Vector v);
  Vector timesSquared(Vector v);
- To get SVD, pass into DistributedLanczosSolver:
  LanczosSolver.solve(Matrix input, Matrix eigenVectors, List<Double> eigenValues, int rank);
Questions?
Contact:
jake.mannix@gmail.com
jmannix@apache.org
http://twitter.com/pbrane
http://www.decomposer.org/blog
http://www.linkedin.com/in/jakemannix
Appendix
There are lots of ways to deal with sparse Big Data, and many (not all) need to deal with the dimensionality of the feature space growing beyond reasonable limits; techniques for this depend heavily on your data. That said, there are some general techniques.
Dealing with the Curse of Dimensionality
- Sparseness means fast, but overlap is too small
- Can we reduce the dimensionality (from “all possible text tokens” or “all userIds”) while keeping the nice aspects of the search problem?
- If possible, collapse “similar” vectors (synonymous terms, userIds with high overlap, etc.) towards each other while keeping “dissimilar” vectors far apart
Solution A: Matrix decomposition
- Singular Value Decomposition (truncated): the “best” low-rank approximation to your matrix
- Used in Latent Semantic Indexing (LSI)
- For graphs: spectral decomposition
- Collaborative filtering (Netflix leaderboard)
Issues:
- very computation intensive
- no parallelized open-source packages (but see Apache Mahout)
- makes things too dense
SVD: continued
- Hadoop impl. in Mahout (Lanczos)
- O(N·d·k) for rank-k SVD on N docs with d elements each
- Density can be dealt with by doing Canopy Clustering offline
- But only extracting linear feature mixes
- Also, still very computation- and I/O-intensive (k passes over the data set); are there better dimensional reduction methods?
Solution B: Stochastic Decomposition
co-occurrence-based kernel + online Random Projection + SVD
Co-occurrence-based kernel
- Extract bigram phrases / pairs of items rated by the same person (using the Log-Likelihood Ratio test to pick the best)
- “Disney on Ice was Amazing!” -> {“disney”, “disney on ice”, “ice”, “was”, “amazing”}
- {item1:4, item2:5, item5:3, item9:1} -> {item1:4, (items1+2):4.5, item2:5, item5:3, …}
- Dim(features) goes from 10^5 to 10^8+ (yikes!)
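The expansion step can be sketched as follows. This toy version keeps every bigram; the slide's actual kernel keeps only the pairs that pass a log-likelihood ratio test:

```python
# Toy sketch of the kernel expansion: augment a token list with all bigrams.
# (The real pipeline filters these with a log-likelihood ratio test.)
def expand_with_bigrams(tokens):
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

feats = expand_with_bigrams(["disney", "on", "ice", "was", "amazing"])
# 5 unigrams + 4 bigrams = 9 features; at corpus scale, the feature
# dimensionality explodes from ~10^5 tokens to 10^8+ token pairs.
```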
Online Random Projection
- Randomly project kernelized text vectors down to “merely” 10^3 dimensions with a Gaussian matrix
- Or project each nGram down to a random (but sparse) 10^3-dim vector:
  V = {123876244 => 1.3}   (tf-IDF of “disney”)
  V’ = c*{h(i) => 1, h(h(i)) => 1, h(h(h(i))) => 1}   (c = 1.3 / sqrt(3))
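A sketch of the sparse hashed projection. The slide chains one hash function (h(i), h(h(i)), …); this version uses seeded hashes instead, which is a variation, and the helper names (`h`, `project`) are hypothetical:

```python
import hashlib

D = 1000  # target dimensionality (the slide's "merely" 10^3)

def h(x, seed):
    """Stable hash of a feature id into [0, D). Hypothetical helper."""
    return int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16) % D

def project(feature_id, weight, probes=3):
    """Spread one weighted feature over `probes` hashed coordinates,
    scaling by weight/sqrt(probes) to roughly preserve its norm."""
    c = weight / probes ** 0.5
    out = {}
    for seed in range(probes):
        idx = h(feature_id, seed)
        out[idx] = out.get(idx, 0.0) + c
    return out

# tf-idf of "disney" (feature id 123876244, weight 1.3) -> sparse 1000-dim vector
v_prime = project(123876244, 1.3)
```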
Outer-product and Sum
- Take the 10^3-dim projected vectors and outer-product them with themselves; each result is a 10^3 x 10^3 matrix
- Sum these in a Combiner
- All results go to a single Reducer, where you compute…
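The map/combine/reduce flow above can be simulated in-process (d = 4 here instead of the slide's 10^3; the "combiner" and "reducer" are plain functions standing in for Hadoop's):

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
vectors = [rng.random(d) for _ in range(10)]  # projected vectors from "mappers"

def combine(partials):
    """Combiner: sum partial d x d matrices node-locally."""
    return sum(partials)

partial_a = combine(np.outer(v, v) for v in vectors[:5])   # combiner on node A
partial_b = combine(np.outer(v, v) for v in vectors[5:])   # combiner on node B
total = partial_a + partial_b                              # the single reducer

# Sanity: the sum of outer products equals the Gram matrix XᵀX.
X = np.stack(vectors)
```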
SVD
- SVD them quickly (they fit in memory)
- Over and over again (as new data comes in)
- Use the most recent SVD to project your (already randomly projected) text still further (now encoding “semantic” similarity)
- SVD-projected vectors can be assigned immediately to nearest clusters if desired
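Once the summed matrix is only 10^3 x 10^3, its SVD fits in memory and is cheap to recompute. A numpy sketch (matrix size 100 and rank 10 are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.random((100, 100))
M = M @ M.T                      # symmetric PSD, like a summed outer-product matrix

# In-memory SVD of the small co-occurrence matrix.
U, s, Vt = np.linalg.svd(M)
k = 10
U_k = U[:, :k]                   # top-k left singular vectors

# Project an (already randomly projected) vector into the k-dim "semantic" space.
x = rng.random(100)
x_semantic = U_k.T @ x
```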
References
- Randomized matrix decomposition review: http://arxiv.org/abs/0909.4061
- Sparse hashing/projection: John Langford et al., “Vowpal Wabbit”, http://hunch.net/~vw/

Seattle Scalability Mahout


Editor's Notes
- #28: And the usual references for LSI and Spectral Decomposition