Numerical Recipes in Hadoop
Jake Mannix
linkedin/in/jakemannix | twitter/pbrane
jake.mannix@gmail.com | jmannix@apache.org
Principal SDE, LinkedIn
Committer: Apache Mahout, Zoie, Bobo-Browse, Decomposer
Author, Lucene in Depth (Manning, MM/DD/2010)
A Mathematician’s Apology
What mathematical structure describes all of these?
- Full-text search: score documents matching “query string”
- Collaborative filtering recommendation: users who liked {those} also liked {these}
- (Social/web)-graph proximity: people/pages “close” to {this} are {these}
Matrix Multiplication!
Full-text Search
- Vector Space Model of IR
- Corpus as term-document matrix
- Query as bag-of-words vector
- Full-text search is just a matrix-vector product: score the query vector against every document row
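The slide above can be sketched in a few lines of numpy. This is a toy illustration (hypothetical 3-document, 4-term corpus; real engines use sparse matrices and tf-idf weights), not Mahout's implementation:

```python
import numpy as np

# Tiny term-document matrix A: rows = documents, columns = terms.
A = np.array([
    [1, 1, 0, 0],   # doc0: "hadoop mahout"
    [0, 1, 1, 0],   # doc1: "mahout lucene"
    [0, 0, 1, 1],   # doc2: "lucene solr"
], dtype=float)

# Query "mahout lucene" as a bag-of-words vector over the same 4 terms.
q = np.array([0, 1, 1, 0], dtype=float)

# Full-text scoring is just the matrix-vector product A @ q.
scores = A @ q
# doc1 matches both query terms, so it scores highest.
```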
Collaborative Filtering
- User preference matrix A (and item-item similarity matrix AᵀA)
- Input user as vector of preferences v (simple)
- Item-based CF recommendations are then: (AᵀA)·v
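A minimal numpy sketch of the item-based CF formula, with a made-up 3-user, 4-item preference matrix (the similarity matrix here is raw co-occurrence, un-normalized, for illustration only):

```python
import numpy as np

# Hypothetical user-item preference matrix A: rows = users, columns = items.
A = np.array([
    [5, 4, 0, 0],
    [4, 5, 3, 0],
    [0, 0, 4, 5],
], dtype=float)

# Item-item similarity (co-occurrence) matrix: S = Aᵀ A.
S = A.T @ A

# A new user who only rated item0:
v = np.array([5, 0, 0, 0], dtype=float)

# Item-based recommendations: r = (Aᵀ A) · v.
r = S @ v
# Items co-preferred with item0 (item1, then item2) score highest.
```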
Graph Proximity
- Adjacency matrix: A
- 2nd-degree adjacency matrix: A²
- Input all of a user’s “friends” or page links as a vector v
- A (weighted) distance measure of 1st–3rd degree connections is then a weighted sum of A·v, A²·v, and A³·v
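The same idea in numpy, on a hypothetical 4-person friendship graph. The decay weights (0.5, 0.25) are illustrative assumptions, not from the slides:

```python
import numpy as np

# Hypothetical undirected friendship graph on 4 people, as an adjacency matrix.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
], dtype=float)

# A² counts length-2 paths; A³ counts length-3 paths.
A2 = A @ A

# Weighted 1st-3rd degree proximity (decay weights are illustrative):
P = A + 0.5 * A2 + 0.25 * (A @ A2)

# Proximity of everyone to person 0:
v = np.array([1, 0, 0, 0], dtype=float)
scores = P @ v
```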
Dictionary: Applications ↔ Linear Algebra
How does this help?
In Search:
- Latent Semantic Indexing (LSI)
- probabilistic LSI
- Latent Dirichlet Allocation
In Recommenders:
- Singular Value Decomposition
- Layered Restricted Boltzmann Machines (Deep Belief Networks)
In Graphs:
- PageRank
- Spectral Decomposition / Spectral Clustering
Often use “Dimensional Reduction”
- To alleviate the sparse Big Data problem of “the curse of dimensionality”
- Used to improve recall and relevance in general: smooth the metric on your data set
New applications with Matrices
- If Search is finding doc-vectors by scoring them against a query
- and users query with their data represented as a matrix: Q
- giving implicit feedback based on click-through per session, collected as a matrix: C
… continued
- Then the product of C and Q has the form (docs-by-terms) needed for search!
- Approach has been used by Ted Dunning at Veoh (and probably others)
Linear Algebra performance tricks
Naïve item-based recommendations:
- Calculate the item similarity matrix: AᵀA
- Calculate item recs: (AᵀA)·v
Express in one step; in matrix notation: Aᵀ(A·v)
Re-writing: a_v is the vector of preferences for user “v”, a_i is the vector of preferences of item “i”; the result is the matrix sum of the outer (tensor) products of these vectors, scaled by the entry they intersect at.
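The trick above can be checked numerically: Aᵀ(A·v) gives the same answer as materializing AᵀA first, and AᵀA itself equals the sum of outer products of the user rows. A sketch on random data (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 20))   # hypothetical users-by-items preference matrix
v = rng.random(20)         # one user's preference vector

# Naïve: materialize the item-item matrix S = AᵀA, then multiply.
naive = (A.T @ A) @ v

# One-step: Aᵀ(A·v) -- two matrix-vector products, never builds AᵀA.
one_step = A.T @ (A @ v)

# AᵀA as a sum of outer (tensor) products of user preference rows.
S = sum(np.outer(row, row) for row in A)
outer_sum = S @ v
```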
Item Recommender via Hadoop
Apache Mahout
- Apache Mahout currently on release 0.3: http://lucene.apache.org/mahout
- Will be a “Top Level Project” soon (before 0.4): ( http://mahout.apache.org )
- “Scalable Machine Learning with commercially friendly licensing”
Mahout Features
- Recommenders: absorbed the Taste project
- Classification (Naïve Bayes, C-Bayes, more)
- Clustering (Canopy, fuzzy-k-means, Dirichlet, etc.)
- Fast non-distributed linear mathematics: absorbed the classic CERN Colt project
- Distributed matrices and decomposition: absorbed the Decomposer project
- mahout shell script, analogous to $HADOOP_HOME/bin/hadoop:
  $MAHOUT_HOME/bin/mahout kmeans -i "in" -o "out" -k 100
  $MAHOUT_HOME/bin/mahout svd -i "in" -o "out" -k 300
  etc.
- Taste web-app for real-time recommendations
DistributedRowMatrix
- Wrapper around a SequenceFile<IntWritable,VectorWritable>
- Distributed methods like:
  Matrix transpose();
  Matrix times(Matrix other);
  Vector times(Vector v);
  Vector timesSquared(Vector v);
- To get SVD, pass into DistributedLanczosSolver:
  LanczosSolver.solve(Matrix input, Matrix eigenVectors, List<Double> eigenValues, int rank);
Questions?
Contact:
jake.mannix@gmail.com
jmannix@apache.org
http://twitter.com/pbrane
http://www.decomposer.org/blog
http://www.linkedin.com/in/jakemannix
Appendix
There are lots of ways to deal with sparse Big Data, and many (not all) need to deal with the dimensionality of the feature space growing beyond reasonable limits; techniques for this depend heavily on your data. That said, there are some general techniques.
Dealing with the Curse of Dimensionality
- Sparseness means fast, but overlap is too small
- Can we reduce the dimensionality (from “all possible text tokens” or “all userIds”) while keeping the nice aspects of the search problem?
- If possible, collapse “similar” vectors (synonymous terms, userIds with high overlap, etc.) towards each other while keeping “dissimilar” vectors far apart
Solution A: Matrix decomposition
- Singular Value Decomposition (truncated): the “best” low-rank approximation to your matrix
- Used in Latent Semantic Indexing (LSI)
- For graphs: spectral decomposition
- Collaborative filtering (Netflix leaderboard)
Issues:
- very computation intensive
- no parallelized open-source packages (but see Apache Mahout)
- makes things too dense
SVD: continued
- Hadoop impl. in Mahout (Lanczos)
- O(N·d·k) for rank-k SVD on N docs with d elements each
- Density can be dealt with by doing Canopy Clustering offline
- But only extracting linear feature mixes
- Also, still very computation- and I/O-intensive (k passes over the data set); are there better dimensional reduction methods?
Solution B: Stochastic Decomposition
co-occurrence-based kernel + online Random Projection + SVD
Co-occurrence-based kernel
- Extract bigram phrases / pairs of items rated by the same person (using the Log-Likelihood Ratio test to pick the best)
- “Disney on Ice was Amazing!” -> {“disney”, “disney on ice”, “ice”, “was”, “amazing”}
- {item1:4, item2:5, item5:3, item9:1} -> {item1:4, (items1+2):4.5, item2:5, item5:3, …}
- Dim(features) goes from 10^5 to 10^8+ (yikes!)
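The expansion step can be sketched as follows. This toy version keeps every bigram; the slide's actual kernel keeps only the pairs that pass a log-likelihood ratio test:

```python
# Toy sketch of the kernel expansion: augment a token list with all bigrams.
# (The real pipeline filters these with a log-likelihood ratio test.)
def expand_with_bigrams(tokens):
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

feats = expand_with_bigrams(["disney", "on", "ice", "was", "amazing"])
# 5 unigrams + 4 bigrams = 9 features; at corpus scale, the feature
# dimensionality explodes from ~10^5 tokens to 10^8+ token pairs.
```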
Online Random Projection
- Randomly project kernelized text vectors down to “merely” 10^3 dimensions with a Gaussian matrix
- Or project each nGram down to a random (but sparse) 10^3-dim vector:
  V = {123876244 => 1.3}   (tf-IDF of “disney”)
  V’ = c*{h(i) => 1, h(h(i)) => 1, h(h(h(i))) => 1}   (c = 1.3 / sqrt(3))
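A sketch of the sparse hashed projection. The slide chains one hash function (h(i), h(h(i)), …); this version uses seeded hashes instead, which is a variation, and the helper names (`h`, `project`) are hypothetical:

```python
import hashlib

D = 1000  # target dimensionality (the slide's "merely" 10^3)

def h(x, seed):
    """Stable hash of a feature id into [0, D). Hypothetical helper."""
    return int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16) % D

def project(feature_id, weight, probes=3):
    """Spread one weighted feature over `probes` hashed coordinates,
    scaling by weight/sqrt(probes) to roughly preserve its norm."""
    c = weight / probes ** 0.5
    out = {}
    for seed in range(probes):
        idx = h(feature_id, seed)
        out[idx] = out.get(idx, 0.0) + c
    return out

# tf-idf of "disney" (feature id 123876244, weight 1.3) -> sparse 1000-dim vector
v_prime = project(123876244, 1.3)
```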
Outer-product and Sum
- Take the 10^3-dim projected vectors and outer-product them with themselves; each result is a 10^3 x 10^3 matrix
- Sum these in a Combiner
- All results go to a single Reducer, where you compute…
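The map/combine/reduce flow above can be simulated in-process (d = 4 here instead of the slide's 10^3; the "combiner" and "reducer" are plain functions standing in for Hadoop's):

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
vectors = [rng.random(d) for _ in range(10)]  # projected vectors from "mappers"

def combine(partials):
    """Combiner: sum partial d x d matrices node-locally."""
    return sum(partials)

partial_a = combine(np.outer(v, v) for v in vectors[:5])   # combiner on node A
partial_b = combine(np.outer(v, v) for v in vectors[5:])   # combiner on node B
total = partial_a + partial_b                              # the single reducer

# Sanity: the sum of outer products equals the Gram matrix XᵀX.
X = np.stack(vectors)
```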
SVD
- SVD them quickly (they fit in memory)
- Over and over again (as new data comes in)
- Use the most recent SVD to project your (already randomly projected) text still further (now encoding “semantic” similarity)
- SVD-projected vectors can be assigned immediately to nearest clusters if desired
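Once the summed matrix is only 10^3 x 10^3, its SVD fits in memory and is cheap to recompute. A numpy sketch (matrix size 100 and rank 10 are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.random((100, 100))
M = M @ M.T                      # symmetric PSD, like a summed outer-product matrix

# In-memory SVD of the small co-occurrence matrix.
U, s, Vt = np.linalg.svd(M)
k = 10
U_k = U[:, :k]                   # top-k left singular vectors

# Project an (already randomly projected) vector into the k-dim "semantic" space.
x = rng.random(100)
x_semantic = U_k.T @ x
```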
References
- Randomized matrix decomposition review: http://arxiv.org/abs/0909.4061
- Sparse hashing/projection: John Langford et al., “Vowpal Wabbit”, http://hunch.net/~vw/

Seattle Scalability Mahout


Editor's Notes
- #28: And the usual references for LSI and Spectral Decomposition