Classification with Naïve Bayes: A Deep Dive into Apache Mahout
Today’s speaker – Josh Patterson
josh@cloudera.com / twitter: @jpatanooga
Master’s thesis: self-organizing mesh networks; published in IAAI-09: TinyTermite: A Secure Routing Algorithm
Conceived, built, and led Hadoop integration for the openPDC project at TVA (smart grid work)
Led a small team that designed classification techniques for time series and MapReduce
Open source work at http://openpdc.codeplex.com
Now: Solutions Architect at Cloudera
What is Classification?
Supervised learning: we give the system a set of instances to learn from
The system builds knowledge of some structure and learns “concepts”
The system can then classify new instances
Supervised vs. Unsupervised Learning
Supervised: give the system examples/instances of multiple concepts; the system learns the “concepts”; more “hands on”. Examples: Naïve Bayes, neural nets
Unsupervised: uses unlabeled data; builds a joint density model. Example: k-means clustering
Naïve Bayes
Called Naïve Bayes because it is based on Bayes’ rule and “naively” assumes independence given the label
It is only valid to multiply probabilities when the events are independent, which is a simplistic assumption in real life
Despite the name, Naïve Bayes works well on actual datasets
Naïve Bayes Classifier
A simple probabilistic classifier based on applying Bayes’ theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be “independent feature model”.
Naïve Bayes Classifier (2)
Assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. Example: a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
A Little Bit o’ Theory
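The equations from this slide are not preserved in the transcript. As a reminder of the standard form (not necessarily the exact layout used on the slide), Bayes' rule in the Pr[E|H] notation used in the following slides, combined with the naive independence assumption over the pieces of evidence E_n, is:

\[
\Pr[H \mid E] = \frac{\Pr[E \mid H]\,\Pr[H]}{\Pr[E]},
\qquad
\Pr[E \mid H] \approx \prod_{n} \Pr[E_n \mid H]
\]

The classifier picks the class H that maximizes the numerator; Pr[E] is the same for every class, so it only matters when turning scores into probabilities.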
Condensing Meaning …From the Vapor of That Last Big Equation
To train our system we need:
the total number of input training instances (count)
counts of the tuples {attribute_n, outcome_o, value_m}
the total count of each outcome_o ({outcome-count})
Each Pr[E_n | H] is then calculated as the {attribute_n, outcome_o, value_m} count / {outcome-count} (a small counting sketch follows)
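A minimal sketch of that bookkeeping in plain Java, assuming categorical attributes; the class and method names here are illustrative and are not Mahout code:

import java.util.HashMap;
import java.util.Map;

// Sketch only: count {attribute, value, outcome} tuples and per-outcome totals,
// then estimate Pr[E_n | H] as tupleCount / outcomeCount.
public class CountingSketch {
    private final Map<String, Integer> tupleCounts = new HashMap<>();   // "attr=value|outcome" -> count
    private final Map<String, Integer> outcomeCounts = new HashMap<>(); // outcome -> count

    // Record one training instance: its attribute/value pairs and its outcome.
    void observe(Map<String, String> attributes, String outcome) {
        outcomeCounts.merge(outcome, 1, Integer::sum);
        for (Map.Entry<String, String> a : attributes.entrySet()) {
            tupleCounts.merge(a.getKey() + "=" + a.getValue() + "|" + outcome, 1, Integer::sum);
        }
    }

    // Pr[E_n | H]: how often this attribute/value appeared among instances with this outcome.
    double pr(String attribute, String value, String outcome) {
        int tuple = tupleCounts.getOrDefault(attribute + "=" + value + "|" + outcome, 0);
        int total = outcomeCounts.getOrDefault(outcome, 0);
        return total == 0 ? 0.0 : (double) tuple / total;
    }
}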
A Real Example from Witten et al.
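The worked example itself is not preserved in this transcript. As a stand-in, here is the standard weather-data calculation from Witten et al.'s Data Mining book (14 training days, 9 with play = yes and 5 with play = no), classifying a new day with outlook = sunny, temperature = cool, humidity = high, windy = true:

Pr[yes] * Pr[E|yes] = 9/14 * (2/9 * 3/9 * 3/9 * 3/9) ≈ 0.0053
Pr[no] * Pr[E|no] = 5/14 * (3/5 * 1/5 * 4/5 * 3/5) ≈ 0.0206

Dividing each by their sum (which plays the role of Pr[E]) gives roughly 21% for “yes” and 79% for “no”, so the new day is classified as “no”.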
Enter Apache Mahout
What is it? Apache Mahout is a scalable machine learning library that supports large data sets
What are the major algorithm types? Classification, recommendation, and clustering
http://mahout.apache.org/
Mahout Algorithms
Naïve Bayes and Text
Naive Bayes does not model text well; see “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”: http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
Mahout makes some modifications based around TF-IDF scoring (next slide)
It also includes two other pre-processing steps, common for information retrieval but not for Naive Bayes classification
High-Level Algorithm
For each feature (word) in each doc, calculate the “weight-normalized TF-IDF”: for a given feature in a label, this is the TF-IDF calculated using the standard IDF multiplied by the weight-normalized TF
We calculate Sigma_k, the sum of the weight-normalized TF-IDF over all the features in a label, and set alpha_i == 1.0
Weight = log[ ( W-N-TF-IDF + alpha_i ) / ( Sigma_k + N ) ] (a small sketch of this calculation follows)
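A minimal sketch of that weighting step, assuming the weight-normalized TF-IDF value for the feature and Sigma_k for the label have already been computed; all names here are illustrative, not Mahout's classes:

// Sketch only: the per-feature weight from the formula above.
public class WeightSketch {

    /**
     * @param wnTfIdf   weight-normalized TF-IDF of the feature within the label
     * @param sigmaK    sum of weight-normalized TF-IDF over all features in the label
     * @param vocabSize N, the number of distinct features (vocabulary size)
     * @param alphaI    smoothing term, 1.0 in the slides
     */
    static double featureWeight(double wnTfIdf, double sigmaK, double vocabSize, double alphaI) {
        return Math.log((wnTfIdf + alphaI) / (sigmaK + vocabSize));
    }

    public static void main(String[] args) {
        // Toy numbers, purely for illustration.
        System.out.println(featureWeight(3.2, 870.0, 50_000, 1.0));
    }
}

The alpha_i term keeps the weight finite for features never seen with a given label, the same role Laplace-style smoothing plays in plain Naive Bayes.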
BayesDriver Training Workflow
Naïve Bayes training MapReduce workflow in Mahout
Logical Classification Process
Gather, clean, and examine the training data: really get to know your data!
Train the classifier, allowing the system to “learn” the “concepts”, but not “overfit” to this specific training data set
Classify new, unseen instances: with Naïve Bayes we calculate the probability of each class with respect to this instance
How Is Classification Done?
Sequentially or via MapReduce
TestClassifier.java creates a ClassifierContext, then:
For each file in the directory, for each line: break the line into a map of tokens, feed the array of words to the classifier engine for a new classification/label, and collect the classifications as output (a rough sketch of this loop follows)
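A rough sketch of that loop in plain Java; classifyTokens() here is a hypothetical stand-in for the call into the trained model (the ClassifierContext in Mahout's Bayes implementation), not the actual Mahout API:

import java.io.IOException;
import java.nio.file.*;

// Sketch of the per-document classification loop described above.
public class ClassifyDirSketch {

    // Hypothetical stand-in: a real implementation would sum the per-feature
    // log weights for each label and return the highest-scoring label.
    static String classifyTokens(String[] tokens) {
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args[0]);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) continue;
                for (String line : Files.readAllLines(file)) {
                    // Break the line into tokens (a naive whitespace split, for illustration only).
                    String[] tokens = line.toLowerCase().split("\\s+");
                    // Feed the words to the classifier and collect the resulting label.
                    System.out.println(file.getFileName() + "\t" + classifyTokens(tokens));
                }
            }
        }
    }
}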
A Quick Note About Training Data…
Your classifier can only be as good as the training data lets it be…
If you don’t do good data prep, everything will perform poorly
Data collection and pre-processing take the bulk of the time
Enough Math, Run the Code
Download and install Mahout: http://www.apache.org
Run the 20 Newsgroups example: https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
Uses Naïve Bayes classification
Download and extract 20news-bydate.tar.gz from the 20 Newsgroups dataset
Generate Test and Train Datasets
Training dataset:
mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-train \
  -o examples/bin/work/20news-bydate/bayes-train-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8
Test dataset:
mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-test \
  -o examples/bin/work/20news-bydate/bayes-test-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8
Train and Test the Classifier
Train:
$MAHOUT_HOME/bin/mahout trainclassifier \
  -i 20news-input/bayes-train-input \
  -o newsmodel \
  -type bayes \
  -ng 3 \
  -source hdfs
Test:
$MAHOUT_HOME/bin/mahout testclassifier \
  -m newsmodel \
  -d 20news-input \
  -type bayes \
  -ng 3 \
  -source hdfs \
  -method mapreduce
Other Use Cases
Predictive analytics: you’ll hear this term a lot in the field, especially in the context of SAS
General supervised learning classification: we can recognize a lot of things with practice (and lots of tuning!)
Document classification
Sentiment analysis
Questions?
We’re hiring!
Cloudera’s Distro of Apache Hadoop: http://www.cloudera.com
Resources: “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”, http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf


Editor's Notes

  • #2: https://cwiki.apache.org/MAHOUT/books-tutorials-and-talks.html
  • #6: Contrasts with the “1Rule” method (1Rule uses one attribute). NB allows all attributes to make contributions that are equally important and independent of one another.
  • #7: This classifier produces a probability estimate for each class rather than a prediction. Considered “supervised learning”.
  • #8: A comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as boosted trees or random forests. An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification.
  • #9: Pr[E|H] -> all evidence for instances with H = “yes”; Pr[H] -> the percentage of instances with this outcome; Pr[E] -> the sum of the values for all outcomes.
  • #10: Book reference: Snow Crash. For each attribute “a” there are multiple values, and given these combinations we need to look at how many times the instances were actually classified as each class. In training we use the term “outcome”; in classification we use the term “class”. Example: say we have 2 attributes in an instance.
  • #11: We don’t take into account some of the other things like “missing values” here
  • #13: Now that we’ve established the case for Naïve Bayes + text, show how it fits in with other classification algorithms.
  • #14: *** Need to sell the case for using another feature-calculation mechanic *** When one class has more training examples than another, Naive Bayes selects poor weights for the decision boundary. To balance the number of training examples used per estimate, they introduced a “complement class” formulation of Naive Bayes. A document is treated as a sequence of words, and it is assumed that each word position is generated independently of every other word.
  • #15: Term frequency = number of occurrences of the considered term t_i in document d_j / size of (words in doc d_j), normalized to protect against bias in larger docs. IDF = log(total number of documents / number of documents containing the term). The normalized frequency for a term (feature) in a document is calculated by dividing the term frequency by the root mean square of term frequencies in that document. The weight-normalized TF for a given feature in a given label = the sum of the normalized frequency of the feature across all the documents in the label.
  • #16: Need to get a better handle on Sigma_k / Sigma_Wij: https://cwiki.apache.org/MAHOUT/bayesian.html
  • #20: https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
  • #22: Can also test sequentially