I Built a Text Classifier with Hadoop/Mahout/HBase

Hadoop/Mahout/HBase

2011/04/10
#TokyoWebmining10-2

yanaoki

2011 4 18

•
• HBase
• Mahout
• Naive Bayes
•
• Web


•
• naoki yanai
•
•
• …

•
•
• Hadoop

•
•


HBase
• KeyValue

• read/write

• goal is the hosting of very large tables -- billions of rows, millions of columns ...

• Hadoop

• CAP: C, P

• C: consistency, A: availability, P: partition tolerance

• Sharding

• Hadoop/MapReduce

HBase
•
• ―

• ―

•

qualifier


Mahout
•
• Hadoop

•
• HBase

•
•
• Classifier / Clustering / Pattern Mining

• Recommenders / Collaborative Filtering

• Evolutionary Algorithms ...

Mahout

•
•
•
• Mahout

• Mahout in Action PDF

• hamadakoichi

• TokyoWebmining


Naive Bayes
• F1,...,Fn C

• C

•
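
With features F1,...,Fn and class C as named on this slide, the standard naive Bayes decision rule can be written as follows. This is the textbook form under the conditional-independence assumption, not necessarily the exact variant Mahout implements:

```latex
P(C \mid F_1, \dots, F_n) \;\propto\; P(C) \prod_{i=1}^{n} P(F_i \mid C),
\qquad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(F_i \mid C)
```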


• Web

•
•
•
•
•


• Ruby

• ExtractContent

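require "open-uri"
require "kconv"           # provides String#toutf8
require "extractcontent"

# Fetch the page, then extract the main body text and the title.
html = open("http://news.nifty.com/....htm").read
body, title = ExtractContent::analyse(html)

puts body.toutf8 #=> body text extracted from the HTML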


• Ruby

• scrAPI

require 'scrapi'
require 'open-uri'

# Collect the text of every div.tweet element into an array.
scr = Scraper.define do
  process "div.tweet", "tweets[]" => :text
  result :tweets
end

tweets = scr.scrape(URI.parse("http://togetter.com/li/121476"),
                    :parser_options => {:char_encoding => 'utf8'})

tweets.each { |tw| puts tw } # print each scraped tweet


• RSS HBase

•
(URL)                                             content   categories
http://togetter/1.html                            ...       category:src="togetter", category:cat="social"
http://news.nifty.com/....html                    AKB ...   category:src="nifty", category:cat="entertainment"
http://groups.google.com/group/webmining-tokyo/   10 …
http://ameblo.jp/....html                         KARA …
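
The table above can be read as a nested map from row key (the URL) to named columns. The following is a minimal Ruby sketch of that logical view, assuming `category` is a column family with `src` and `cat` as qualifiers. The `split_column` helper is hypothetical, the content column is omitted because its text is elided on the slide, and no HBase client is involved.

```ruby
# Logical view of the table: row key (URL) => { "family:qualifier" => value }.
# Rows without a category yet are represented by an empty column map.
table = {
  "http://togetter/1.html" => {
    "category:src" => "togetter",
    "category:cat" => "social"
  },
  "http://news.nifty.com/....html" => {
    "category:src" => "nifty",
    "category:cat" => "entertainment"
  },
  "http://groups.google.com/group/webmining-tokyo/" => {},
  "http://ameblo.jp/....html" => {}
}

# Hypothetical helper: split a column name into family and qualifier.
def split_column(column)
  family, qualifier = column.split(":", 2)
  { family: family, qualifier: qualifier }
end

# Rows that already carry category:cat can serve as training data;
# the others are the ones a classifier would have to label.
labeled, unlabeled = table.partition { |_url, columns| columns.key?("category:cat") }

puts "labeled:   #{labeled.map(&:first).inspect}"
puts "unlabeled: #{unlabeled.map(&:first).inspect}"
```

Keeping the source and the category as separate qualifiers in one family means a later job can fill in `category:cat` for unlabeled rows without touching the stored content.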


• HBase

category_id <TAB>

• HBase MapReduce HDFS

•
•
•
• Wikipedia

•
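
The `category_id <TAB>` line format above can be sketched as follows. The part after the tab is lost on the slide, so this assumes it is the document text as whitespace-separated tokens. The `training_line` helper and the sample tokens are hypothetical, for illustration only.

```ruby
# Hypothetical helper: format one training example as
# "category_id<TAB>token token ...".
def training_line(category_id, tokens)
  "#{category_id}\t#{tokens.join(' ')}"
end

# Illustrative documents (made-up tokens), one [category_id, tokens] pair each.
docs = [
  ["social",        %w[tweet summary timeline]],
  ["entertainment", %w[concert idol single]]
]

lines = docs.map { |category_id, tokens| training_line(category_id, tokens) }
puts lines
```

Japanese text would need word segmentation before this step, since the tokens are joined with plain spaces.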

• mahout

$ mahout trainclassifier ...

$ mahout testclassifier …

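• mahout options

• --input/--output: input/output paths

• --dataSource: HDFS or HBase

• --gramSize: N-gram size

• --classifierType

• --alpha: smoothing parameter

• --minDF/--minSupport: minimum document frequency / minimum support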
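
To make the `--gramSize` option concrete, here is a small Ruby sketch of word n-gram generation. It assumes that all n-grams from size 1 up to the given size are emitted as features. The `ngrams` helper is hypothetical, and Mahout's own tokenization and joining may differ.

```ruby
# Hypothetical helper: build all word n-grams of size 1..gram_size.
def ngrams(tokens, gram_size)
  (1..gram_size).flat_map do |n|
    tokens.each_cons(n).map { |gram| gram.join(" ") }
  end
end

tokens = %w[text classifier demo]
puts ngrams(tokens, 1).inspect # unigrams only
puts ngrams(tokens, 2).inspect # unigrams plus bigrams
```

A larger gram size adds word-order information but also grows the feature set, which is where options such as `--minDF` and `--minSupport` come in.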


• HBase

•
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       1884       82.2348%
Incorrectly Classified Instances        :        407       17.7652%
Total Classified Instances              :       2291
=======================================================
Confusion Matrix
-------------------------------------------------------
a       b       c       d       e       <--Classified as
216     32      22      155     0        | 425         a     = t
0       514     13      70      0        | 597         b     = s
0       2       514     9       0        | 525         c     = e
1       8       13      638     0        | 660         d     = b
0       0       67      15      2        | 84          e     = a
Default Category: unknown: 5
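
The summary figures can be rechecked from the confusion matrix itself. The sketch below copies the matrix above and assumes rows are actual labels and columns are predicted labels, as the "Classified as" header suggests. Variable names are illustrative.

```ruby
# Confusion matrix copied from the output above.
# Rows: actual label, columns: predicted label (order t, s, e, b, a).
labels = %w[t s e b a]
matrix = [
  [216, 32,  22,  155, 0],
  [0,   514, 13,  70,  0],
  [0,   2,   514, 9,   0],
  [1,   8,   13,  638, 0],
  [0,   0,   67,  15,  2]
]

total    = matrix.flatten.sum
correct  = matrix.each_index.sum { |i| matrix[i][i] }
accuracy = correct.to_f / total

# Per-class recall: diagonal entry divided by the row total.
recall = labels.each_with_index.map do |label, i|
  [label, matrix[i][i].to_f / matrix[i].sum]
end.to_h

puts format("accuracy: %.4f%% (%d / %d)", accuracy * 100, correct, total)
recall.each { |label, r| puts format("recall(%s): %.1f%%", label, r * 100) }
```

Per-class recall makes visible what the overall accuracy hides: class `a` has only 84 test instances and almost all of them are assigned to other classes.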


•
• reducer HBase

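// Set up the classifier (algorithm, datastore and classifier are assumed
// to be declared elsewhere, e.g. as fields)
BayesParameters params = new BayesParameters();
params.set("alpha_i", "1");
algorithm = new CBayesAlgorithm();
datastore = new HBaseBayesDatastore("model_table_name", params);
classifier = new ClassifierContext(algorithm, datastore);

// Classify a document given as a list of tokens; "default" is the fallback category
ClassifierResult category = classifier.classifyDocument(doc.toArray(new String[doc.size()]), "default");

String label = category.getLabel();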


•

(URL)                                             content   categories
http://togetter/1.html                            ...       category:src="togetter", category:cat="social"
http://news.nifty.com/....html                    AKB ...   category:src="nifty", category:cat="entertainment"
http://groups.google.com/group/webmining-tokyo/   10 …      category:cat="technology"
http://ameblo.jp/....html                         KARA …    category:cat="entertainment"


Web
• Google News and Togetter RSS

•
• …

• …
•
category   count    size
a            935    5.2M
b          5,112    7.2M
e          3,746    8.1M
s          4,737    12M
t          3,969    9.2M

Web
•
•
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :      13388        91.6798%
Incorrectly Classified Instances        :       1215         8.3202%
Total Classified Instances              :      14603

=======================================================
Confusion Matrix
-------------------------------------------------------
a         b         c         d         e         <--Classified as
2328      19        515       250       0          | 3112       a     = t
3         2939      54        20        0          | 3016       b     = e
32        3         3542      109       0          | 3686       c     = s
33        16        128       3877      0          | 4054       d     = b
1         27        2         3         702        | 735        e     = a
Default Category: unknown: 5


Web

•
•
• alpha

alpha      1        0.5      0.1      0.01     0.001
accuracy   65.38%   65.83%   66.73%   66.82%   67.02%
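
To show what alpha controls, the following Ruby sketch applies generic additive (Lidstone) smoothing to a word likelihood. It is an illustration of the idea only: Mahout's Bayes and CBayes weight computations are more involved. The `smoothed_likelihood` helper and the counts (1,000 tokens in the category, 5,000 vocabulary words) are hypothetical.

```ruby
# Hypothetical helper: additive smoothing of P(word | category).
#   count           - occurrences of the word in the category
#   category_total  - total token count in the category
#   vocabulary_size - number of distinct words overall
#   alpha           - smoothing parameter
def smoothed_likelihood(count, category_total, vocabulary_size, alpha)
  (count + alpha) / (category_total + alpha * vocabulary_size).to_f
end

# Probability mass given to a word never seen in the category,
# for the alpha values tried above (counts are made up).
[1, 0.5, 0.1, 0.01, 0.001].each do |alpha|
  unseen = smoothed_likelihood(0, 1000, 5000, alpha)
  puts format("alpha=%-5s P(unseen word | category) = %.8f", alpha, unseen)
end
```

Smaller alpha values give unseen words less probability mass, so the observed counts dominate the estimate.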


Web

•
•
• N-Gram

unigram   bigram
63.57%    66.09%


Web

•
•
•

+

56.8% 65.38%


Web

•
•
•

67.02% 67.88%


•
•
• HBase/Mahout

•
• HBase
