SlideShare a Scribd company logo
Hadoop/Mahout/HBase



                     2011/04/10
                   #TokyoWebmining10-2

                         yanaoki



2011   4   18
•
                • HBase
                • Mahout
                • Naive Bayes
                •
                • Web

2011   4   18
•
                    •   naoki yanai
                •
                    •
                    •                 …

                •
                    •
                    •       Hadoop

                •
                    •

2011   4   18
HBase
                •   KeyValue

                    •                                                         read/write

                        •   goal is the hosting of very large tables -- billions of rows ,
                            millions of columns ...


                    •   Hadoop

                •   CAP                   C,P

                    •   C:            ,A:             ,P:

                •            Sharding

                •   Hadoop/MapReduce
2011   4   18
HBase
                •
                    •   ―

                    •   ―

                    •



                            qualifier

2011   4   18
Mahout
           •
           •    Hadoop

                •
                •                          HBase

                •
           •
                •   Classifier / Clustering / Pattern Mining

                •   Recommenders / Collaborative Filtering

                •   Evolutionary Algorithms ...
2011   4   18
Mahout

           •
           •
                •
                •   Mahout

                •   Mahout in Action PDF

                •   hamadakoichi

                •   TokyoWebmining

2011   4   18
Naive Bayes
           •        F1,...,Fn           C




           •    C




           •


2011   4   18
Naive Bayes
                •
                    •
                        •
                    •
                        •
                    •
                        •
2011   4   18
Naive Bayes
                •
                    •
                •
                    •
                •
                    •
                •
                    •
2011   4   18
•       Web

                    •
                    •
                    •
                •
                    •

2011   4   18
2011   4   18
•    Ruby

                •   ExtractContent

           require "open-uri"
           require "extractcontent"

           html = open("http://
           news.nifty.com/....htm").read
           body, title = ExtractContent::analyse(html)

           puts body.toutf8 #=>        HTML


2011   4   18
•    Ruby

                •   scrAPI


       require 'scrapi'
       require 'open-uri'

       scr = Scraper.define do
        process "div.tweet", "tweets[]"=> :text
        result :tweets
       end

       tweets = scr.scrape(URI.parse("http://togetter.com/li/
       121476"), :parser_options => {:char_encoding => 'utf8'})

       tweets.each{ |tw| puts tw } #=>


2011   4   18
•                                             RSS                      HBase


           •
                      (URL)
                                         content                         categories

       http://togetter/1.html                                  category:src=”togetter”
                                                   ...
                                                               category:cat=”social”

       http://                                                 category:src=”nifty”
       news.nifty.com/....html     AKB      ...
                                                               category:cat=”entertainment”
       http://groups.google.com/                         10
       group/webmining-tokyo/
                                                  …

       http://ameblo.jp/....html
                                   KARA …

2011   4   18
•    HBase

                    category_id <TAB>

           •    HBase           MaprReduce   HDFS

                •
                    •
                    •
                        •   Wikipedia

                    •
2011   4   18
•    mahout

                    $ mahout trainclassifier       ...

                    $ mahout testclassifier        …

           •    mahout

                •    --input/--output         /

                •    --dataSource                   HDFS   HBase

                •    --gramSize     N-gram

                •    --classifierType

                •    --alpha

                •    --minDF/--minSupport                  /

2011   4   18
•                            HBase


           •
           =======================================================
           Summary
           -------------------------------------------------------
           Correctly Classified Instances          :       1884       82.2348%
           Incorrectly Classified Instances        :        407       17.7652%
           Total Classified Instances              :       2291
           =======================================================
           Confusion Matrix
           -------------------------------------------------------
           a       b       c       d       e       <--Classified as
           216     32      22      155     0        |  425         a     = t
           0       514     13      70      0        |  597         b     = s
           0       2       514     9       0        |  525         c     = e
           1       8       13      638     0        |  660         d     = b
           0       0       67      15      2        |  84          e     = a
           Default Category: unknown: 5


2011   4   18
•
           •                                      reducer                      HBase


            //
            BayesParameters params = new BayesParameters();
            params.set("alpha_i", "1");
            algorithm = new CBayesAlgorithm();
            datastore = new HBaseBayesDatastore("model_table_name", params);
            classifier = new ClassifierContext(algorithm, datastore);

            //
            ClassifierResult category = classifier.classifyDocument(doc.toArray(new String
            [doc.size()]), "default");

            String label = category.getLabel();


2011   4   18
•

                      (URL)
                                         content                        categories

       http://togetter/1.html                                 category:src=”togetter”
                                                   ...
                                                              category:cat=”social”

       http://                                                category:src=”nifty”
       news.nifty.com/....html     AKB      ...
                                                              category:cat=”entertainment”
       http://groups.google.com/                         10
       group/webmining-tokyo/                                 category:cat=”technology”
                                                  …

       http://ameblo.jp/....html
                                   KARA …                     category:cat=”entertainment”

2011   4   18
Web




2011   4   18
Web
                •   Google News Togetter
                                   RSS

                •
                    •                              …

                    •                                         …
                •
                        a                   935        5.2M
                        b                  5,112       7.2M
                        e                  3,746       8.1M
                        s                  4,737       12M
                        t                  3,969       9.2M
2011   4   18
4/18

                                 Web
                •
                      •
                =======================================================
                Summary
                -------------------------------------------------------
                Correctly Classified Instances          :      13388        91.6798%
                Incorrectly Classified Instances        :       1215         8.3202%
                Total Classified Instances              :      14603

                =======================================================
                Confusion Matrix
                -------------------------------------------------------
                a         b         c         d         e         <--Classified as
                2328      19        515       250       0          |  3112       a       =   t
                3         2939      54        20        0          |  3016       b       =   e
                32        3         3542      109       0          |  3686       c       =   s
                33        16        128       3877      0          |  4054       d       =   b
                1         27        2         3         702        |  735        e       =   a
                Default Category: unknown: 5


2011   4   18
Web


                •
                    •
                        •                              alpha


                              1         0.5     0.1        0.01    0.001




                            65.38%   65.83%   66.73%     66.82%   67.02%


2011   4   18
4/18

                               Web


                •
                    •
                        •   N-Gram


                                     unigram   bigram


                                     63.57%    66.09%


2011   4   18
Web


                •
                    •
                        •

                                           +




                                  56.8%   65.38%


2011   4   18
4/18

                            Web


                •
                    •
                        •



                                  67.02%   67.88%


2011   4   18
•
                    •
                •               HBase/Mahout

                    •
                    •   HBase



2011   4   18
2011   4   18

More Related Content

Viewers also liked (15)

Mahoutにパッチを送ってみた
Mahoutにパッチを送ってみたMahoutにパッチを送ってみた
Mahoutにパッチを送ってみた
issaymk2
 
ComplementaryNaiveBayesClassifier
ComplementaryNaiveBayesClassifierComplementaryNaiveBayesClassifier
ComplementaryNaiveBayesClassifier
Naoki Yanai
 
Introduction to fuzzy kmeans on mahout
Introduction to fuzzy kmeans on mahoutIntroduction to fuzzy kmeans on mahout
Introduction to fuzzy kmeans on mahout
takaya imai
 
Introduction to Mahout Clustering - #TokyoWebmining #6
Introduction to Mahout Clustering - #TokyoWebmining #6Introduction to Mahout Clustering - #TokyoWebmining #6
Introduction to Mahout Clustering - #TokyoWebmining #6
Koichi Hamada
 
Apache Mahout - Random Forests - #TokyoWebmining #8
Apache Mahout - Random Forests - #TokyoWebmining #8 Apache Mahout - Random Forests - #TokyoWebmining #8
Apache Mahout - Random Forests - #TokyoWebmining #8
Koichi Hamada
 
協調フィルタリング with Mahout
協調フィルタリング with Mahout協調フィルタリング with Mahout
協調フィルタリング with Mahout
Katsuhiro Takata
 
Mahout Canopy Clustering - #TokyoWebmining 9
Mahout Canopy Clustering - #TokyoWebmining 9Mahout Canopy Clustering - #TokyoWebmining 9
Mahout Canopy Clustering - #TokyoWebmining 9
Koichi Hamada
 
"Mahout Recommendation" - #TokyoWebmining 14th
"Mahout Recommendation" -  #TokyoWebmining 14th"Mahout Recommendation" -  #TokyoWebmining 14th
"Mahout Recommendation" - #TokyoWebmining 14th
Koichi Hamada
 
MapReduceによる大規模データを利用した機械学習
MapReduceによる大規模データを利用した機械学習MapReduceによる大規模データを利用した機械学習
MapReduceによる大規模データを利用した機械学習
Preferred Networks
 
20161029 TVI Tokyowebmining Seminar for Share
20161029 TVI Tokyowebmining Seminar for Share20161029 TVI Tokyowebmining Seminar for Share
20161029 TVI Tokyowebmining Seminar for Share
Yasushi Gunya
 
計量経済学と 機械学習の交差点入り口 (公開用)
計量経済学と 機械学習の交差点入り口 (公開用)計量経済学と 機械学習の交差点入り口 (公開用)
計量経済学と 機械学習の交差点入り口 (公開用)
Shota Yasui
 
オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京
オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京
オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京
Koichi Hamada
 
Appium: Automation for Mobile Apps
Appium: Automation for Mobile AppsAppium: Automation for Mobile Apps
Appium: Automation for Mobile Apps
Sauce Labs
 
はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-
はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-
はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-
Naoki Yanai
 
Mahoutにパッチを送ってみた
Mahoutにパッチを送ってみたMahoutにパッチを送ってみた
Mahoutにパッチを送ってみた
issaymk2
 
ComplementaryNaiveBayesClassifier
ComplementaryNaiveBayesClassifierComplementaryNaiveBayesClassifier
ComplementaryNaiveBayesClassifier
Naoki Yanai
 
Introduction to fuzzy kmeans on mahout
Introduction to fuzzy kmeans on mahoutIntroduction to fuzzy kmeans on mahout
Introduction to fuzzy kmeans on mahout
takaya imai
 
Introduction to Mahout Clustering - #TokyoWebmining #6
Introduction to Mahout Clustering - #TokyoWebmining #6Introduction to Mahout Clustering - #TokyoWebmining #6
Introduction to Mahout Clustering - #TokyoWebmining #6
Koichi Hamada
 
Apache Mahout - Random Forests - #TokyoWebmining #8
Apache Mahout - Random Forests - #TokyoWebmining #8 Apache Mahout - Random Forests - #TokyoWebmining #8
Apache Mahout - Random Forests - #TokyoWebmining #8
Koichi Hamada
 
協調フィルタリング with Mahout
協調フィルタリング with Mahout協調フィルタリング with Mahout
協調フィルタリング with Mahout
Katsuhiro Takata
 
Mahout Canopy Clustering - #TokyoWebmining 9
Mahout Canopy Clustering - #TokyoWebmining 9Mahout Canopy Clustering - #TokyoWebmining 9
Mahout Canopy Clustering - #TokyoWebmining 9
Koichi Hamada
 
"Mahout Recommendation" - #TokyoWebmining 14th
"Mahout Recommendation" -  #TokyoWebmining 14th"Mahout Recommendation" -  #TokyoWebmining 14th
"Mahout Recommendation" - #TokyoWebmining 14th
Koichi Hamada
 
MapReduceによる大規模データを利用した機械学習
MapReduceによる大規模データを利用した機械学習MapReduceによる大規模データを利用した機械学習
MapReduceによる大規模データを利用した機械学習
Preferred Networks
 
20161029 TVI Tokyowebmining Seminar for Share
20161029 TVI Tokyowebmining Seminar for Share20161029 TVI Tokyowebmining Seminar for Share
20161029 TVI Tokyowebmining Seminar for Share
Yasushi Gunya
 
計量経済学と 機械学習の交差点入り口 (公開用)
計量経済学と 機械学習の交差点入り口 (公開用)計量経済学と 機械学習の交差点入り口 (公開用)
計量経済学と 機械学習の交差点入り口 (公開用)
Shota Yasui
 
オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京
オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京
オープニングトーク - 創設の思い・目的・進行方針  -データマイニング+WEB勉強会@東京
Koichi Hamada
 
Appium: Automation for Mobile Apps
Appium: Automation for Mobile AppsAppium: Automation for Mobile Apps
Appium: Automation for Mobile Apps
Sauce Labs
 
はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-
はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-
はじめてでもわかるベイズ分類器 -基礎からMahout実装まで-
Naoki Yanai
 

Similar to Hadoop/Mahout/HBaseで テキスト分類器を作ったよ (20)

The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
J Singh
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
Yahoo Developer Network
 
Hadoop: A Hands-on Introduction
Hadoop: A Hands-on IntroductionHadoop: A Hands-on Introduction
Hadoop: A Hands-on Introduction
Claudio Martella
 
Distributed data mining
Distributed data miningDistributed data mining
Distributed data mining
Ahmad Ammari
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
Corley S.r.l.
 
제1회 Korea Community Day 발표자료 Bigdata
제1회 Korea Community Day 발표자료 Bigdata 제1회 Korea Community Day 발표자료 Bigdata
제1회 Korea Community Day 발표자료 Bigdata
Gruter
 
Hadoop pycon2011uk
Hadoop pycon2011ukHadoop pycon2011uk
Hadoop pycon2011uk
Aditya Sakhuja
 
Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with Hadoop
Grant Ingersoll
 
Building Enterprise Apps for Big Data with Cascading
Building Enterprise Apps for Big Data with CascadingBuilding Enterprise Apps for Big Data with Cascading
Building Enterprise Apps for Big Data with Cascading
Paco Nathan
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
Milind Bhandarkar
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
Rohit
 
A Data Scientist And A Log File Walk Into A Bar...
A Data Scientist And A Log File Walk Into A Bar...A Data Scientist And A Log File Walk Into A Bar...
A Data Scientist And A Log File Walk Into A Bar...
Paco Nathan
 
Jan 2012 HUG: HCatalog
Jan 2012 HUG: HCatalogJan 2012 HUG: HCatalog
Jan 2012 HUG: HCatalog
Yahoo Developer Network
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
tcloudcomputing-tw
 
Mar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBaseMar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBase
Yahoo Developer Network
 
Future of HCatalog
Future of HCatalogFuture of HCatalog
Future of HCatalog
DataWorks Summit
 
Introduction to HADOOP
Introduction to HADOOPIntroduction to HADOOP
Introduction to HADOOP
Shital Kat
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the why
Korea Sdec
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
Hadoop World 2011: Radoop: a Graphical Analytics Tool for Big Data - Gabor Ma...
Hadoop World 2011: Radoop: a Graphical Analytics Tool for Big Data - Gabor Ma...Hadoop World 2011: Radoop: a Graphical Analytics Tool for Big Data - Gabor Ma...
Hadoop World 2011: Radoop: a Graphical Analytics Tool for Big Data - Gabor Ma...
Cloudera, Inc.
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
J Singh
 
Hadoop: A Hands-on Introduction
Hadoop: A Hands-on IntroductionHadoop: A Hands-on Introduction
Hadoop: A Hands-on Introduction
Claudio Martella
 
Distributed data mining
Distributed data miningDistributed data mining
Distributed data mining
Ahmad Ammari
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
Corley S.r.l.
 
제1회 Korea Community Day 발표자료 Bigdata
제1회 Korea Community Day 발표자료 Bigdata 제1회 Korea Community Day 발표자료 Bigdata
제1회 Korea Community Day 발표자료 Bigdata
Gruter
 
Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with Hadoop
Grant Ingersoll
 
Building Enterprise Apps for Big Data with Cascading
Building Enterprise Apps for Big Data with CascadingBuilding Enterprise Apps for Big Data with Cascading
Building Enterprise Apps for Big Data with Cascading
Paco Nathan
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
Rohit
 
A Data Scientist And A Log File Walk Into A Bar...
A Data Scientist And A Log File Walk Into A Bar...A Data Scientist And A Log File Walk Into A Bar...
A Data Scientist And A Log File Walk Into A Bar...
Paco Nathan
 
Introduction to HADOOP
Introduction to HADOOPIntroduction to HADOOP
Introduction to HADOOP
Shital Kat
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the why
Korea Sdec
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
Hadoop World 2011: Radoop: a Graphical Analytics Tool for Big Data - Gabor Ma...
Hadoop World 2011: Radoop: a Graphical Analytics Tool for Big Data - Gabor Ma...Hadoop World 2011: Radoop: a Graphical Analytics Tool for Big Data - Gabor Ma...
Hadoop World 2011: Radoop: a Graphical Analytics Tool for Big Data - Gabor Ma...
Cloudera, Inc.
 

Recently uploaded (20)

Fully Open-Source Private Clouds: Freedom, Security, and Control
Fully Open-Source Private Clouds: Freedom, Security, and ControlFully Open-Source Private Clouds: Freedom, Security, and Control
Fully Open-Source Private Clouds: Freedom, Security, and Control
ShapeBlue
 
GDG Cloud Southlake #43: Tommy Todd: The Quantum Apocalypse: A Looming Threat...
GDG Cloud Southlake #43: Tommy Todd: The Quantum Apocalypse: A Looming Threat...GDG Cloud Southlake #43: Tommy Todd: The Quantum Apocalypse: A Looming Threat...
GDG Cloud Southlake #43: Tommy Todd: The Quantum Apocalypse: A Looming Threat...
James Anderson
 
Talk: On an adventure into the depths of Maven - Kaya Weers
Talk: On an adventure into the depths of Maven - Kaya WeersTalk: On an adventure into the depths of Maven - Kaya Weers
Talk: On an adventure into the depths of Maven - Kaya Weers
Kaya Weers
 
Building Agents with LangGraph & Gemini
Building Agents with LangGraph &  GeminiBuilding Agents with LangGraph &  Gemini
Building Agents with LangGraph & Gemini
HusseinMalikMammadli
 
Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)
Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)
Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)
Eugene Fidelin
 
cloudgenesis cloud workshop , gdg on campus mita
cloudgenesis cloud workshop , gdg on campus mitacloudgenesis cloud workshop , gdg on campus mita
cloudgenesis cloud workshop , gdg on campus mita
siyaldhande02
 
Content and eLearning Standards: Finding the Best Fit for Your-Training
Content and eLearning Standards: Finding the Best Fit for Your-TrainingContent and eLearning Standards: Finding the Best Fit for Your-Training
Content and eLearning Standards: Finding the Best Fit for Your-Training
Rustici Software
 
What’s New in Web3 Development Trends to Watch in 2025.pptx
What’s New in Web3 Development Trends to Watch in 2025.pptxWhat’s New in Web3 Development Trends to Watch in 2025.pptx
What’s New in Web3 Development Trends to Watch in 2025.pptx
Lisa ward
 
Optimize IBM i with Consulting Services Help
Optimize IBM i with Consulting Services HelpOptimize IBM i with Consulting Services Help
Optimize IBM i with Consulting Services Help
Alice Gray
 
Security Operations and the Defense Analyst - Splunk Certificate
Security Operations and the Defense Analyst - Splunk CertificateSecurity Operations and the Defense Analyst - Splunk Certificate
Security Operations and the Defense Analyst - Splunk Certificate
VICTOR MAESTRE RAMIREZ
 
What is DePIN? The Hottest Trend in Web3 Right Now!
What is DePIN? The Hottest Trend in Web3 Right Now!What is DePIN? The Hottest Trend in Web3 Right Now!
What is DePIN? The Hottest Trend in Web3 Right Now!
cryptouniversityoffi
 
UiPath Community Zurich: Release Management and Build Pipelines
UiPath Community Zurich: Release Management and Build PipelinesUiPath Community Zurich: Release Management and Build Pipelines
UiPath Community Zurich: Release Management and Build Pipelines
UiPathCommunity
 
Artificial Intelligence (Kecerdasan Buatan).pdf
Artificial Intelligence (Kecerdasan Buatan).pdfArtificial Intelligence (Kecerdasan Buatan).pdf
Artificial Intelligence (Kecerdasan Buatan).pdf
NufiEriKusumawati
 
Build your own NES Emulator... with Kotlin
Build your own NES Emulator... with KotlinBuild your own NES Emulator... with Kotlin
Build your own NES Emulator... with Kotlin
Artur Skowroński
 
Agentic AI - The New Era of Intelligence
Agentic AI - The New Era of IntelligenceAgentic AI - The New Era of Intelligence
Agentic AI - The New Era of Intelligence
Muzammil Shah
 
Supercharge Your AI Development with Local LLMs
Supercharge Your AI Development with Local LLMsSupercharge Your AI Development with Local LLMs
Supercharge Your AI Development with Local LLMs
Francesco Corti
 
With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...
With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...
With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...
SOFTTECHHUB
 
Multistream in SIP and NoSIP @ OpenSIPS Summit 2025
Multistream in SIP and NoSIP @ OpenSIPS Summit 2025Multistream in SIP and NoSIP @ OpenSIPS Summit 2025
Multistream in SIP and NoSIP @ OpenSIPS Summit 2025
Lorenzo Miniero
 
Splunk Leadership Forum Wien - 20.05.2025
Splunk Leadership Forum Wien - 20.05.2025Splunk Leadership Forum Wien - 20.05.2025
Splunk Leadership Forum Wien - 20.05.2025
Splunk
 
The fundamental misunderstanding in Team Topologies
The fundamental misunderstanding in Team TopologiesThe fundamental misunderstanding in Team Topologies
The fundamental misunderstanding in Team Topologies
Patricia Aas
 
Fully Open-Source Private Clouds: Freedom, Security, and Control
Fully Open-Source Private Clouds: Freedom, Security, and ControlFully Open-Source Private Clouds: Freedom, Security, and Control
Fully Open-Source Private Clouds: Freedom, Security, and Control
ShapeBlue
 
GDG Cloud Southlake #43: Tommy Todd: The Quantum Apocalypse: A Looming Threat...
GDG Cloud Southlake #43: Tommy Todd: The Quantum Apocalypse: A Looming Threat...GDG Cloud Southlake #43: Tommy Todd: The Quantum Apocalypse: A Looming Threat...
GDG Cloud Southlake #43: Tommy Todd: The Quantum Apocalypse: A Looming Threat...
James Anderson
 
Talk: On an adventure into the depths of Maven - Kaya Weers
Talk: On an adventure into the depths of Maven - Kaya WeersTalk: On an adventure into the depths of Maven - Kaya Weers
Talk: On an adventure into the depths of Maven - Kaya Weers
Kaya Weers
 
Building Agents with LangGraph & Gemini
Building Agents with LangGraph &  GeminiBuilding Agents with LangGraph &  Gemini
Building Agents with LangGraph & Gemini
HusseinMalikMammadli
 
Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)
Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)
Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025)
Eugene Fidelin
 
cloudgenesis cloud workshop , gdg on campus mita
cloudgenesis cloud workshop , gdg on campus mitacloudgenesis cloud workshop , gdg on campus mita
cloudgenesis cloud workshop , gdg on campus mita
siyaldhande02
 
Content and eLearning Standards: Finding the Best Fit for Your-Training
Content and eLearning Standards: Finding the Best Fit for Your-TrainingContent and eLearning Standards: Finding the Best Fit for Your-Training
Content and eLearning Standards: Finding the Best Fit for Your-Training
Rustici Software
 
What’s New in Web3 Development Trends to Watch in 2025.pptx
What’s New in Web3 Development Trends to Watch in 2025.pptxWhat’s New in Web3 Development Trends to Watch in 2025.pptx
What’s New in Web3 Development Trends to Watch in 2025.pptx
Lisa ward
 
Optimize IBM i with Consulting Services Help
Optimize IBM i with Consulting Services HelpOptimize IBM i with Consulting Services Help
Optimize IBM i with Consulting Services Help
Alice Gray
 
Security Operations and the Defense Analyst - Splunk Certificate
Security Operations and the Defense Analyst - Splunk CertificateSecurity Operations and the Defense Analyst - Splunk Certificate
Security Operations and the Defense Analyst - Splunk Certificate
VICTOR MAESTRE RAMIREZ
 
What is DePIN? The Hottest Trend in Web3 Right Now!
What is DePIN? The Hottest Trend in Web3 Right Now!What is DePIN? The Hottest Trend in Web3 Right Now!
What is DePIN? The Hottest Trend in Web3 Right Now!
cryptouniversityoffi
 
UiPath Community Zurich: Release Management and Build Pipelines
UiPath Community Zurich: Release Management and Build PipelinesUiPath Community Zurich: Release Management and Build Pipelines
UiPath Community Zurich: Release Management and Build Pipelines
UiPathCommunity
 
Artificial Intelligence (Kecerdasan Buatan).pdf
Artificial Intelligence (Kecerdasan Buatan).pdfArtificial Intelligence (Kecerdasan Buatan).pdf
Artificial Intelligence (Kecerdasan Buatan).pdf
NufiEriKusumawati
 
Build your own NES Emulator... with Kotlin
Build your own NES Emulator... with KotlinBuild your own NES Emulator... with Kotlin
Build your own NES Emulator... with Kotlin
Artur Skowroński
 
Agentic AI - The New Era of Intelligence
Agentic AI - The New Era of IntelligenceAgentic AI - The New Era of Intelligence
Agentic AI - The New Era of Intelligence
Muzammil Shah
 
Supercharge Your AI Development with Local LLMs
Supercharge Your AI Development with Local LLMsSupercharge Your AI Development with Local LLMs
Supercharge Your AI Development with Local LLMs
Francesco Corti
 
With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...
With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...
With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...
SOFTTECHHUB
 
Multistream in SIP and NoSIP @ OpenSIPS Summit 2025
Multistream in SIP and NoSIP @ OpenSIPS Summit 2025Multistream in SIP and NoSIP @ OpenSIPS Summit 2025
Multistream in SIP and NoSIP @ OpenSIPS Summit 2025
Lorenzo Miniero
 
Splunk Leadership Forum Wien - 20.05.2025
Splunk Leadership Forum Wien - 20.05.2025Splunk Leadership Forum Wien - 20.05.2025
Splunk Leadership Forum Wien - 20.05.2025
Splunk
 
The fundamental misunderstanding in Team Topologies
The fundamental misunderstanding in Team TopologiesThe fundamental misunderstanding in Team Topologies
The fundamental misunderstanding in Team Topologies
Patricia Aas
 

Hadoop/Mahout/HBaseで テキスト分類器を作ったよ

  • 1. Hadoop/Mahout/HBase 2011/04/10 #TokyoWebmining10-2 yanaoki 2011 4 18
  • 2. • HBase • Mahout • Naive Bayes • • Web 2011 4 18
  • 3. • naoki yanai • • • … • • • Hadoop • • 2011 4 18
  • 4. HBase • KeyValue • read/write • goal is the hosting of very large tables -- billions of rows , millions of columns ... • Hadoop • CAP C,P • C: ,A: ,P: • Sharding • Hadoop/MapReduce 2011 4 18
  • 5. HBase • • ― • ― • qualifier 2011 4 18
  • 6. Mahout • • Hadoop • • HBase • • • Classifier / Clustering / Pattern Mining • Recommenders / Collaborative Filtering • Evolutionary Algorithms ... 2011 4 18
  • 7. Mahout • • • • Mahout • Mahout in Action PDF • hamadakoichi • TokyoWebmining 2011 4 18
  • 8. Naive Bayes • F1,...,Fn C • C • 2011 4 18
  • 9. Naive Bayes • • • • • • • 2011 4 18
  • 10. Naive Bayes • • • • • • • • 2011 4 18
  • 11. Web • • • • • 2011 4 18
  • 12. 2011 4 18
  • 13. Ruby • ExtractContent require "open-uri" require "extractcontent" html = open("http:// news.nifty.com/....htm").read body, title = ExtractContent::analyse(html) puts body.toutf8 #=> HTML 2011 4 18
  • 14. Ruby • scrAPI require 'scrapi' require 'open-uri' scr = Scraper.define do process "div.tweet", "tweets[]"=> :text result :tweets end tweets = scr.scrape(URI.parse("http://togetter.com/li/ 121476"), :parser_options => {:char_encoding => 'utf8'}) tweets.each{ |tw| puts tw } #=> 2011 4 18
  • 15. RSS HBase • (URL) content categories http://togetter/1.html category:src=”togetter” ... category:cat=”social” http:// category:src=”nifty” news.nifty.com/....html AKB ... category:cat=”entertainment” http://groups.google.com/ 10 group/webmining-tokyo/ … http://ameblo.jp/....html KARA … 2011 4 18
  • 16. HBase category_id <TAB> • HBase MaprReduce HDFS • • • • Wikipedia • 2011 4 18
  • 17. mahout $ mahout trainclassifier ... $ mahout testclassifier … • mahout • --input/--output / • --dataSource HDFS HBase • --gramSize N-gram • --classifierType • --alpha • --minDF/--minSupport / 2011 4 18
  • 18. HBase • ======================================================= Summary ------------------------------------------------------- Correctly Classified Instances          :       1884       82.2348% Incorrectly Classified Instances        :        407       17.7652% Total Classified Instances              :       2291 ======================================================= Confusion Matrix ------------------------------------------------------- a       b       c       d       e       <--Classified as 216     32      22      155     0        |  425         a     = t 0       514     13      70      0        |  597         b     = s 0       2       514     9       0        |  525         c     = e 1       8       13      638     0        |  660         d     = b 0       0       67      15      2        |  84          e     = a Default Category: unknown: 5 2011 4 18
  • 19. • reducer HBase // BayesParameters params = new BayesParameters(); params.set("alpha_i", "1"); algorithm = new CBayesAlgorithm(); datastore = new HBaseBayesDatastore("model_table_name", params); classifier = new ClassifierContext(algorithm, datastore); // ClassifierResult category = classifier.classifyDocument(doc.toArray(new String [doc.size()]), "default"); String label = category.getLabel(); 2011 4 18
  • 20. (URL) content categories http://togetter/1.html category:src=”togetter” ... category:cat=”social” http:// category:src=”nifty” news.nifty.com/....html AKB ... category:cat=”entertainment” http://groups.google.com/ 10 group/webmining-tokyo/ category:cat=”technology” … http://ameblo.jp/....html KARA … category:cat=”entertainment” 2011 4 18
  • 21. Web 2011 4 18
  • 22. Web • Google News Togetter RSS • • … • … • a 935 5.2M b 5,112 7.2M e 3,746 8.1M s 4,737 12M t 3,969 9.2M 2011 4 18
  • 23. 4/18 Web • • ======================================================= Summary ------------------------------------------------------- Correctly Classified Instances          :      13388        91.6798% Incorrectly Classified Instances        :       1215         8.3202% Total Classified Instances              :      14603 ======================================================= Confusion Matrix ------------------------------------------------------- a         b         c         d         e         <--Classified as 2328      19        515       250       0          |  3112       a     = t 3         2939      54        20        0          |  3016       b     = e 32        3         3542      109       0          |  3686       c     = s 33        16        128       3877      0          |  4054       d     = b 1         27        2         3         702        |  735        e     = a Default Category: unknown: 5 2011 4 18
  • 24. Web • • • alpha 1 0.5 0.1 0.01 0.001 65.38% 65.83% 66.73% 66.82% 67.02% 2011 4 18
  • 25. 4/18 Web • • • N-Gram unigram bigram 63.57% 66.09% 2011 4 18
  • 26. Web • • • + 56.8% 65.38% 2011 4 18
  • 27. 4/18 Web • • • 67.02% 67.88% 2011 4 18
  • 28. • • HBase/Mahout • • HBase 2011 4 18
  • 29. 2011 4 18