In Search Of...
Ian Barber
@ianbarber
http://phpir.com
ian@ibuildings.com
integrating site search
Friday, 29 October 2010
How Search Works
Integrating Search
Improving Results
Using Search
Search Performance
Questions
[Diagram: Documents -> Analyser -> Index; Queries -> Query Parser -> Index -> Results]
Tokenisation
“With AT&T’s help, the F.B.I Miami-Dade office had recovered
$1.1 million from O’Healy’s Ponzi scheme, 10-15% more than
expected.”
PHP Tokenisation
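function tokenise($string) {
    // Lowercase so that matching is case-insensitive.
    $string = strtolower($string);
    // \w+ matches runs of word characters; each match is array(token, offset).
    preg_match_all('/\w+/', $string,
        $matches, PREG_OFFSET_CAPTURE);
    return $matches[0];
}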
Document Term Pairs
Document ID Term
1 the
1 best
1 of
1 the
... ...
204 and
204 what
204 would
Inverted Index
Term Documents
best 1 (4, 16), 4 (422), 129 (344) ...
what 24 (50, 98), 75 (33, 208) ...
would 99 (32, 599), 201 (344) ..
... ...
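A minimal sketch of how such an inverted index might be built, assuming the tokeniser from the PHP Tokenisation slide. The `buildIndex()` helper and the two sample documents are hypothetical and only there for illustration; positions are byte offsets, as returned by `PREG_OFFSET_CAPTURE`.

```php
<?php
// Tokeniser as on the "PHP Tokenisation" slide: returns array(token, offset) pairs.
function tokenise($string) {
    $string = strtolower($string);
    preg_match_all('/\w+/', $string, $matches, PREG_OFFSET_CAPTURE);
    return $matches[0];
}

// Hypothetical helper: build term => array(docID => array(positions)).
function buildIndex(array $documents) {
    $index = array();
    foreach ($documents as $docID => $text) {
        foreach (tokenise($text) as $token) {
            list($term, $offset) = $token;
            $index[$term][$docID][] = $offset;
        }
    }
    ksort($index); // keep terms in sorted order
    return $index;
}

// Hypothetical sample documents, not taken from the talk.
$documents = array(
    1 => 'The best of the best',
    4 => 'What would be best',
);
$index = buildIndex($documents);
// $index['best'] now holds positions 4 and 16 for document 1, and 14 for document 4.
```

Storing positions alongside document IDs is what later makes phrase and proximity matching possible.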
Boolean Query Merge
Query: Best Western Hotel
Result: Document 298
best 1 4 129 298 305 338
western 4 95 194 204 298 305
hotel 2 40 200 298 355 402
working 4 298 305
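The merge above can be sketched as a pairwise intersection of sorted posting lists. This is an illustrative version, assuming each list holds ascending document IDs; the `intersect()` name is hypothetical.

```php
<?php
// Intersect two posting lists of ascending document IDs in a single pass.
function intersect(array $a, array $b) {
    $result = array();
    $i = 0;
    $j = 0;
    while ($i < count($a) && $j < count($b)) {
        if ($a[$i] === $b[$j]) {
            $result[] = $a[$i]; // present in both lists
            $i++;
            $j++;
        } elseif ($a[$i] < $b[$j]) {
            $i++; // advance whichever list is behind
        } else {
            $j++;
        }
    }
    return $result;
}

// Posting lists from the slide.
$best    = array(1, 4, 129, 298, 305, 338);
$western = array(4, 95, 194, 204, 298, 305);
$hotel   = array(2, 40, 200, 298, 355, 402);

$working = intersect($best, $western);  // the "working" row: 4, 298, 305
$result  = intersect($working, $hotel); // only document 298 remains
```

Because both lists are sorted, each merge touches every posting at most once.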
TF-IDF
function getWeight($docID, $term, $total) {
$tf = count($term[$docID]);
$idf = log($total / count($term), 2);
return $tf * $idf;
}
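A small worked example of the weighting function, assuming `$term` is one term's posting list (document ID to list of positions) and `$total` is the number of documents in the collection. The posting list and the collection size below are hypothetical.

```php
<?php
// getWeight() as on the slide, with comments added.
function getWeight($docID, $term, $total) {
    $tf = count($term[$docID]);           // occurrences of the term in this document
    $idf = log($total / count($term), 2); // rarer terms get a higher weight
    return $tf * $idf;
}

// Hypothetical posting list: the term occurs twice in doc 1 and once in doc 4.
$best = array(1 => array(4, 16), 4 => array(422));
$total = 8; // assumed collection size

$w1 = getWeight(1, $best, $total); // tf = 2, idf = log2(8 / 2) = 2
$w4 = getWeight(4, $best, $total); // tf = 1, idf = 2
```

A term that appears in every document gets an IDF of zero, so it contributes nothing to the score.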
Document Vector
         socket  what  heavy  steel  ...
Doc 1    0.02    0.3   0.001  0      ...
Doc 2    0       0     0      0      ...
Doc 3    0.001   0.2   0      0      ...
Doc 4    0       0     0.002  0.003  ...
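One common way to compare vectors like these is cosine similarity. The sketch below is an illustration under that assumption rather than the method used in the talk; the query vector is hypothetical.

```php
<?php
// Cosine similarity between two equal-length weight vectors.
function cosine(array $a, array $b) {
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $weight) {
        $dot += $weight * $b[$i];
        $normA += $weight * $weight;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // an all-zero vector matches nothing
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Rows from the slide, over the terms socket, what, heavy, steel.
$docs = array(
    1 => array(0.02, 0.3, 0.001, 0),
    2 => array(0, 0, 0, 0),
    3 => array(0.001, 0.2, 0, 0),
    4 => array(0, 0, 0.002, 0.003),
);
$query = array(0, 1, 0, 0); // hypothetical query containing only "what"

$scores = array();
foreach ($docs as $id => $vector) {
    $scores[$id] = cosine($query, $vector);
}
arsort($scores); // docs 3 and 1 rank first; docs 2 and 4 score zero
```

Cosine similarity normalises for vector length, so a long document does not win simply by containing more words.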
Ranked Query Merge
best     23     42     179    246    333    703
weight   0.008  0.002  0.023  0.039  0.014  0.001
western  42     88     120    179    246    798
weight   0.003  0.004  0.023  0.001  0.034  0.004
1 - 246: 0.073
2 - 179: 0.024
3 - 120: 0.023
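The ranking on this slide can be reproduced by summing the per-term weights for each document and sorting. A minimal sketch using the slide's numbers; the array layout is an assumption made for illustration.

```php
<?php
// Posting lists from the slide: docID => weight for each query term.
$postings = array(
    'best' => array(
        23 => 0.008, 42 => 0.002, 179 => 0.023,
        246 => 0.039, 333 => 0.014, 703 => 0.001,
    ),
    'western' => array(
        42 => 0.003, 88 => 0.004, 120 => 0.023,
        179 => 0.001, 246 => 0.034, 798 => 0.004,
    ),
);

// Accumulate a score per document across all query terms.
$scores = array();
foreach ($postings as $term => $docs) {
    foreach ($docs as $id => $weight) {
        $scores[$id] = (isset($scores[$id]) ? $scores[$id] : 0) + $weight;
    }
}

arsort($scores);                         // highest combined weight first
$top = array_slice($scores, 0, 3, true); // keep document IDs as keys
```

Unlike the boolean merge, a document does not need to contain every term to be returned; document 120 ranks third on "western" alone.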
PHP Similarity
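function score($queryString, $index) {
    $matches = array();
    // tokenise() returns array(token, offset) pairs.
    foreach (tokenise($queryString) as $token) {
        $qterm = $token[0];
        if (!isset($index[$qterm])) {
            continue; // term not in the index
        }
        foreach ($index[$qterm] as $id => $posting) {
            if (!isset($matches[$id])) {
                $matches[$id] = 0;
            }
            $matches[$id] += $posting['score'];
        }
    }
    arsort($matches); // sorts in place, highest score first
    return $matches;
}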
Integrating Search
MySQL Full Text Search
CREATE TABLE example (
    id INT(11) NOT NULL auto_increment,
    title VARCHAR(255),
    content TEXT,
    PRIMARY KEY(id),
    FULLTEXT(title,content)
) Engine=MyISAM;
INSERT INTO example (title, content) VALUES
    ('Mikko & Bacon','Mikko loves bacon'),
    ('Marcello & Bacon','Marcello hates bacon'),
    ('Jo & Sausages','Johanna loves sausages'),
    ('Hollywood & Garlic','Lorenzo hates garlic'),
    ('James & Cheddar','James is keen on cheeses');
MySQL FTI Query
SELECT * FROM example WHERE
MATCH(title,content) AGAINST('loves bacon');
+----+------------------+------------------------+
| id | title | content |
+----+------------------+------------------------+
| 1 | Mikko & Bacon | Mikko loves bacon |
| 2 | Marcello & Bacon | Marcello hates bacon |
| 3 | Jo & Sausages | Johanna loves sausages |
+----+------------------+------------------------+
3 rows in set (0.00 sec)
Sphinx
http://www.sphinxsearch.com
Sphinx Configuration
source posts
{
    type = mysql
    sql_host = localhost
    sql_user = user
    sql_pass = password
    sql_db = search
    sql_query = \
        SELECT id, title, content FROM example;
    sql_attr_multi = uint tag from query; \
        SELECT example_id, tag_id FROM tags;
}
index posts
{
source = posts
path = /var/data/sphinx/example
morphology = stem_en
min_word_len = 3
min_prefix_len = 3
min_infix_len = 0
enable_star = 1
}
Stemming
happening -> happen
happened -> happen
happens -> happen
http://tartarus.org/~martin/PorterStemmer
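To show the idea only, here is a deliberately naive suffix stripper. It is not the Porter algorithm linked above, which applies a much larger, ordered set of rules; the `naiveStem()` name and its suffix list are hypothetical.

```php
<?php
// Deliberately naive suffix stripping; NOT the Porter algorithm.
function naiveStem($word) {
    foreach (array('ing', 'ed', 's') as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Only strip when a stem of at least three characters remains.
        if ($stemLength >= 3 && substr($word, $stemLength) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

// The three word forms from the slide all reduce to the same stem.
$stems = array_map('naiveStem', array('happening', 'happened', 'happens'));
```

Applying the same stemmer at index time and at query time is what lets a search for "love" match a document containing "loves".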
Command Line Searching
indexer --config /etc/sphinx.conf --all
search --config /etc/sphinx.conf love bacon
displaying matches:
1. document=1, weight=3, tag=(1,2)
    id=1
    title=Mikko & Bacon
    content=Mikko loves bacon
words:
1. 'love': 2 documents, 2 hits
2. 'bacon': 2 documents, 4 hits
searchd --config /etc/sphinx.conf
Sphinx From PHP
$cl = new SphinxClient();
$cl->SetServer('localhost', 3312);
$cl->SetMatchMode(SPH_MATCH_ANY);
$result = $cl->Query('bac*');
$docIDs = array_keys($result["matches"]);
$cl->SetFilter('tag', array(1));
$result = $cl->Query('bac*');
$docIDs = array_keys($result["matches"]);
Swish-E
http://swish-e.org
pecl install swish-beta
Filesystem Index With Swish-E
IndexDir /var/data/documents
IndexFile fs-swish-e.index
IndexOnly .doc .docx .pdf
FuzzyIndexingMode Stemming_en1
FileFilter .pdf /usr/local/bin/swish_filter.pl
FileFilter .doc /usr/local/bin/swish_filter.pl
fs-swish-e.conf
/usr/local/bin/swish-e -S fs -c fs-swish-e.conf
Crawling Content
IndexDir /usr/local/lib/swish-e/spider.pl
IndexFile www-swish-e.index
SwishProgParameters default http://phpir.com/
FuzzyIndexingMode Stemming_en1
DefaultContents HTML
www-swish-e.conf
/usr/local/bin/swish-e -S prog -c www-swish-e.conf
Swish-E With Multiple Indices
$swish = new Swish(
'www-swish-e.index fs-swish-e.index'
);
$search = $swish->prepare();
$queryStr = 'search string goes here';
$result = $search->execute($queryStr);
$total = $result->hits;
while($r = $result->nextResult()) {
echo $r->swishdocpath; // url
}
Lucene
Build Index
$index = Zend_Search_Lucene::create('idx');
foreach($documents as $title => $content) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(
        Zend_Search_Lucene_Field::Text(
            'title', $title));
    $doc->addField(
        Zend_Search_Lucene_Field::UnStored(
            'content', $content));
    $index->addDocument($doc);
}
Query Zend Search Lucene
$results = $index->find('loves bacon');
foreach($results as $result) {
    echo $result->score, " ";
    echo $result->title, "\n";
}
Output:
0.81656279309067 Mikko and Bacon
0.24800278854758 Marcello & Bacon
Index HTML
$file = file_get_contents($url);
$doc = Zend_Search_Lucene_Document_Html::
    loadHTML($file);
$doc->addField(
    Zend_Search_Lucene_Field::Text(
        'url', $url));
$index->addDocument($doc);
Solr
http://lucene.apache.org/solr/
Solr Search Index
$options = array( 'hostname' => 'localhost',
'port' => 8983 );
$client = new SolrClient($options);
$doc = new SolrInputDocument();
$doc->addField('id', $id);
$doc->addField('cat', $category);
$doc->addField('title', $title);
$doc->addField('text', $text);
$response = $client->addDocument($doc);
$client->commit();
Solr Search Client
$client = new SolrClient($options);
$query = new SolrQuery('bacon');
$response = $client->query($query);
$r = $response->getResponse();
foreach($r['response']['docs'] as $d) {
    echo $d->title[0] . "\n";
}
Xapian
http://xapian.org
Xapian In PHP
$db = new XapianWritableDatabase(
'idx', Xapian::DB_CREATE_OR_OPEN);
$i = new XapianTermGenerator();
$i->set_stemmer(new XapianStem("english"));
$doc = new XapianDocument();
$doc->set_data($content);
$doc->add_value(1, $title);
$i->set_document($doc);
$i->index_text($content);
$db->add_document($doc);
Xapian Search In PHP
$database = new XapianDatabase('idx');
$enquire = new XapianEnquire($database);
$qp = new XapianQueryParser();
$qp->set_stemmer(new XapianStem("english"));
$qp->set_database($database);
$qp->set_stemming_strategy(
XapianQueryParser::STEM_SOME);
$query = $qp->parse_query($queryString);
$enquire->set_query($query);
$matches = $enquire->get_mset(0, 10);
$i = $matches->begin();
while(!$i->equals($matches->end())) {
$n = $i->get_rank() + 1;
$data = $i->get_document()->get_data();
$title = $i->get_document()->get_value(1);
$score = $i->get_percent();
$i->next();
}
Improving Results
Anchor Text
Parse Anchor Text
$p = file_get_contents('http://phpir.com');
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($p);
$links = $dom->getElementsByTagName('a');
foreach($links as $link) {
    $href = $link->getAttribute('href');
    $text = $link->nodeValue;
}
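A self-contained variant that runs without network access, assuming the goal is to collect anchor text per target URL so it can be indexed alongside the target document. The HTML fragment and the `$anchors` structure are hypothetical.

```php
<?php
// Hypothetical page fragment; the slide fetches a live page instead.
$html = '<html><body>'
    . '<a href="/search">Site search</a>'
    . '<a href="http://example.com/lucene">Lucene intro</a>'
    . '<a href="/search">Find things</a>'
    . '</body></html>';

libxml_use_internal_errors(true); // tolerate invalid, real-world HTML
$dom = new DOMDocument();
$dom->loadHTML($html);

// Group the text of every link by the URL it points to.
$anchors = array();
foreach ($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    $anchors[$href][] = trim($link->nodeValue);
}
```

Anchor text describes the target page in other people's words, which is why it is worth indexing with the linked document rather than the linking one.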
Zone Weighting
ZSL Zone Weighting
$doc = new Zend_Search_Lucene_Document();
$tfield = Zend_Search_Lucene_Field::Text
('title', $title);
$tfield->boost = 1.3;
$doc->addField($tfield);
$doc->addField(
Zend_Search_Lucene_Field::UnStored
('content', $content));
$index->addDocument($doc);
Document Authority
Document Weights in ZSL
$doc = new Zend_Search_Lucene_Document();
$doc->addField(
Zend_Search_Lucene_Field::Text
('title', $title));
$doc->addField(
Zend_Search_Lucene_Field::UnStored
('content', $content));
$doc->boost = 1 + ($numComments / 100);
$index->addDocument($doc);
Using Search
Summaries & Highlighting
Sphinx Extract & Highlight
$cl = new SphinxClient();
$cl->SetServer( "localhost", 3312 );
$q = 'bacon';
$r = $cl->Query($q);
$text = array();
foreach ($r["matches"] as $doc => $info) {
    $text[$doc] = getTextFromDB($doc);
}
$extracts = $cl->BuildExcerpts($text, 'posts', $q);
foreach($extracts as $extract) {
    echo $extract;
}
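For comparison, highlighting alone can be sketched in plain PHP. This is not how Sphinx builds excerpts; it is a hypothetical helper that assumes the text is plain, with no markup or entities, and that the query is a single unstemmed word.

```php
<?php
// Wrap whole-word, case-insensitive matches of $term in <b> tags.
// Assumes $text is plain text with no markup or entities.
function highlight($text, $term) {
    $pattern = '/\b(' . preg_quote($term, '/') . ')\b/i';
    return preg_replace($pattern, '<b>$1</b>', $text);
}

$snippet = highlight('Mikko loves bacon, and Bacon loves Mikko', 'bacon');
```

A search engine's own excerpt builder is usually the better choice, since it knows about stemming and can pick the most relevant passage from a long document.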
Xapian Spelling Correction
Indexer
$indexer = new XapianTermGenerator();
$indexer->set_database($database);
$indexer->set_flags(
    XapianTermGenerator::FLAG_SPELLING);
Searcher
$queryString = "strreplace or str_cmp";
$q = new XapianQueryParser();
$q->set_database($database);
$query = $q->parse_query($queryString,
    XapianQueryParser::FLAG_SPELLING_CORRECTION);
echo "Did you mean: " .
    $q->get_corrected_query_string() . "\n";
Spelling Correction Output
php xapsearch.php
Did you mean: str_replace or strcmp
4644 results found for “strreplace or str_cmp”:
1: 2% docid=572
[phpdocs/html/cc.license.html]
2: 2% docid=7169
[phpdocs/html/imagick.constants.html]
3: 2% docid=10086
[phpdocs/html/sqlite3result.fetcharray.html]
4: 2% docid=6132
[phpdocs/html/function.swf-posround.html]
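The general idea behind a "did you mean" suggestion can be sketched with PHP's built-in `levenshtein()` edit distance. This is an illustration only and makes no claim about how Xapian selects its corrections; the `suggest()` helper and the dictionary are hypothetical.

```php
<?php
// Return the dictionary term with the smallest edit distance to $word,
// or $word itself when nothing is within $maxDistance edits.
function suggest($word, array $dictionary, $maxDistance = 2) {
    $best = $word;
    $bestDistance = $maxDistance + 1;
    foreach ($dictionary as $candidate) {
        $distance = levenshtein($word, $candidate);
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return $best;
}

// Hypothetical dictionary; a real one would come from the indexed terms.
$dictionary = array('str_replace', 'strcmp', 'str_repeat', 'substr');
$corrected = array(
    suggest('strreplace', $dictionary),
    suggest('str_cmp', $dictionary),
);
```

Drawing candidates from the index itself means a suggestion is always something that will return results.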
Results Sorting
Sorting in ZSL
$q = Zend_Search_Lucene_Search_QueryParser::
parse('search string');
$results = $index->find($q, 'title');
foreach($results as $result) {
echo '<h3>', $result->title, "</h3>\n";
$doc = getDocumentFromDB($result->did);
echo
$q->htmlFragmentHighlightMatches($doc);
}
Faceted Search
Faceted Search In Solr
$client = new SolrClient($options);
$query = new SolrQuery('bacon');
// Enable faceting before the query is sent.
$query->setFacet(true);
$query->addFacetField('cat');
$response = $client->query($query);
$r = $response->getResponse();
$f = $r['facet_counts']['facet_fields'];
foreach($f['cat'] as $facet => $count) {
    echo $facet . " " . $count . "\n";
}
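What a facet count represents can be shown in plain PHP: for the documents matching a query, count how many fall into each value of a field. The result set below is hypothetical; only the `cat` field name comes from the slide.

```php
<?php
// Hypothetical matching documents; 'cat' mirrors the field used on the slide.
$results = array(
    array('id' => 1, 'cat' => 'meat'),
    array('id' => 2, 'cat' => 'meat'),
    array('id' => 3, 'cat' => 'dairy'),
);

// Count how many matching documents fall into each category.
$facets = array_count_values(array_column($results, 'cat'));
arsort($facets); // most common category first
```

A search server computes these counts over the whole result set inside the engine, which is far cheaper than fetching every match into PHP.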
More Like This
More Like This
$rset = new XapianRset();
$rset->add_document(5959); // str_replace
$e = $enquire->get_eset(40, $rset);
$t = $e->begin();
for($t; !$t->equals($e->end()); $t->next()){
$qs[] = new XapianQuery($t->get_term(),
intval($t->get_weight()));
}
$query = new XapianQuery(
XapianQuery::OP_OR, $qs);
More Like This Example
php xapsim.php
1656 results found:
1: 100% docid=5959
[phpdocs/html/function.str-replace.html]
2: 47% docid=5956
[phpdocs/html/function.str-ireplace.html]
3: 24% docid=5328
[phpdocs/html/function.preg-replace.html]
4: 18% docid=5958
[phpdocs/html/function.str-repeat.html]
Search Performance
Index Updates
[Diagram: existing Docs are indexed into Main and New docs into a small Delta; a Query searches Delta and Main together; Delta is later merged into Main]
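One reading of the main and delta arrangement, sketched with plain arrays: new documents go into a small delta index, queries consult both, and the delta is periodically folded into the main index. The data layout and the `lookup()` and `mergeDelta()` helpers are hypothetical.

```php
<?php
// Each index maps term => array(docID => score); both are hypothetical.
$main  = array('bacon' => array(1 => 0.4, 2 => 0.1));
$delta = array('bacon' => array(7 => 0.3), 'garlic' => array(7 => 0.2));

// Query both indices and combine the postings for one term.
function lookup($term, array $main, array $delta) {
    $fromMain  = isset($main[$term]) ? $main[$term] : array();
    $fromDelta = isset($delta[$term]) ? $delta[$term] : array();
    $postings = $fromDelta + $fromMain; // delta wins for re-indexed documents
    arsort($postings);
    return $postings;
}

// Fold the delta into the main index; the caller then starts a fresh delta.
function mergeDelta(array $main, array $delta) {
    foreach ($delta as $term => $postings) {
        $existing = isset($main[$term]) ? $main[$term] : array();
        $main[$term] = $postings + $existing;
    }
    return $main;
}

$hits = lookup('bacon', $main, $delta); // documents 1, 7 and 2, best first
$main = mergeDelta($main, $delta);
$delta = array();
```

The point of the split is that rebuilding a small delta is cheap, so new content becomes searchable quickly without re-indexing everything.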
Search Speed
Zend Search Lucene
$index = Zend_Search_Lucene::open('index');
$index->optimize();
Sphinx
indexer --merge main delta --rotate
Solr
$client = new SolrClient($options);
$client->optimize();
Xapian
xapian-compact xapindex xapindex2
Distributing Search
[Diagram: the Application queries several Index nodes, each built from a share of the Documents]
Large Scale Search
http://www.nutch.org
http://hadoop.apache.org
Image Credits
Title http://www.flickr.com/photos/generated/2084287794/
What Do You Want http://www.flickr.com/photos/the_justified_sinner/2498066986/
You Are Here http://www.flickr.com/photos/alecvuijlsteke/2692475420/
Integrating Search http://www.flickr.com/photos/squeaks2569/3700355684/
Sphinx http://www.flickr.com/photos/generated/2084287794/
Lucene http://www.flickr.com/photos/mypanda/7731447/
Swish-e http://www.flickr.com/photos/ryan_fung/2239687100/
Solr http://www.flickr.com/photos/m-j-s/2724756177/
Xapian http://www.flickr.com/photos/olibac/3522056495/
Using Search http://www.flickr.com/photos/eneas/175027945/
Improving Search http://www.flickr.com/photos/x-ray_delta_one/3928200642/
Search Performance http://www.flickr.com/photos/maisonbisson/1634408/
Large Scale Search http://www.flickr.com/photos/zedzap/3663508847/
Questions?
Thank You!
Ian Barber
@ianbarber
http://phpir.com
ian@ibuildings.com
Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Friday, 29 October 2010
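  • 11. TF-IDF
      function getWeight($docID, $term, $total) {
          // $term: posting list for one term (docID => positions), as in the inverted index
          $tf = count($term[$docID]);            // term frequency in this document
          $idf = log($total / count($term), 2);  // inverse document frequency, log base 2
          return $tf * $idf;
      }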
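A small worked example may make the weighting easier to follow. This is only a sketch: the toy index, the collection size of 8 documents and the parameter name `$postings` are assumptions made for illustration, not data from the talk.

```php
<?php
// Hypothetical toy index: term => [docID => [positions]].
$index = [
    'bacon' => [1 => [3], 2 => [3, 9]], // "bacon" occurs in 2 documents
    'loves' => [1 => [2]],              // "loves" occurs in 1 document
];
$totalDocs = 8; // assumed size of the whole collection

function getWeight($docID, array $postings, $total) {
    $tf = count($postings[$docID]);            // occurrences in this document
    $idf = log($total / count($postings), 2);  // rarer terms get a larger idf
    return $tf * $idf;
}

// "bacon" in doc 2: tf = 2, idf = log2(8 / 2) = 2
$baconDoc2 = getWeight(2, $index['bacon'], $totalDocs);
// "loves" in doc 1: tf = 1, idf = log2(8 / 1) = 3
$lovesDoc1 = getWeight(1, $index['loves'], $totalDocs);

printf("bacon/doc2: %.2f, loves/doc1: %.2f\n", $baconDoc2, $lovesDoc1);
```

The rarer term gets the higher idf, but the repeated term still wins here because its term frequency is doubled.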
  • 12. Document Vector
               socket   what   heavy   steel   ...
      Doc 1    0.02     0.3    0.001   0       ...
      Doc 2    0        0      0       0       ...
      Doc 3    0.001    0.2    0       0       ...
      Doc 4    0        0      0.002   0.003   ...
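One way to read the table is that each row is a vector of term weights, and a query can be scored against a row with a dot product. The sketch below uses the four documents from the slide. The 0/1 query vector and the helper names `queryVector` and `dot` are assumptions for illustration; the slides do not say which similarity measure is intended.

```php
<?php
// Term order and document rows as shown on the slide.
$terms = ['socket', 'what', 'heavy', 'steel'];
$vectors = [
    'Doc 1' => [0.02, 0.3, 0.001, 0],
    'Doc 2' => [0, 0, 0, 0],
    'Doc 3' => [0.001, 0.2, 0, 0],
    'Doc 4' => [0, 0, 0.002, 0.003],
];

// Hypothetical helper: 1 where the query contains the term, else 0.
function queryVector(array $terms, array $queryTerms) {
    return array_map(function ($term) use ($queryTerms) {
        return in_array($term, $queryTerms, true) ? 1 : 0;
    }, $terms);
}

// Hypothetical helper: dot product of two equal-length vectors.
function dot(array $a, array $b) {
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += $value * $b[$i];
    }
    return $sum;
}

$q = queryVector($terms, ['what', 'heavy']);
$scores = [];
foreach ($vectors as $name => $vector) {
    $scores[$name] = dot($q, $vector);
}
arsort($scores); // highest score first, keys preserved

foreach ($scores as $name => $score) {
    printf("%s: %.3f\n", $name, $score);
}
```

With a 0/1 query vector the dot product reduces to summing the weights of the query terms in each document.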
  • 13. Ranked Query Merge
      best      23     42     179    246    333    703
      weight    0.008  0.002  0.023  0.039  0.014  0.001
      western   42     88     120    179    246    798
      weight    0.003  0.004  0.023  0.001  0.034  0.004
      1 - 246: 0.073
      2 - 179: 0.024
      3 - 120: 0.023
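The ranking on this slide can be reproduced by summing, per document, the weights of every query term that matches it. The sketch below uses the posting lists and weights from the slide. The function name `rankedMerge` and the array layout are assumptions for illustration.

```php
<?php
// Posting lists from the slide: term => [docID => weight].
$postings = [
    'best'    => [23 => 0.008, 42 => 0.002, 179 => 0.023,
                  246 => 0.039, 333 => 0.014, 703 => 0.001],
    'western' => [42 => 0.003, 88 => 0.004, 120 => 0.023,
                  179 => 0.001, 246 => 0.034, 798 => 0.004],
];

// Hypothetical helper: accumulate weights per document, return the top $limit.
function rankedMerge(array $postings, array $queryTerms, $limit = 3) {
    $scores = [];
    foreach ($queryTerms as $term) {
        if (!isset($postings[$term])) {
            continue; // term does not occur in the index
        }
        foreach ($postings[$term] as $docID => $weight) {
            if (!isset($scores[$docID])) {
                $scores[$docID] = 0.0;
            }
            $scores[$docID] += $weight;
        }
    }
    arsort($scores);                              // best score first
    return array_slice($scores, 0, $limit, true); // keep docIDs as keys
}

$top = rankedMerge($postings, ['best', 'western']);
foreach ($top as $docID => $score) {
    printf("%d: %.3f\n", $docID, $score);
}
```

Unlike the boolean merge, a document does not need to contain every query term: document 120 matches only "western" and still ranks third.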
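  • 14. PHP Similarity
      function score($queryString, $index) {
          $matches = array();
          // tokenise() (slide 6) returns array(token, offset) pairs
          foreach (tokenise($queryString) as $token) {
              $qterm = $token[0];
              if (!isset($index[$qterm])) {
                  continue; // term not in the index
              }
              foreach ($index[$qterm] as $id => $posting) {
                  if (!isset($matches[$id])) {
                      $matches[$id] = 0;
                  }
                  $matches[$id] += $posting['score'];
              }
          }
          arsort($matches); // sorts in place and returns a bool
          return $matches;
      }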
  • 16. MySQL Full Text Search
      CREATE TABLE example (
          id INT(11) NOT NULL auto_increment,
          title VARCHAR(255),
          content TEXT,
          PRIMARY KEY(id),
          FULLTEXT(title,content)
      ) Engine=MyISAM;
      INSERT INTO example (title, content) VALUES
          ('Mikko & Bacon','Mikko loves bacon'),
          ('Marcello & Bacon','Marcello hates bacon'),
          ('Jo & Sausages','Johanna loves sausages'),
          ('Hollywood & Garlic','Lorenzo hates garlic'),
          ('James & Cheddar','James is keen on cheeses');
  • 17. MySQL FTI Query
      SELECT * FROM example
      WHERE MATCH(title,content) AGAINST('loves bacon');
      +----+------------------+------------------------+
      | id | title            | content                |
      +----+------------------+------------------------+
      | 1  | Mikko & Bacon    | Mikko loves bacon      |
      | 2  | Marcello & Bacon | Marcello hates bacon   |
      | 3  | Jo & Sausages    | Johanna loves sausages |
      +----+------------------+------------------------+
      3 rows in set (0.00 sec)
  • 19. Sphinx Configuration
      source posts {
          type = mysql
          sql_host = localhost
          sql_user = user
          sql_pass = password
          sql_db = search
          sql_query = SELECT id, title, content FROM example;
          sql_attr_multi = uint tag from query; SELECT example_id, tag_id FROM tags;
      }
  • 20. index posts {
          source = posts
          path = /var/data/sphinx/example
          morphology = stem_en
          min_word_len = 3
          min_prefix_len = 3
          min_infix_len = 0
          enable_star = 1
      }
  • 22. Command Line Searching
      indexer --config /etc/sphinx.conf --all
      search --config /etc/sphinx.conf love bacon
      displaying matches:
      1. document=1, weight=3, tag=(1,2)
          id=1
          title=Mikko & Bacon
          content=Mikko loves bacon
      words:
      1. 'love': 2 documents, 2 hits
      2. 'bacon': 2 documents, 4 hits
      searchd --config /etc/sphinx.conf
  • 23. Sphinx From PHP
      $cl = new SphinxClient();
      $cl->SetServer('localhost', 3312);
      $cl->SetMatchMode(SPH_MATCH_ANY);
      $result = $cl->Query('bac*');
      $docIDs = array_keys($result["matches"]);
      $cl->SetFilter('tag', array(1));
      $result = $cl->Query('bac*');
      $docIDs = array_keys($result["matches"]);
  • 25. Filesystem Index With Swish-E
      fs-swish-e.conf:
      IndexDir /var/data/documents
      IndexFile fs-swish-e.index
      IndexOnly .doc .docx .pdf
      FuzzyIndexingMode Stemming_en1
      FileFilter .pdf /usr/local/bin/swish_filter.pl
      FileFilter .doc /usr/local/bin/swish_filter.pl
      /usr/local/bin/swish-e -S fs -c fs-swish-e.conf
  • 26. Crawling Content
      www-swish-e.conf:
      IndexDir /usr/local/lib/swish-e/spider.pl
      IndexFile www-swish-e.index
      SwishProgParameters default http://phpir.com/
      FuzzyIndexingMode Stemming_en1
      DefaultContents HTML
      /usr/local/bin/swish-e -S prog -c www-swish-e.conf
  • 27. Swish-E With Multiple Indices
      $swish = new Swish('www-swish-e.index fs-swish-e.index');
      $search = $swish->prepare();
      $queryStr = 'search string goes here';
      $result = $search->execute($queryStr);
      $total = $result->hits;
      while($r = $result->nextResult()) {
          echo $r->swishdocpath; // url
      }
  • 29. Build Index
      $index = Zend_Search_Lucene::create('idx');
      foreach($documents as $title => $content) {
          $doc = new Zend_Search_Lucene_Document();
          $doc->addField(
              Zend_Search_Lucene_Field::Text('title', $title));
          $doc->addField(
              Zend_Search_Lucene_Field::UnStored('content', $content));
          $index->addDocument($doc);
      }
  • 30. Query Zend Search Lucene
      $results = $index->find('loves bacon');
      foreach($results as $result) {
          echo $result->score, " ";
          echo $result->title, "\n";
      }
      Output:
      0.81656279309067 Mikko and Bacon
      0.24800278854758 Marcello & Bacon
  • 31. Index HTML
      $file = file_get_contents($url);
      $doc = Zend_Search_Lucene_Document_Html::loadHTML($file);
      $doc->addField(
          Zend_Search_Lucene_Field::Text('url', $url));
      $index->addDocument($doc);
  • 33. Solr Search Index
      $options = array(
          'hostname' => 'localhost',
          'port' => 8983
      );
      $client = new SolrClient($options);
      $doc = new SolrInputDocument();
      $doc->addField('id', $id);
      $doc->addField('cat', $category);
      $doc->addField('title', $title);
      $doc->addField('text', $text);
      $response = $client->addDocument($doc);
      $client->commit();
  • 34. Solr Search Client
      $client = new SolrClient($options);
      $query = new SolrQuery('bacon');
      $response = $client->query($query);
      $r = $response->getResponse();
      foreach($r['response']['docs'] as $d) {
          echo $d->title[0] . "\n";
      }
  • 36. Xapian In PHP
      $db = new XapianWritableDatabase(
          'idx', Xapian::DB_CREATE_OR_OPEN);
      $i = new XapianTermGenerator();
      $i->set_stemmer(new XapianStem("english"));
      $doc = new XapianDocument();
      $doc->set_data($content);
      $doc->add_value(1, $title);
      $i->set_document($doc);
      $i->index_text($content);
      $db->add_document($doc);
  • 37. Xapian Search In PHP
      $database = new XapianDatabase('idx');
      $enquire = new XapianEnquire($database);
      $qp = new XapianQueryParser();
      $qp->set_stemmer(new XapianStem("english"));
      $qp->set_database($database);
      $qp->set_stemming_strategy(
          XapianQueryParser::STEM_SOME);
      $query = $qp->parse_query($queryString);
      $enquire->set_query($query);
  • 38. $matches = $enquire->get_mset(0, 10);
      $i = $matches->begin();
      while(!$i->equals($matches->end())) {
          $n = $i->get_rank() + 1;
          $data = $i->get_document()->get_data();
          $title = $i->get_document()->get_value(1);
          $score = $i->get_percent();
          $i->next();
      }
  • 41. Parse Anchor Text
      $p = file_get_contents('http://phpir.com');
      libxml_use_internal_errors(true);
      // loadHTML() is an instance method; calling it statically is deprecated
      $dom = new DOMDocument();
      $dom->loadHTML($p);
      $links = $dom->getElementsByTagName('a');
      foreach($links as $link) {
          $href = $link->getAttribute('href');
          $text = $link->nodeValue;
      }
  • 43. ZSL Zone Weighting
      $doc = new Zend_Search_Lucene_Document();
      $tfield = Zend_Search_Lucene_Field::Text('title', $title);
      $tfield->boost = 1.3;
      $doc->addField($tfield);
      $doc->addField(
          Zend_Search_Lucene_Field::UnStored('content', $content));
      $index->addDocument($doc);
  • 45. Document Weights in ZSL
      $doc = new Zend_Search_Lucene_Document();
      $doc->addField(
          Zend_Search_Lucene_Field::Text('title', $title));
      $doc->addField(
          Zend_Search_Lucene_Field::UnStored('content', $content));
      $doc->boost = 1 + ($numComments / 100);
      $index->addDocument($doc);
  • 48. Sphinx Extract & Highlight
      $cl = new SphinxClient();
      $cl->SetServer("localhost", 3312);
      $q = 'bacon';
      $r = $cl->Query($q);
      foreach ($r["matches"] as $doc => $info) {
          $text[$doc] = getTextFromDB($doc);
      }
      $extracts = $cl->BuildExcerpts($text, 'posts', $q);
      foreach($extracts as $extract) {
          echo $extract;
      }
  • 50. Xapian Spelling Correction
      Indexer:
      $indexer = new XapianTermGenerator();
      $indexer->set_database($database);
      $indexer->set_flags(
          XapianTermGenerator::FLAG_SPELLING);
      Searcher:
      $queryString = "strreplace or str_cmp";
      $q = new XapianQueryParser();
      $q->set_database($database);
      $query = $q->parse_query($queryString,
          XapianQueryParser::FLAG_SPELLING_CORRECTION);
      echo "Did you mean: " .
          $q->get_corrected_query_string() . "\n";
  • 51. Spelling Correction Output
      php xapsearch.php
      Did you mean: str_replace or strcmp
      4644 results found for “strreplace or str_cmp”:
      1: 2% docid=572 [phpdocs/html/cc.license.html]
      2: 2% docid=7169 [phpdocs/html/imagick.constants.html]
      3: 2% docid=10086 [phpdocs/html/sqlite3result.fetcharray.html]
      4: 2% docid=6132 [phpdocs/html/function.swf-posround.html]
  • 53. Sorting in ZSL
      $q = Zend_Search_Lucene_Search_QueryParser::parse('search string');
      $results = $index->find($q, 'title'); // sort results by the title field
      foreach($results as $result) {
          echo '<h3>', $result->title, "</h3>\n";
          $doc = getDocumentFromDB($result->did);
          echo $q->htmlFragmentHighlightMatches($doc);
      }
  • 55. Faceted Search In Solr
      $client = new SolrClient($options);
      $query = new SolrQuery('bacon');
      $query->setFacet(true);
      $query->addFacetField('cat');
      // Run the query only after faceting has been enabled on it
      $response = $client->query($query);
      $r = $response->getResponse();
      $f = $r['facet_counts']['facet_fields'];
      foreach($f['cat'] as $facet => $count) {
          echo $facet . " " . $count . "\n";
      }
  • 56. More Like This
  • 57. More Like This
      $rset = new XapianRset();
      $rset->add_document(5959); // str_replace
      $e = $enquire->get_eset(40, $rset);
      $t = $e->begin();
      for($t; !$t->equals($e->end()); $t->next()) {
          $qs[] = new XapianQuery($t->get_term(),
              intval($t->get_weight()));
      }
      $query = new XapianQuery(XapianQuery::OP_OR, $qs);
  • 58. More Like This Example
      php xapsim.php
      1656 results found:
      1: 100% docid=5959 [phpdocs/html/function.str-replace.html]
      2: 47% docid=5956 [phpdocs/html/function.str-ireplace.html]
      3: 24% docid=5328 [phpdocs/html/function.preg-replace.html]
      4: 18% docid=5958 [phpdocs/html/function.str-repeat.html]
  • 60. Index Updates [diagram; labels: Docs, Main, New, Delta, Query]
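The labels on this slide (Docs, Main, Delta, Query) point to a main-plus-delta update scheme: new documents go into a small delta index, queries consult both indices, and the delta is periodically folded into the main index. The sketch below is an in-memory illustration under that assumption. The class `MainDeltaIndex` and its methods are hypothetical and do not belong to any of the libraries in the talk.

```php
<?php
// Hypothetical in-memory illustration of a main + delta index.
class MainDeltaIndex
{
    private $main = [];  // term => [docIDs], large and rarely rebuilt
    private $delta = []; // term => [docIDs], small and cheap to update

    // New documents are only ever written to the delta index.
    public function add($docID, $text)
    {
        $words = array_unique(str_word_count(strtolower($text), 1));
        foreach ($words as $term) {
            $this->delta[$term][] = $docID;
        }
    }

    // Queries search both indices and combine the results.
    public function search($term)
    {
        $term = strtolower($term);
        $main = isset($this->main[$term]) ? $this->main[$term] : [];
        $delta = isset($this->delta[$term]) ? $this->delta[$term] : [];
        $ids = array_values(array_unique(array_merge($main, $delta)));
        sort($ids);
        return $ids;
    }

    // Periodically fold the delta into the main index and empty it.
    public function merge()
    {
        foreach ($this->delta as $term => $ids) {
            $existing = isset($this->main[$term]) ? $this->main[$term] : [];
            $this->main[$term] = array_values(
                array_unique(array_merge($existing, $ids))
            );
        }
        $this->delta = [];
    }

    // Number of distinct terms currently held in the delta index.
    public function deltaSize()
    {
        return count($this->delta);
    }
}

$idx = new MainDeltaIndex();
$idx->add(1, 'Mikko loves bacon');
$idx->merge();                        // document 1 now lives in main
$idx->add(2, 'Marcello hates bacon'); // document 2 lives in delta
$beforeMerge = $idx->search('bacon'); // found in both indices
$deltaTerms = $idx->deltaSize();
$idx->merge();
$afterMerge = $idx->search('bacon');
```

Results are the same before and after the merge. The merge only changes where the postings are stored, which keeps frequent updates cheap.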
  • 61. Search Speed
      Zend Search Lucene:
      $index = Zend_Search_Lucene::open('index');
      $index->optimize();
      Sphinx:
      indexer --merge main delta --rotate
      Solr:
      $client = new SolrClient($options);
      $client->optimize();
      Xapian:
      xapian-compact xapindex xapindex2
  • 64. Image Credits
      Title: http://www.flickr.com/photos/generated/2084287794/
      What Do You Want: http://www.flickr.com/photos/the_justified_sinner/2498066986/
      You Are Here: http://www.flickr.com/photos/alecvuijlsteke/2692475420/
      Integrating Search: http://www.flickr.com/photos/squeaks2569/3700355684/
      Sphinx: http://www.flickr.com/photos/generated/2084287794/
      Lucene: http://www.flickr.com/photos/mypanda/7731447/
      Swish-e: http://www.flickr.com/photos/ryan_fung/2239687100/
      Solr: http://www.flickr.com/photos/m-j-s/2724756177/
      Xapian: http://www.flickr.com/photos/olibac/3522056495/
      Using Search: http://www.flickr.com/photos/eneas/175027945/
      Improving Search: http://www.flickr.com/photos/x-ray_delta_one/3928200642/
      Search Performance: http://www.flickr.com/photos/maisonbisson/1634408/
      Large Scale Search: http://www.flickr.com/photos/zedzap/3663508847/