MongoDB Full Text Search With Sphinx

This document discusses using MongoDB and Sphinx for full-text search of OpenCourseWare materials. MongoDB stores the course materials as documents, and xmlpipe2 feeds those documents into Sphinx for indexing. It also covers two pitfalls, document IDs and UTF-8 encoding, and the fixes used for the latter: UTF-8 settings in the Sphinx configuration and a FixEncoding function.

Uploaded by pierrefar
Copyright: © Attribution Non-Commercial (BY-NC)

www.ocwsearch.com

MongoDB Full Text Search with Sphinx
Pierre Far, PhD
Twitter: @ocwsearch
Web: www.ocwsearch.com
Email: [email protected]
About

A search engine for the full text of OpenCourseWare course materials.
2600+ courses, 10 universities, 11 OCW collections
Courses in English, Japanese, Spanish, Dutch
Why MongoDB?

Very helpful community

Document DB

Schemaless
Technology Stack

[Diagram: the website (HTML) and API (JSON) query the search index. Adaptor scripts connect MongoDB to the rest of the stack: xmlpipe2 feeds the index, and mongos3 copies documents to Amazon S3.]

xmlpipe2

An XML document input format for Sphinx

Accepts any XML source, so...

Read courses from MongoDB and stream them as XML

sphinxsearch.com/wiki/doku.php?id=sphinx_xmlpipe2_tutorial
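
The streaming step can be sketched roughly as follows. This is a minimal illustration, not the adaptor script used on the site: the function name `coursesToXmlpipe2` and the record keys `sphinx_id`, `title`, and `body` are assumptions, and the MongoDB query is replaced by a plain iterable so the sketch stays self-contained. It uses PHP's bundled XMLWriter extension.

```php
<?php
// Sketch only: emit an xmlpipe2 document set from an iterable of course
// records. The keys sphinx_id, title, and body are hypothetical.
function coursesToXmlpipe2(iterable $courses): string
{
    $xml = new XMLWriter();
    $xml->openMemory();
    $xml->setIndent(true);
    $xml->startDocument('1.0', 'UTF-8');
    $xml->startElement('sphinx:docset');

    // Declare the full-text fields once, before any documents.
    $xml->startElement('sphinx:schema');
    foreach (['title', 'body'] as $field) {
        $xml->startElement('sphinx:field');
        $xml->writeAttribute('name', $field);
        $xml->endElement(); // sphinx:field
    }
    $xml->endElement(); // sphinx:schema

    // One sphinx:document per course; writeElement() escapes the text.
    foreach ($courses as $course) {
        $xml->startElement('sphinx:document');
        $xml->writeAttribute('id', (string) $course['sphinx_id']);
        $xml->writeElement('title', $course['title']);
        $xml->writeElement('body', $course['body']);
        $xml->endElement(); // sphinx:document
    }

    $xml->endElement(); // sphinx:docset
    $xml->endDocument();
    return $xml->outputMemory();
}
```

For a large collection, building the whole string in memory is probably not what you want; opening the writer on `php://output` and flushing every few documents would keep memory use flat.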
Pitfall 1: Document ID

ALL DOCUMENT IDS MUST BE UNIQUE, UNSIGNED, NON-ZERO INTEGER NUMBERS

Generate a unique 10-digit numeric ID for each course.

The ID must be deterministic.
Put a unique index on the ID field.
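
One way to get such an ID is to hash a stable key for the course. The slides do not say how the IDs were actually generated, so the sketch below is an assumption throughout: the function name `courseSphinxId`, the choice of the course URL as the key, and the use of CRC32 are all illustrative. It assumes 64-bit PHP and keeps the result within an unsigned 32-bit range.

```php
<?php
// Sketch only: derive a deterministic, non-zero, 10-digit ID from a stable
// key such as the course URL. Hypothetical helper, not the author's method.
function courseSphinxId(string $stableKey): int
{
    // crc32() returns 0..4294967295 on 64-bit PHP.
    $hash = crc32($stableKey);

    // Map into 1000000000..4294967295: always 10 digits, never zero,
    // and still within an unsigned 32-bit integer.
    return 1000000000 + ($hash % 3294967296);
}
```

A hash like this is deterministic but not collision-free, which is where the unique index earns its keep: a colliding insert fails loudly instead of silently overwriting another course in the search index.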
Pitfall 2: UTF-8

Fatal error: Uncaught exception 'MongoException' with message 'non-utf8 string

Encoding: it's a lie.

mb_detect_encoding() is unreliable.

2-part solution
1. $HTML = @mb_convert_encoding($HTML, 'HTML-ENTITIES', 'utf-8');
2. $Text = FixEncoding($Text);
FixEncoding()

A set of real encoding detection functions


http://lachy.id.au/dev/2005/11/encoding-functions-source

FixEncoding() is a wrapper for these functions
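
The source of FixEncoding() is not shown in the slides. As a rough illustration of the general idea (validate the bytes instead of trusting a declared or detected encoding), here is a much simpler stand-in. The name `ensureUtf8` and the Windows-1252 fallback are assumptions, and the mbstring extension is required.

```php
<?php
// Sketch only: a simplified stand-in, NOT the FixEncoding() from the slides.
// It validates UTF-8 and falls back to an assumed legacy encoding.
function ensureUtf8(string $text, string $fallback = 'Windows-1252'): string
{
    // Valid UTF-8 passes through untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Otherwise treat the bytes as the fallback encoding and convert.
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}
```

Running every string through a check like this before saving is one way to avoid the 'non-utf8 string' exception, at the cost of guessing wrong when the text is in some other legacy encoding.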


UTF-8 in Sphinx

In sphinx.conf:
charset_type = utf-8
ngram_chars
charset_table

sphinxsearch.com/wiki/doku.php?id=charset_tables
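
The slide names the three directives but not their values. A fragment along these lines shows how they fit together; the index name `courses` and every value are illustrative assumptions, not the site's configuration. `ngram_len` does not appear on the slide and is included because `ngram_chars` only takes effect when n-gram indexing is switched on.

```
index courses
{
    # source, path, and other settings omitted

    charset_type  = utf-8

    # Index CJK text (for example the Japanese courses) one character at a time.
    ngram_len     = 1
    ngram_chars   = U+3000..U+2FA1F

    # Minimal table: digits, underscore, and case-folded ASCII letters.
    # Accented and non-Latin ranges would have to be added from the
    # charset tables page linked above.
    charset_table = 0..9, A..Z->a..z, _, a..z
}
```

`charset_type` belongs to the Sphinx releases of that era; later releases deprecated it, so check the documentation for the version you run.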
mongos3

MongoDB document = S3 object

Backup tool for MongoDB

$Contents = gzencode(json_encode($Course), 9);
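
The line above serialises one course to gzip-compressed JSON. A round trip can be sketched as below; the helper names `packCourse` and `unpackCourse` are hypothetical, and the actual upload to S3 is left out.

```php
<?php
// Sketch only: pack a course document the way the slide shows, and
// unpack it again when restoring. Helper names are hypothetical.
function packCourse(array $course): string
{
    // Maximum gzip compression (level 9) of the JSON form.
    return gzencode(json_encode($course), 9);
}

function unpackCourse(string $blob): array
{
    // Reverse the two steps: gunzip, then decode JSON to an array.
    return json_decode(gzdecode($blob), true);
}
```

Because each MongoDB document becomes one S3 object, a restore can pull back individual courses without touching the rest of the backup.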



Thanks!
Any questions?
Twitter: @ocwsearch
Web: www.ocwsearch.com
Email: [email protected]