HOW TO BUILD A SEARCH ENGINE
Let's imagine the web as a hypothetical database. You can think of the web as a two-column table:
URL and page contents. The URL column holds the address of a page, and the page contents column
holds its contents. URL is the primary key. You type the URL into your browser, the browser looks up
the row by its primary key and shows you the page contents. Good, right?
Well, not good for search! Why? Because if you are searching for pizza, you will have to go through
all the records and scan the page contents column of every row for the word pizza. Bad, no? That is
an O(n) scan, and when you are talking about the web, that n is going to get big.
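To make the cost concrete, here is a minimal sketch of that naive lookup, assuming the whole "web" fits in a Python dict. The `web` dict, its three tiny pages and the `naive_search` name are made up for illustration.

```python
# Hypothetical "web as a table": URL (primary key) -> page contents.
web = {
    "http://yummypizza.com": "Italian Pizza",
    "http://yummierpizza.com": "Sicilian Pizza",
    "http://sexyshoes.com": "Italian Shoes",
}

def naive_search(web, word):
    """Scan the contents of every row for the word: O(n) in the number of pages."""
    word = word.lower()
    return [url for url, contents in web.items()
            if word in contents.lower().split()]

print(naive_search(web, "pizza"))
```

Every search touches every row, which is exactly the problem.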
So, what do you do? You flip this table around. Make the page content the key and the URL the
value. Not only do you do this, you split the page content into individual terms, and create a record
for each term.
So, let's say you have 3 very small web pages that say:
1. http://yummypizza.com - Italian Pizza
2. http://yummierpizza.com - Sicilian Pizza
3. http://sexyshoes.com - Italian Shoes
Our table now looks like this:
Italian - http://yummypizza.com, http://sexyshoes.com
Pizza - http://yummypizza.com, http://yummierpizza.com
Sicilian - http://yummierpizza.com
Shoes - http://sexyshoes.com
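This flipped table is usually called an inverted index. A minimal sketch of building one, assuming terms are just lowercased, whitespace-separated words (the `pages` dict and `build_index` name are illustrative, not from any library):

```python
from collections import defaultdict

# The three tiny pages from the example: URL -> page contents.
pages = {
    "http://yummypizza.com": "Italian Pizza",
    "http://yummierpizza.com": "Sicilian Pizza",
    "http://sexyshoes.com": "Italian Shoes",
}

def build_index(pages):
    """Flip the table: each term becomes a key, the set of URLs containing it the value."""
    index = defaultdict(set)
    for url, contents in pages.items():
        for term in contents.lower().split():
            index[term].add(url)
    return index

index = build_index(pages)
print(sorted(index["pizza"]))
# ['http://yummierpizza.com', 'http://yummypizza.com']
```

Looking up a term is now a single dictionary access, no matter how many pages were indexed.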
Now, when someone searches for Pizza, you simply look up the record for Pizza and you get both the
web pages. Fastest search engine in the world, right? All it needs to do is look up one record. There,
you beat Google!
But wait! What if someone searches for Italian Pizza? Ooooh. What you can do is find the URLs that
are common between the Italian entry and the Pizza entry. Mathematically speaking, you are
performing an intersection of 2 sets. So, you need some sort of algorithm that can do a fast
intersection across large data sets.
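One common way to do this, sketched here under the assumption that each record keeps its URLs in sorted order, is to walk both lists with two pointers. The function name `intersect` is just illustrative.

```python
def intersect(a, b):
    """Intersect two sorted lists in O(len(a) + len(b)) with a two-pointer walk."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, skip it
        else:
            j += 1  # b[j] cannot appear later in a, skip it
    return common

italian = sorted(["http://yummypizza.com", "http://sexyshoes.com"])
pizza = sorted(["http://yummypizza.com", "http://yummierpizza.com"])
print(intersect(italian, pizza))  # ['http://yummypizza.com']
```

For small data a plain `set(a) & set(b)` does the same job. The sorted walk matters when the lists are too big to rebuild as sets on every query.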
Now you've got it. Buy a few thousand servers and beat Google at its own game. Yeah!
But, wait! What if someone searches for Italy Pizza? You want Italian pizza to show up, right? So,
you need to expand your terms using synonyms. Essentially, you need a dictionary of synonyms (words
grouped together by meaning).
Then, when you insert a URL under a term, you also insert it under each of that term's synonyms. So,
your table looks like this:
Italy - http://yummypizza.com, http://sexyshoes.com
Italian - http://yummypizza.com, http://sexyshoes.com
Pizza - http://yummypizza.com, http://yummierpizza.com
Sicilian - http://yummierpizza.com
Shoes - http://sexyshoes.com
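Here is a rough sketch of that index-time expansion. The `SYNONYMS` dictionary is a tiny hand-made stand-in for a real thesaurus, and the function name is illustrative.

```python
from collections import defaultdict

# Hypothetical synonym dictionary: term -> other terms with the same meaning.
SYNONYMS = {"italian": {"italy"}, "italy": {"italian"}}

pages = {
    "http://yummypizza.com": "Italian Pizza",
    "http://yummierpizza.com": "Sicilian Pizza",
    "http://sexyshoes.com": "Italian Shoes",
}

def build_index_with_synonyms(pages, synonyms):
    """Index each URL under every term it contains and under that term's synonyms."""
    index = defaultdict(set)
    for url, contents in pages.items():
        for term in contents.lower().split():
            index[term].add(url)
            for synonym in synonyms.get(term, ()):
                index[synonym].add(url)
    return index

index = build_index_with_synonyms(pages, SYNONYMS)
print(sorted(index["italy"] & index["pizza"]))  # ['http://yummypizza.com']
```

The alternative is to leave the index alone and expand the query instead ("Italy" becomes "Italy OR Italian"). That keeps the index smaller at the cost of more lookups per search.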
Now you're ready to beat Google! Are you looking for investors? Because I'm there!
But wait, what if someone searches for "I want Italian Pizza", and your web page for Italian pizza
doesn't have the words "I want"? Well, it doesn't make sense to search for "I" and "want", right? So,
you need to chuck out some words from the input. These are called stop words. So, "I want Italian
Pizza" becomes "Italian Pizza". All you need is a dictionary of stop words. You can even use this to
stop people from searching for bad words.
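Stripping stop words from the query can be sketched in a few lines. The `STOP_WORDS` set below is a tiny made-up list, where a real one would have a few hundred entries.

```python
# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"i", "want", "a", "an", "the", "for", "to", "of"}

def clean_query(query, stop_words=STOP_WORDS):
    """Lowercase the query, split it into terms, and drop the stop words."""
    return [term for term in query.lower().split() if term not in stop_words]

print(clean_query("I want Italian Pizza"))  # ['italian', 'pizza']
```

The same filter is normally applied when indexing pages too, so that stop words never get records of their own.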
Ready to beat Google? Angel investors all lined up?
No wait. People make spelling mistakes. There are algorithms that convert words into codes based
on how they sound. You can convert all the terms into Soundex codes, and convert your search term
into Soundex too. If you don't get enough results when you search without Soundex, you fall back to
Soundex.
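For a feel of what that looks like, here is a compact sketch of the classic American Soundex rules: keep the first letter, turn the remaining consonants into digits, collapse repeats, and pad to four characters. It assumes plain English words and is not a drop-in replacement for a library implementation.

```python
# Consonant groups and their Soundex digits; vowels and y get no digit.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the 4-character Soundex code of a word, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of same-coded letters
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so the same code may repeat after it
    return (result + "000")[:4]

print(soundex("Pizza"), soundex("Piza"), soundex("Pizzza"))  # P200 P200 P200
```

Index the Soundex code of every term next to the term itself, and a search for "Piza" can fall back to the P200 record when the exact lookup comes up short.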
Great! Now, I'm ready to rule the world!
But, wait, what about ranking? When you are talking about showing millions of results, what you
show as the first result matters. So, how do you sort them? Alphabetically? Lame! Sorted by time?
Lame! Or how about this: if the word Pizza comes up more times in a web page, you rank it higher?
(And then SEO people will fill the page with Pizza Pizza Pizza Pizza... Anyone remember the days
when SEO experts would ask you to put a bunch of shit in your page?)
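A quick sketch of that count-the-word ranking shows why it is so easy to game. The spammy page and its URL below are made up for the demonstration.

```python
from collections import Counter

pages = {
    "http://yummypizza.com": "Italian Pizza",
    "http://yummierpizza.com": "Sicilian Pizza",
    # Hypothetical keyword-stuffed page.
    "http://spammypizza.example": "Pizza Pizza Pizza Pizza",
}

def rank_by_term_frequency(pages, term):
    """Return (url, count) pairs for pages containing the term, most occurrences first."""
    term = term.lower()
    hits = []
    for url, contents in pages.items():
        count = Counter(contents.lower().split())[term]
        if count > 0:
            hits.append((url, count))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

print(rank_by_term_frequency(pages, "pizza")[0])
# ('http://spammypizza.example', 4)
```

The page that just repeats the word wins, which is why raw term counts alone make a poor ranking signal.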
You can't just let users pick their own sorting! Re-sorting millions of results will kill your server.
So, you come up with your own way of sorting the results and you physically store the URLs in each
record in that order. This means that you can read just the first part of a record and you don't have
to sort on every search.
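A minimal sketch of that idea, assuming every page already has a single query-independent score (the `scores` values below are invented): sort once at index time, then each search only slices the front of the record.

```python
from collections import defaultdict

pages = {
    "http://yummypizza.com": "Italian Pizza",
    "http://yummierpizza.com": "Sicilian Pizza",
    "http://sexyshoes.com": "Italian Shoes",
}
# Hypothetical precomputed quality score per page (higher is better).
scores = {
    "http://yummypizza.com": 0.5,
    "http://yummierpizza.com": 0.9,
    "http://sexyshoes.com": 0.7,
}

def build_ranked_index(pages, scores):
    """Build term -> list of URLs, with every list already in best-first order."""
    index = defaultdict(list)
    # Visit pages best-first so each list comes out sorted without a per-query sort.
    for url in sorted(pages, key=lambda u: scores[u], reverse=True):
        for term in set(pages[url].lower().split()):
            index[term].append(url)
    return index

def top_k(index, term, k):
    """Read only the first k entries of a term's record."""
    return index.get(term.lower(), [])[:k]

index = build_ranked_index(pages, scores)
print(top_k(index, "pizza", 1))  # ['http://yummierpizza.com']
```

The catch is that the order is fixed for everyone, so all the cleverness has to go into how the scores are computed.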
This is where you can't beat Google, because of Google's PageRank algorithm. That's proprietary,
and that's Google's secret sauce.
Of course, you might not want to compete with Google. You might be building a search engine to
search through your library of Kindle books, right? However, the core problem of algorithmically
ranking search results is very hard to nail down. You might find that how you rank depends on what
you are searching. This is where most search engines have trouble, and that's why Google
basically beat every other web search engine. The good news is that you don't have to implement all
this yourself. Solr, which is built on Lucene, provides all of this. You need to provide the data and
decide how to rank your results.
Apache Lucene provides Java-based indexing and search technology, as well as spellchecking, hit
highlighting and advanced analysis/tokenization capabilities.
Apache Solr is a high performance search server built using Lucene Core, with XML/HTTP and
JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin
interface.
Reference - http://lucene.apache.org/
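To give a feel for the HTTP side of Solr, here is a small sketch that only builds a query URL for Solr's `select` handler and sends nothing. The host and port, the core name `books` and the field name `title` are assumptions for illustration. `q`, `rows` and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode

def solr_select_url(base_url, core, query, rows=10):
    """Build a URL for a Solr select request that asks for JSON results."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/solr/{core}/select?{params}"

# Hypothetical local Solr instance with a core named "books".
print(solr_select_url("http://localhost:8983", "books", "title:pizza"))
```

Fetching that URL with any HTTP client returns the matching documents. Solr takes care of the tokenizing, stop words, synonyms and scoring according to how the core is configured.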
Apache Nutch is a highly extensible and scalable open source web crawler software project.