Elasticsearch: Indexing Under the Hood
- Gaurav Kukal
LinkedIn: @Gaurav.Kukal
 Let's start with a fresh index.
 Document "1" gets indexed.
 It is put in the in-memory buffer and the transaction log at the same time on each node (primary shard and replica shards).
 No segment has been created yet.
 Assume the refresh interval (the soft commit time) is 30 seconds.
 So this picture shows the first 30 seconds from the time the first document is indexed.
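The 30-second refresh interval assumed above is not the default (Elasticsearch defaults to 1 second); it can be set per index through the settings API. A sketch, where "my-index" is a placeholder index name:

```shell
# Set the refresh interval (soft commit time) for one index.
# "my-index" is a placeholder; the cluster default is 1s.
curl -X PUT "localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "30s"}}'
```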
Indexing: Under the Hood
 After the soft commit interval (also called the refresh interval) is over, a segment is written to the file system cache.
 The file system cache is not disk.
 A new searcher is opened so that read calls can see document "1".
 Note that the transaction log still has document "1".
 Documents "2" and "3" also get indexed. They go to the in-memory buffer first, and after the refresh interval is over a new segment is created in the file system cache (still not on disk).
 A new searcher is opened over the 2 segments (the yellow ones), and read calls can see all documents 1, 2 and 3.
 The transaction log keeps capturing every document that comes in.
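The buffer-then-refresh behaviour in the slides above can be sketched with a toy model. This is not Elasticsearch code; the class and method names are invented for illustration:

```python
class ToyShard:
    """Toy model of refresh: buffered docs become a searchable segment."""

    def __init__(self):
        self.buffer = []     # in-memory indexing buffer
        self.translog = []   # transaction log (append-only)
        self.segments = []   # "segments" visible to the searcher

    def index(self, doc):
        # A new document goes to the buffer AND the translog at once.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Soft commit: the buffer is written out as a new segment
        # (to the file system cache, not fsynced to disk) and a new
        # searcher is opened over it. The translog is NOT cleared.
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def search(self):
        # Read calls only see documents that live in some segment.
        return [doc for seg in self.segments for doc in seg]


shard = ToyShard()
shard.index("1")
print(shard.search())   # [] -- doc 1 is buffered, not yet visible
shard.refresh()
shard.index("2")
shard.index("3")
print(shard.search())   # ['1'] -- docs 2 and 3 wait for the next refresh
shard.refresh()
print(shard.search())   # ['1', '2', '3']
print(shard.translog)   # ['1', '2', '3'] -- refresh never clears the translog
```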
 Documents "4", "5" and "6" get indexed now. They are in the in-memory buffer and also in the transaction log.
 At this point read calls will not see documents 4, 5 and 6.
 The transaction log is now full.
 Because the transaction log is now full, a hard commit happens.
 A hard commit happens when 30 minutes are over or the transaction log is full, whichever comes first. This is configurable.
 All the segments generated previously are fsynced to disk.
 The in-memory buffer is cleared.
 The transaction log is cleared.
 Anything still in the in-memory buffer is spilled to a new segment and fsynced as well.
 A commit point is written so that all 3 segments are known to exist in case of failure.
 A searcher is opened, and all documents are visible to read calls.
 All is great. If the machine fails at this point, it is perfectly positioned to recover faultlessly.
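The "translog full" trigger described above corresponds to a real per-index setting, and a hard commit (a flush, in Elasticsearch terms) can also be requested manually. A sketch, with "my-index" again a placeholder name:

```shell
# Lower the translog size threshold so hard commits happen sooner.
# "my-index" is a placeholder index name.
curl -X PUT "localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"translog": {"flush_threshold_size": "256mb"}}}'

# Trigger a hard commit (flush) manually.
curl -X POST "localhost:9200/my-index/_flush"
```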
 A new document 7 is indexed. A refresh happens after that.
 Then document 8 is indexed.
 The segment in yellow is in the file system cache and contains just document 7. A searcher is opened for it, so read calls can see document 7.
 Document 8 is still in the in-memory buffer. And all these new documents are in the transaction log.
 Read calls cannot see document 8 at this point.
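If document 8 needed to be visible before the next scheduled refresh, a refresh can be forced through the refresh API (index name is a placeholder):

```shell
# Force a refresh so buffered documents become searchable immediately.
curl -X POST "localhost:9200/my-index/_refresh"
```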
 What happens when there are too many segments?
 Merging to the rescue.
 Smaller segments are merged into bigger segments, based on merge policies.
 It is an IO-bound operation.
 Merging saves space.
 It happens in parallel with searching; the searcher is then switched to the new segment.
Merging Saves Space. Why?
 Deleting a document does not remove it from its segment; a .del file keeps track that the document is deleted.
 Updating a document keeps the original document, marks it as deleted, and creates a new document.
 Merging collapses those "multiple" copies of the same document across different segments into one, thus saving space.
 It helps search as well, because there are now fewer Lucene segments to search over.
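The space saving can be sketched with a toy merge. The function below is an invented illustration, not Lucene's actual merge policy: `deleted` plays the role of the .del file, naming doc ids that were deleted or superseded by an update.

```python
def merge(segments, deleted):
    """Toy merge: combine segments into one, dropping deleted doc ids."""
    return [doc for seg in segments for doc in seg if doc not in deleted]


# Doc "1" was updated: the original copy stays in its old segment and is
# marked deleted; the new version ("1-v2") lives in a newer segment.
segments = [["1", "2"], ["3", "1-v2"], ["4"]]
deleted = {"1"}

merged = merge(segments, deleted)
print(merged)                          # ['2', '3', '1-v2', '4']
print(sum(len(s) for s in segments))   # 5 docs across 3 segments before
print(len(merged))                     # 4 docs in 1 segment after
```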