Elasticsearch: Indexing Under the Hood
- Gaurav Kukal
LinkedIn: @Gaurav.Kukal
 Let's start with a fresh index.
 Document "1" gets indexed.
 It is put in the in-memory buffer and the transaction log at the same time on each node (primary shard and replica shards).
 No segment has been created yet.
 Assume the refresh interval (the soft commit time) is 30 seconds.
 So this picture shows the first 30 seconds from the time the first document is indexed.
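The 30-second refresh interval assumed above is not the default (Elasticsearch defaults to 1 second); it can be set per index through the settings API. A sketch, where "my-index" is a placeholder index name:

```shell
# Set the refresh interval (soft commit time) for one index.
# "my-index" is a placeholder; the cluster default is 1s.
curl -X PUT "localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "30s"}}'
```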
Indexing: Under the Hood
 After the soft commit interval (also called the refresh interval) is over, a segment is written to the file system cache.
 The file system cache is not disk.
 A new searcher is opened so that read calls can see document "1".
 Note that the transaction log still has document "1".
 Documents "2" and "3" also get indexed. They go to the in-memory buffer first, and after the refresh interval is over a new segment is created in the file system cache (still not on disk).
 A new searcher is opened over the 2 segments (the yellow ones), and read calls can see all documents 1, 2 and 3.
 The transaction log keeps capturing every document that comes in.
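The buffer-then-refresh behaviour in the slides above can be sketched with a toy model. This is not Elasticsearch code; the class and method names are invented for illustration:

```python
class ToyShard:
    """Toy model of refresh: buffered docs become a searchable segment."""

    def __init__(self):
        self.buffer = []     # in-memory indexing buffer
        self.translog = []   # transaction log (append-only)
        self.segments = []   # "segments" visible to the searcher

    def index(self, doc):
        # A new document goes to the buffer AND the translog at once.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Soft commit: the buffer is written out as a new segment
        # (to the file system cache, not fsynced to disk) and a new
        # searcher is opened over it. The translog is NOT cleared.
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def search(self):
        # Read calls only see documents that live in some segment.
        return [doc for seg in self.segments for doc in seg]


shard = ToyShard()
shard.index("1")
print(shard.search())   # [] -- doc 1 is buffered, not yet visible
shard.refresh()
shard.index("2")
shard.index("3")
print(shard.search())   # ['1'] -- docs 2 and 3 wait for the next refresh
shard.refresh()
print(shard.search())   # ['1', '2', '3']
print(shard.translog)   # ['1', '2', '3'] -- refresh never clears the translog
```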
 Documents "4", "5" and "6" get indexed now. They are in the in-memory buffer and also in the transaction log.
 At this point read calls will not see documents 4, 5 and 6.
 The transaction log is now full.
 Because the transaction log is now full, a hard commit happens.
 A hard commit happens when 30 minutes are over or the transaction log is full, whichever comes first. This is configurable.
 All the segments generated previously are fsynced to disk.
 The in-memory buffer is cleared.
 The transaction log is cleared.
 Anything still in the in-memory buffer is spilled to a new segment and fsynced as well.
 A commit point is written so that all 3 segments are known to exist in case of failure.
 A searcher is opened, and all documents are visible to read calls.
 All is great. If the machine fails at this point, it is perfectly positioned to recover faultlessly.
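The "translog full" trigger described above corresponds to a real per-index setting, and a hard commit (a flush, in Elasticsearch terms) can also be requested manually. A sketch, with "my-index" again a placeholder name:

```shell
# Lower the translog size threshold so hard commits happen sooner.
# "my-index" is a placeholder index name.
curl -X PUT "localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"translog": {"flush_threshold_size": "256mb"}}}'

# Trigger a hard commit (flush) manually.
curl -X POST "localhost:9200/my-index/_flush"
```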
 A new document 7 is indexed. A refresh happens after that.
 Then document 8 is indexed.
 The segment in yellow is in the file system cache and contains just document 7. A searcher is opened for it, so read calls can see document 7.
 Document 8 is still in the in-memory buffer. And all these new documents are in the transaction log.
 Read calls cannot see document 8 at this point.
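If document 8 needed to be visible before the next scheduled refresh, a refresh can be forced through the refresh API (index name is a placeholder):

```shell
# Force a refresh so buffered documents become searchable immediately.
curl -X POST "localhost:9200/my-index/_refresh"
```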
 What happens when there are too many segments?
 Merging to the rescue.
 Smaller segments are merged into bigger segments, based on merge policies.
 It is an IO-bound operation.
 Merging saves space.
 It happens in parallel with searching; the searcher is then switched to the new segment.
Merging Saves Space. Why?
 Deleting a document does not remove it from its segment; a .del file keeps track that the document is deleted.
 Updating a document keeps the original document, marks it as deleted, and creates a new document.
 Merging collapses those "multiple" copies of the same document across different segments into one, thus saving space.
 It helps search as well, because there are now fewer Lucene segments to search over.
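The space saving can be sketched with a toy merge. The function below is an invented illustration, not Lucene's actual merge policy: `deleted` plays the role of the .del file, naming doc ids that were deleted or superseded by an update.

```python
def merge(segments, deleted):
    """Toy merge: combine segments into one, dropping deleted doc ids."""
    return [doc for seg in segments for doc in seg if doc not in deleted]


# Doc "1" was updated: the original copy stays in its old segment and is
# marked deleted; the new version ("1-v2") lives in a newer segment.
segments = [["1", "2"], ["3", "1-v2"], ["4"]]
deleted = {"1"}

merged = merge(segments, deleted)
print(merged)                          # ['2', '3', '1-v2', '4']
print(sum(len(s) for s in segments))   # 5 docs across 3 segments before
print(len(merged))                     # 4 docs in 1 segment after
```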