Add recursive chunker #126866

dan-rubinstein · 2025-04-15T18:37:13Z

Description

This change adds the recursive chunking strategy (see the Javadocs of RecursiveChunker.java to learn how it works). The recursive chunker comes with a default separator set for plaintext along with one for markdown documents.

Example usage:

PUT _inference/sparse_embedding/my-endpoint
{
  "service": "<service>",
  "service_settings": {
     ...
  },
  "chunking_settings": {
    "strategy": "recursive",
    "separator_set": "markdown",
    "max_chunk_size": 250
  }
}

PUT _inference/sparse_embedding/my-endpoint
{
  "service": "<service>",
  "service_settings": {
     ...
  },
  "chunking_settings": {
    "strategy": "recursive",
    "separators": [<separator1>, <separator2>, ...],
    "max_chunk_size": 250
  }
}
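
To illustrate the idea behind the `recursive` strategy configured above, here is a minimal standalone sketch. It is not the code in RecursiveChunker.java. The class name, the whitespace-based word count, the treatment of separators as regular expressions, and the choice to keep each separator at the start of the following piece are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the Elasticsearch implementation.
class RecursiveChunkSketch {

    /** Counts whitespace-delimited words (assumption: the real chunker may count differently). */
    static int wordCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /**
     * Splits text at the separator at index level. Any piece that is still larger
     * than maxWords is split again with the next separator in the list.
     */
    static List<String> chunk(String text, List<String> separators, int level, int maxWords) {
        List<String> chunks = new ArrayList<>();
        if (wordCount(text) <= maxWords || level >= separators.size()) {
            // Small enough, or no separators left. A real implementation needs a
            // fallback here for oversized text; this sketch emits it unchanged.
            chunks.add(text);
            return chunks;
        }
        Matcher matcher = Pattern.compile(separators.get(level)).matcher(text);
        int start = 0;
        while (matcher.find()) {
            if (matcher.start() > start) {
                // The piece ends just before the separator, so the separator
                // stays attached to the piece that follows it.
                chunks.addAll(chunk(text.substring(start, matcher.start()), separators, level + 1, maxWords));
                start = matcher.start();
            }
        }
        chunks.addAll(chunk(text.substring(start), separators, level + 1, maxWords));
        return chunks;
    }

    public static void main(String[] args) {
        String text = "# A\none two three\n## A1\nfour five six seven\n## A2\neight nine\n# B\nten";
        // Hypothetical separators, ordered from coarsest to finest.
        List<String> separators = List.of("\n# ", "\n## ");
        for (String piece : chunk(text, separators, 0, 8)) {
            System.out.println("[" + wordCount(piece) + " words] " + piece.trim().replace("\n", " | "));
        }
    }
}
```

In this sketch a piece that already fits within `maxWords` is never split further, so coarse sections stay together and only oversized sections are broken down at the next separator level.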

Testing

Unit testing
Manually tested chunking the CONTRIBUTING.md document.

elasticsearchmachine · 2025-04-15T18:37:37Z

Hi @dan-rubinstein, I've created a changelog YAML for you.

...nference/src/test/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunkerTests.java

...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java

elasticsearchmachine · 2025-04-28T18:03:13Z

Pinging @elastic/ml-core (Team:ML)

.../plugin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/SeparatorSet.java

...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java

davidkyle · 2025-06-16T14:58:55Z

...nference/src/test/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunkerTests.java

+        for (int i = 0; i < numSentences - 1; i++) {
+            splittersAfterSentences.add(randomFrom(validSplittersAfterSentences));
+        }
+        RecursiveChunkingSettings settings = generateChunkingSettings(15, separators);


Because of the small chunk size, the generated chunks will never contain more than one sentence. Can you structure the test so that some chunks contain multiple heading sections?

For example, suppose the chunk size is 100 words and the document is:

# heading1.1
## heading1.2.1
TEST_SENTENCE * 3
## heading1.2.2
TEST_SENTENCE * 2
# heading2.1
## heading2.2.1
TEST_SENTENCE * 9
## heading2.2.2
TEST_SENTENCE
### heading2.3.1
TEST_SENTENCE
### heading2.3.2
TEST_SENTENCE

In this case, given an ordered list of separators, I would expect # heading1.1 -> # heading2.1 to be a single chunk, then 2 more chunks for heading2.2.1 and heading2.2.2.

Please add tests on longer documents that capture the hierarchical nature of the chunker.
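
The arithmetic behind that expectation can be checked with a small standalone sketch. It assumes a ten-word stand-in for TEST_SENTENCE and a whitespace-based word count in which a heading such as `# heading1.1` counts as two words. The class and helper names are hypothetical and are not part of the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for checking section sizes; not part of the PR.
class SectionWordCounts {

    // Assumption: a ten-word stand-in for the TEST_SENTENCE constant used in the tests.
    static final String TEST_SENTENCE = "This is a test sentence that has ten total words. ";

    static int wordCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Builds the example document from the review comment, one heading per line. */
    static String buildDocument() {
        return "# heading1.1\n"
            + "## heading1.2.1\n" + TEST_SENTENCE.repeat(3) + "\n"
            + "## heading1.2.2\n" + TEST_SENTENCE.repeat(2) + "\n"
            + "# heading2.1\n"
            + "## heading2.2.1\n" + TEST_SENTENCE.repeat(9) + "\n"
            + "## heading2.2.2\n" + TEST_SENTENCE + "\n"
            + "### heading2.3.1\n" + TEST_SENTENCE + "\n"
            + "### heading2.3.2\n" + TEST_SENTENCE + "\n";
    }

    /** Starts a new section before every line that begins with prefix, e.g. "# ". */
    static List<String> splitBeforeHeading(String text, String prefix) {
        List<String> sections = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\n")) {
            if (line.startsWith(prefix) && current.length() > 0) {
                sections.add(current.toString());
                current.setLength(0);
            }
            current.append(line).append("\n");
        }
        if (current.length() > 0) {
            sections.add(current.toString());
        }
        return sections;
    }

    public static void main(String[] args) {
        for (String section : splitBeforeHeading(buildDocument(), "# ")) {
            System.out.println(wordCount(section) + " words in top-level section");
            for (String sub : splitBeforeHeading(section, "## ")) {
                System.out.println("  " + wordCount(sub) + " words in sub-section");
            }
        }
    }
}
```

Under these assumptions the first top-level section has 56 words and fits in a single 100-word chunk, while the second has 130 words and must be split at `##`, giving pieces of 92 and 36 words for heading2.2.1 and heading2.2.2. How the two-word `# heading2.1` remainder is handled depends on the chunker and is not modelled here.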

dan-rubinstein · 2025-06-17

Discussed offline with Dave. Adding this to the existing long document tests, which randomly generate a document, would essentially require rewriting the chunking logic in the test file to generate the expected chunk limits. We've instead decided it makes sense to add a new test with a smaller fixed-length document to cover this case.

davidkyle

LGTM

dan-rubinstein · 2025-06-18T15:55:18Z

@elasticmachine merge upstream

elasticsearchmachine · 2025-06-18T17:30:51Z

💚 Backport successful

Status	Branch	Result
✅	8.19

* Add recursive chunker (#126866)
  * Add recursive chunker
  * Update docs/changelog/126866.yaml
  * Clean up separator sets and add asMap function for RecrusiveChunkingSettings
  * Add javadoc for chunker, add tests, reduce word counting operations
  * Remove split merging and add long document unit test
  * [CI] Auto commit changes from spotless
  * Add markdown chunking tests and reduce substring calls
  * Clean up matcher logic
  * Add testing for not splitting after valid chunk is found
  ---------
  Co-authored-by: elasticsearchmachine <[email protected]>
  Co-authored-by: Elastic Machine <[email protected]>
* Update getFirst to get in recursive chunker tests
---------
Co-authored-by: elasticsearchmachine <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>

Add recursive chunker

5167b21

dan-rubinstein added the >enhancement, :ml, Team:ML, v8.19.0, and v9.1.0 labels Apr 15, 2025

Update docs/changelog/126866.yaml

7d9e07c

dan-rubinstein added 2 commits April 16, 2025 14:28

Merge branch 'main' into recursive-chunking-strategy

8418223

Clean up separator sets and add asMap function for RecrusiveChunkingSettings

0685124

davidkyle reviewed Apr 25, 2025

Add javadoc for chunker, add tests, reduce word counting operations

f40947a

dan-rubinstein marked this pull request as ready for review April 28, 2025 18:02

dan-rubinstein added 2 commits June 11, 2025 13:31

Merge branch 'main' into recursive-chunking-strategy

6f649fc

Remove split merging and add long document unit test

c8a5f0c

dan-rubinstein commented Jun 11, 2025

.../plugin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/SeparatorSet.java

davidkyle reviewed Jun 12, 2025

...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java

...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java

dan-rubinstein and others added 3 commits June 13, 2025 10:08

Merge branch 'main' into recursive-chunking-strategy

6f337a8

[CI] Auto commit changes from spotless

6035d76

Add markdown chunking tests and reduce substring calls

29498f7

davidkyle reviewed Jun 16, 2025

...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java

...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java

Clean up matcher logic

0d6b461

davidkyle reviewed Jun 16, 2025

Add testing for not splitting after valid chunk is found

3edf75e

davidkyle approved these changes Jun 18, 2025

dan-rubinstein added the auto-backport label Jun 18, 2025

Merge branch 'main' into recursive-chunking-strategy

3ac8b94

dan-rubinstein merged commit 4275bc7 into elastic:main Jun 18, 2025
27 checks passed

dan-rubinstein mentioned this pull request Jun 18, 2025

[8.19] Add recursive chunker (#126866) #129656

Merged
