Add recursive chunker #126866

Merged

Conversation

Member

@dan-rubinstein dan-rubinstein commented Apr 15, 2025

Description

This change adds the recursive chunking strategy (see the Javadoc in RecursiveChunker.java to learn how it works). The recursive chunker comes with a default separator set for plaintext and another for markdown documents.

Example usage:

PUT _inference/sparse_embedding/my-endpoint
{
  "service": "<service>",
  "service_settings": {
     ...
  },
  "chunking_settings": {
    "strategy": "recursive",
    "separator_set": "markdown",
    "max_chunk_size": 250
  }
}

PUT _inference/sparse_embedding/my-endpoint
{
  "service": "<service>",
  "service_settings": {
     ...
  },
  "chunking_settings": {
    "strategy": "recursive",
    "separators": [<separator1>, <separator2>, ...],
    "max_chunk_size": 250
  }
}
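
As a rough illustration of the strategy, recursive chunking over an ordered separator list can be sketched as below. This is not the code from RecursiveChunker.java. The class and method names are hypothetical, and the sketch assumes that max_chunk_size counts whitespace-delimited words, that separators are literal strings rather than patterns, and that text which is still too large once the separators run out is cut every max_chunk_size words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative sketch only; NOT the implementation merged in this PR. */
class RecursiveChunkSketch {

    /** Counts whitespace-delimited words (assumed unit of max_chunk_size). */
    static int wordCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Chunks text using separators ordered from coarsest to finest. */
    static List<String> chunk(String text, List<String> separators, int maxChunkSize) {
        return chunk(text, separators, 0, maxChunkSize);
    }

    private static List<String> chunk(String text, List<String> separators, int level, int maxChunkSize) {
        List<String> chunks = new ArrayList<>();
        if (text.trim().isEmpty()) {
            return chunks; // nothing to emit for blank input
        }
        if (wordCount(text) <= maxChunkSize) {
            chunks.add(text); // small enough: stop splitting here
            return chunks;
        }
        if (level >= separators.size()) {
            // Assumption: with no separators left, cut every maxChunkSize words.
            String[] words = text.trim().split("\\s+");
            for (int i = 0; i < words.length; i += maxChunkSize) {
                int end = Math.min(words.length, i + maxChunkSize);
                chunks.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
            }
            return chunks;
        }
        // Split *before* each separator so it stays attached to the text that follows it.
        String[] pieces = text.split("(?=" + Pattern.quote(separators.get(level)) + ")");
        for (String piece : pieces) {
            // Oversized pieces are retried with the next, finer separator.
            chunks.addAll(chunk(piece, separators, level + 1, maxChunkSize));
        }
        return chunks;
    }
}
```

In this sketch, a call such as `RecursiveChunkSketch.chunk(text, List.of("\n# ", "\n## ", "\n\n"), 250)` tries the coarsest separator first and only falls back to finer ones for pieces that exceed the limit. Pieces are not merged back together, so concatenating the chunks reproduces the input whenever the word-level fallback is not reached.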

Testing

  • Unit testing
  • Manually tested chunking the CONTRIBUTING.md document.

@dan-rubinstein dan-rubinstein added >enhancement :ml Machine learning Team:ML Meta label for the ML team v8.19.0 v9.1.0 labels Apr 15, 2025
@elasticsearchmachine
Collaborator

Hi @dan-rubinstein, I've created a changelog YAML for you.

@dan-rubinstein dan-rubinstein marked this pull request as ready for review April 28, 2025 18:02
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

for (int i = 0; i < numSentences - 1; i++) {
    splittersAfterSentences.add(randomFrom(validSplittersAfterSentences));
}
RecursiveChunkingSettings settings = generateChunkingSettings(15, separators);

Member

Because of the small chunk size, the generated chunks will never contain more than one sentence. Can you structure the test so that some chunks contain multiple heading sections?

For example, if the chunk size were 100 words, given the document:

# heading1.1
## heading1.2.1 
TEST_SENTENCE * 3

## heading1.2.2 
TEST_SENTENCE * 2

# heading2.1
## heading2.2.1 
TEST_SENTENCE * 9

## heading2.2.2 
TEST_SENTENCE 

### heading2.3.1
TEST_SENTENCE 

### heading2.3.2
TEST_SENTENCE  

In this case, given an ordered list of separators, I would expect # heading1.1 -> # heading2.1 to be a single chunk, followed by two more chunks for heading2.2.1 and heading2.2.2.

Please add tests on longer documents that capture the hierarchical nature of the chunker.
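
The expectation above can be checked with simple word counts per top-level section. The sketch below rebuilds the example document and counts the words under each "# " heading. It assumes a hypothetical 10-word stand-in for TEST_SENTENCE (the real constant's length is not shown in this conversation), and the class and method names are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative word-count check for the example document; names are hypothetical. */
class HeadingSectionWordCounts {

    // Hypothetical stand-in for TEST_SENTENCE: exactly 10 words.
    static final String SENTENCE = "This is a ten word sentence used only for illustration.";

    /** Returns n copies of SENTENCE separated by single spaces. */
    static String sentences(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            if (i > 0) sb.append(' ');
            sb.append(SENTENCE);
        }
        return sb.toString();
    }

    /** Builds the document from the review comment. */
    static String buildDocument() {
        return "# heading1.1\n"
            + "## heading1.2.1\n" + sentences(3) + "\n"
            + "## heading1.2.2\n" + sentences(2) + "\n"
            + "# heading2.1\n"
            + "## heading2.2.1\n" + sentences(9) + "\n"
            + "## heading2.2.2\n" + sentences(1) + "\n"
            + "### heading2.3.1\n" + sentences(1) + "\n"
            + "### heading2.3.2\n" + sentences(1) + "\n";
    }

    static int wordCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Word count of each top-level section, keyed by its "# " heading line. */
    static Map<String, Integer> topLevelSectionWordCounts(String doc) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Split before every line that starts with "# " (but not "## " or "### ").
        for (String section : doc.split("(?m)(?=^# )")) {
            String heading = section.substring(0, section.indexOf('\n'));
            counts.put(heading, wordCount(section));
        }
        return counts;
    }
}
```

Under these assumptions the "# heading1.1" section comes to 56 words, so it fits in one 100-word chunk even though it spans two "##" sections. The "# heading2.1" section comes to 130 words, so it has to be split again at the next separator level.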

Member Author

Discussed offline with Dave. Adding this to the existing long-document tests, which randomly generate a document, would essentially require rewriting the chunking logic in the test file to generate the expected chunk limits. We've instead decided it makes sense to add a new test with a smaller, fixed-length document to cover this case.

Member

@davidkyle davidkyle left a comment

LGTM

@dan-rubinstein dan-rubinstein added the auto-backport Automatically create backport pull requests when merged label Jun 18, 2025
@dan-rubinstein
Member Author

@elasticmachine merge upstream

@dan-rubinstein dan-rubinstein merged commit 4275bc7 into elastic:main Jun 18, 2025
27 checks passed
@elasticsearchmachine
Collaborator

💚 Backport successful

Status Branch Result
8.19

dan-rubinstein added a commit that referenced this pull request Jun 18, 2025
* Add recursive chunker (#126866)

* Add recursive chunker

* Update docs/changelog/126866.yaml

* Clean up separator sets and add asMap function for RecursiveChunkingSettings

* Add javadoc for chunker, add tests, reduce word counting operations

* Remove split merging and add long document unit test

* [CI] Auto commit changes from spotless

* Add markdown chunking tests and reduce substring calls

* Clean up matcher logic

* Add testing for not splitting after valid chunk is found

---------

Co-authored-by: elasticsearchmachine <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>

* Update getFirst to get in recursive chunker tests

---------

Co-authored-by: elasticsearchmachine <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
Labels
auto-backport Automatically create backport pull requests when merged >enhancement :ml Machine learning Team:ML Meta label for the ML team v8.19.0 v9.1.0
4 participants