-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Add recursive chunker #126866
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add recursive chunker #126866
Conversation
Hi @dan-rubinstein, I've created a changelog YAML for you. |
...nference/src/test/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunkerTests.java
Outdated
Show resolved
Hide resolved
...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java
Outdated
Show resolved
Hide resolved
...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java
Show resolved
Hide resolved
...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java
Outdated
Show resolved
Hide resolved
Pinging @elastic/ml-core (Team:ML) |
.../plugin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/SeparatorSet.java
Show resolved
Hide resolved
...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java
Outdated
Show resolved
Hide resolved
...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java
Outdated
Show resolved
Hide resolved
...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java
Show resolved
Hide resolved
...gin/inference/src/main/java/org/elasticsearch/xpack/inference/chunking/RecursiveChunker.java
Outdated
Show resolved
Hide resolved
for (int i = 0; i < numSentences - 1; i++) { | ||
splittersAfterSentences.add(randomFrom(validSplittersAfterSentences)); | ||
} | ||
RecursiveChunkingSettings settings = generateChunkingSettings(15, separators); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of the small chunk size the generated chunks will never contain more than 1 sentence. Can you structure the test so that some chunks contain multiple heading sections.
For example if, if chunks size was 100 words and given the document
# heading1.1
## heading1.2.1
TEST_SENTENCE * 3
## heading1.2.2
TEST_SENTENCE * 2
# heading2.1
## heading2.2.1
TEST_SENTENCE * 9
## heading2.2.2
TEST_SENTENCE
### heading2.3.1
TEST_SENTENCE
### heading2.3.2
TEST_SENTENCE
In this case, given an ordered list of separators, I would expect # heading1.1
-> # heading2.1
to be a single chunks. Then 2 more chunks for heading2.2.1
and heading2.2.2
Please add tests on longer documents that capture the hierarchical nature of the chunker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline with Dave. Adding this into the existing long document tests that randomly generate a document would require essentially re-writing the chunking logic into the testing file to generate the expected chunk limits. We've instead decided it makes sense to add a new test with a smaller fixed length document to cover this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@elasticmachine merge upstream |
💚 Backport successful
|
* Add recursive chunker (#126866) * Add recursive chunker * Update docs/changelog/126866.yaml * Clean up separator sets and add asMap function for RecrusiveChunkingSettings * Add javadoc for chunker, add tests, reduce word counting operations * Remove split merging and add long document unit test * [CI] Auto commit changes from spotless * Add markdown chunking tests and reduce substring calls * Clean up matcher logic * Add testing for not splitting after valid chunk is found --------- Co-authored-by: elasticsearchmachine <[email protected]> Co-authored-by: Elastic Machine <[email protected]> * Update getFirst to get in recursive chunker tests --------- Co-authored-by: elasticsearchmachine <[email protected]> Co-authored-by: Elastic Machine <[email protected]>
Description
This change adds the recursive chunking strategy (see Javadocs) of the RecursiveChunker.java to learn how it works. The recursive chunker comes with a default separator set for plaintext along with one for markdown documents.
Example usage:
Testing
CONTRIBUTING.md
document.