-
Notifications
You must be signed in to change notification settings - Fork 25.5k
Deduplicate mappings #69772
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deduplicate mappings #69772
Conversation
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few small comments on the cluster state persistence changes
private static final String DATA_FIELD_NAME = "data"; | ||
private static final String GLOBAL_TYPE_NAME = "global"; | ||
private static final String INDEX_TYPE_NAME = "index"; | ||
private static final String MAPPING_TYPE_NAME = "mapping"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remember to update the comment at the top describing the new schema :)
|
||
Map<String, MappingMetadata> mappings = new HashMap<>(); | ||
consumeFromType(searcher, MAPPING_TYPE_NAME, bytes -> { | ||
MappingMetadata mappingMetadata = MappingMetadata.fromXContent(XContentFactory.xContent(XContentType.SMILE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we include trace logging for loading each mapping too?
if (indexUUIDs.add(indexMetadata.getIndexUUID()) == false) { | ||
throw new IllegalStateException("duplicate metadata found for " + indexMetadata.getIndex() + " in [" + dataPath + "]"); | ||
} | ||
if (indexMetadata.mapping() != null && mappings.containsKey(indexMetadata.mapping().id())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we enforce some stronger invariant here? E.g. if the mapping has an ID then it must be in mappings
? Maybe also that the rest of the mapping serialized with the index is empty?
It looks like this transparently detects when the mappings are the same. From the "outside" this is invisible. This makes me quite happy. |
Hash the mapping source of a MappingMetadata instance and then cache it in Metadata class. A mapping with the same hash will use a cached MappingMetadata instance. This can significantly reduce the number of MappingMetadata instances for data streams and index patterns. Idea originated from #69772, but just focusses on the jvm heap memory savings. And hashes the mapping instead of assigning it an uuid. Relates to #77466
Backporting elastic#80348 to 8.0 branch. Hash the mapping source of a MappingMetadata instance and then cache it in Metadata class. A mapping with the same hash will use a cached MappingMetadata instance. This can significantly reduce the number of MappingMetadata instances for data streams and index patterns. Idea originated from elastic#69772, but just focusses on the jvm heap memory savings. And hashes the mapping instead of assigning it an uuid. Relates to elastic#77466
Backporting #80348 to 8.0 branch. Hash the mapping source of a MappingMetadata instance and then cache it in Metadata class. A mapping with the same hash will use a cached MappingMetadata instance. This can significantly reduce the number of MappingMetadata instances for data streams and index patterns. Idea originated from #69772, but just focusses on the jvm heap memory savings. And hashes the mapping instead of assigning it an uuid. Relates to #77466
This change modifies cluster state to avoid duplication of mappings between indices. This helps when there are many indices sharing the same big mappings. This should lower memory consumption and speed up cluster state updates as there will be less things to write each time. Deduplication is done on 2 levels:
org.elasticsearch.cluster.metadata.Metadata.Builder
which makes sure that the same mappings use the same instanceUsing following test on master vs this branch there's 2.34x reduction in cluster state size (1508kB vs 644kB):
Possible changes/improvements:
Metadata
instead of building it inMetadata.Builder
so we don't have to rebuild it every time (it's rather quick process though)CompressedXContent
so we can deduplicate mappings inIndexMetadata
andIndexTemplateMetadata
(not a big gain if there are many indices using the same template)