probakowski
Contributor

This change modifies the cluster state to avoid duplicating mappings across indices. This helps when many indices share the same large mappings: it should lower memory consumption and speed up cluster state updates, as there is less to write each time. Deduplication is done on two levels:

  • at runtime, by adding a cache to org.elasticsearch.cluster.metadata.Metadata.Builder that makes sure identical mappings share the same instance
  • on disk, by storing mapping metadata separately from index metadata when saving the cluster state (the index metadata then stores only the mapping id)
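
For readers unfamiliar with the approach, the runtime level can be pictured roughly as below. This is only a minimal sketch of instance sharing through a cache, not the code in this PR: `MappingSketch` and `MappingDedupCache` are hypothetical names, and the cache is keyed on the raw mapping source purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a mapping; this is NOT Elasticsearch's MappingMetadata.
final class MappingSketch {
    final String source;

    MappingSketch(String source) {
        this.source = source;
    }
}

// Hypothetical builder-side cache: hands back one shared instance per distinct mapping source.
final class MappingDedupCache {
    private final Map<String, MappingSketch> bySource = new HashMap<>();

    // Returns the previously cached instance when an equal mapping was seen before,
    // otherwise caches and returns the candidate.
    MappingSketch intern(MappingSketch candidate) {
        MappingSketch existing = bySource.putIfAbsent(candidate.source, candidate);
        return existing != null ? existing : candidate;
    }

    // Number of distinct mappings currently held.
    int size() {
        return bySource.size();
    }
}
```

With a cache like this, 500 indices created from the same template would reference a single mapping object instead of 500 equal copies.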

Using the following test, this branch shows a 2.34x reduction in cluster state size compared to master (1508kB vs 644kB):

curl https://gist.githubusercontent.com/probakowski/f14623080d7e28e74e572f564db5bff3/raw/583b32091985a315304ad1cf7564242a82273973/auditbeat.json > auditbeat.json
curl -XPUT -u elastic-admin:elastic-password 'http://localhost:9200/_template/auditbeat-7.8.0' --header 'Content-Type: application/json' --data-binary "@auditbeat.json"
for i in $(seq 1 500); do curl -XPUT -u elastic-admin:elastic-password http://localhost:9200/auditbeat-$i; done

(Images not preserved: "master" and "deduplicated".)

Possible changes/improvements:

  • store the mappings cache in Metadata instead of building it in Metadata.Builder, so we don't have to rebuild it every time (it's a rather quick process, though)
  • move the id to CompressedXContent so we can deduplicate mappings in IndexMetadata and IndexTemplateMetadata (not a big gain if many indices use the same template)
  • deduplicate settings: the same idea could be applied to index settings, which should yield another significant reduction in cluster state size

@probakowski
Contributor Author

@elasticmachine update branch

Contributor

@DaveCTurner DaveCTurner left a comment

Just a few small comments on the cluster state persistence changes

private static final String DATA_FIELD_NAME = "data";
private static final String GLOBAL_TYPE_NAME = "global";
private static final String INDEX_TYPE_NAME = "index";
private static final String MAPPING_TYPE_NAME = "mapping";

Please remember to update the comment at the top describing the new schema :)


Map<String, MappingMetadata> mappings = new HashMap<>();
consumeFromType(searcher, MAPPING_TYPE_NAME, bytes -> {
MappingMetadata mappingMetadata = MappingMetadata.fromXContent(XContentFactory.xContent(XContentType.SMILE)

Could we include trace logging for loading each mapping too?

if (indexUUIDs.add(indexMetadata.getIndexUUID()) == false) {
throw new IllegalStateException("duplicate metadata found for " + indexMetadata.getIndex() + " in [" + dataPath + "]");
}
if (indexMetadata.mapping() != null && mappings.containsKey(indexMetadata.mapping().id())) {

Can we enforce some stronger invariant here? E.g. if the mapping has an ID then it must be in mappings? Maybe also that the rest of the mapping serialized with the index is empty?
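
One way the first suggestion could be read, as a rough sketch only: a lookup that fails loudly when an index references a mapping id that was never loaded. `MappingInvariant` and `resolve` are hypothetical names, not part of the PR or of Elasticsearch, and the second suggestion (that the mapping body serialized with the index is empty) is not covered here.

```java
import java.util.Map;

// Hypothetical helper illustrating the stricter invariant; not code from this PR.
final class MappingInvariant {

    // Returns the mapping registered under mappingId, or null when the index has no mapping.
    // Throws if the index references an id that is missing from the loaded mappings.
    static <M> M resolve(String mappingId, Map<String, M> mappings, String indexName) {
        if (mappingId == null) {
            return null; // index without a mapping: nothing to look up
        }
        M mapping = mappings.get(mappingId);
        if (mapping == null) {
            throw new IllegalStateException(
                "mapping [" + mappingId + "] referenced by index [" + indexName + "] was not found");
        }
        return mapping;
    }
}
```

Compared with a plain containsKey check, this turns a dangling mapping id into an error instead of silently skipping it.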

@nik9000
Member

nik9000 commented Mar 5, 2021

This change modifies cluster state to avoid duplication of mappings between indices.

It looks like this transparently detects when the mappings are the same. From the "outside" this is invisible. This makes me quite happy.

martijnvg added a commit that referenced this pull request Nov 25, 2021
Hash the mapping source of a MappingMetadata instance and then
cache it in Metadata class. A mapping with the same hash
will use a cached MappingMetadata instance. This can
significantly reduce the number of MappingMetadata instances
for data streams and index patterns.

Idea originated from #69772, but just focusses on the jvm heap memory savings.
And hashes the mapping instead of assigning it an uuid.

Relates to #77466
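
The hash-based variant described in this commit message can be pictured roughly as follows. This is a minimal sketch under assumptions: `HashedMappingCache` is a hypothetical name, SHA-256 is chosen only for illustration (the commit message does not say which hash is used), and plain strings stand in for MappingMetadata instances.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache keyed by a hash of the mapping source rather than by an assigned id.
final class HashedMappingCache {
    private final Map<String, String> byHash = new HashMap<>();

    // Hashes the mapping source; SHA-256 is an assumption made for this sketch.
    static String sha256(String source) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(source.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(hash);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Returns the cached instance for this hash, caching the given source on first sight.
    String intern(String source) {
        return byHash.computeIfAbsent(sha256(source), hash -> source);
    }

    // Number of distinct hashes currently cached.
    int size() {
        return byHash.size();
    }
}
```

Because the key is derived from the content itself, equal mappings land on the same cache entry without any id having to be assigned or persisted.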
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Nov 25, 2021
Backporting elastic#80348 to 8.0 branch.

Hash the mapping source of a MappingMetadata instance and then
cache it in Metadata class. A mapping with the same hash
will use a cached MappingMetadata instance. This can
significantly reduce the number of MappingMetadata instances
for data streams and index patterns.

Idea originated from elastic#69772, but just focusses on the jvm heap memory savings.
And hashes the mapping instead of assigning it an uuid.

Relates to elastic#77466
martijnvg added a commit that referenced this pull request Nov 25, 2021
Backporting #80348 to 8.0 branch.

Hash the mapping source of a MappingMetadata instance and then
cache it in Metadata class. A mapping with the same hash
will use a cached MappingMetadata instance. This can
significantly reduce the number of MappingMetadata instances
for data streams and index patterns.

Idea originated from #69772, but just focusses on the jvm heap memory savings.
And hashes the mapping instead of assigning it an uuid.

Relates to #77466
@probakowski probakowski deleted the dedup_mappings branch January 17, 2022 19:35