Skip to content

Optimize ST_EXTENT_AGG for geo_shape and cartesian_shape #119889

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 39 commits into from
Jan 16, 2025

Conversation

craigtaverner
Copy link
Contributor

Support for ST_EXTENT_AGG was added in #118829, and then partially optimized in #118829. This optimization worked only for cartesian_shape fields, and worked by extracting the Extent from the doc-values and re-encoding it as a WKB BBOX geometry. This does not work for geo_shape, where we need to retain all 6 integers stored in the doc-values, in order to perform the datelline choice only at reduce time during the final phase of the aggregation.

Since both geo_shape and cartesian_shape perform the aggregations using integers, and the original Extent values in the doc-values are integers, this PR expands the previous optimization by:

  • Saving all Extent values into a multi-valued field in an IntBlock for both cartesian_shape and geo_shape
  • Simplifying the logic around merging intermediate states for all cases (geo/cartesian and grouped and non-grouped aggs)
  • Widening test cases for testing more combinations of aggregations and types, and fixing a few bugs found

@craigtaverner craigtaverner added >enhancement :Analytics/Geo Indexing, search aggregations of geo points and shapes Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) auto-backport Automatically create backport pull requests when merged :Analytics/ES|QL AKA ESQL v8.18.0 v9.0.0 labels Jan 9, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @craigtaverner, I've created a changelog YAML for you.

…uster issues

Currently the fact that this index is also used for ENRICH means the test infrastructure duplicates the index across the cluster, which leads to incorrect results in COUNT stats (double the rows). We avoid this by using airports index only if we want to get counts, and ENRICH against airport boundaries to get the geo_shapes.
These versions cannot use COUNT() due to the issue with indexes that are also used in enrich policies being duplicated across multi-cluster IT environments.
…hange in meaning

Of some int values in the block, particularly those that mean unset or infinity.
We had many different parameter orderings, but want to stardize on just two:
* For cartesian we use Rectangle class order (minX, maxX, maxY, minY)
* For geo we use Extent class order (top, bottom, negLeft, negRight, posLeft, posRight)

Since the PR already changed the meaning of the invalid/infinite values of the intermediate state integers, it was already not compatible with the previous cluster versions. We disabled mixed-cluster testing to prevent errors as a result of that. This leaves us the opportunity to make further changes that are mixed-cluster incompatible, hence the decision to perform this consistency update now.
@@ -116,6 +116,8 @@ public interface PointVisitor {
boolean isValid();

Rectangle getResult();

void reset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a line of documentation on what this is supposed to do.

} else if (Double.isInfinite(minNegX)) {
return new Rectangle(minPosX, maxPosX, maxY, minY);
assert Double.isFinite(top);
// Due to this data coming through Extent (and aggs that use the same approach), which saves values as integers,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this comment. For one thing, why does the data coming in as integers mean we must use the below equation? For another, we don't actually do it! We just check for infinity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is meant to explain why the code works. We could change the code to be more like the comment, but I wanted to minimize code changes, while simultaneously increasing clarity by adding the comment. The key point here is when looking for infinities we have to look at posRight and negLeft.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll delete the comment and change the code to instead do that explicitly.

);
builder.appendBytesRef(new BytesRef(WellKnownBinary.toWKB(rectangle, ByteOrder.LITTLE_ENDIAN)));
// We store the 6 values as a single multi-valued field, in the same order as the fields in the Extent class
// This requires that consumers also know the meaning of the values, which they can learn from the Extent class
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understood the second line here. Perhaps rephrase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot think of a clearer way of saying this. Can you make a suggestion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the word 'Consumer'? I consider this code the producer (produces an int[6] array). Any consumer that should read that array needs to know what each of the 6 values means. For example, it needs to know that a[1] means the bottom latitude (minimum y value).

Copy link
Contributor

@GalLalouche GalLalouche Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it could be rephrased as: "See that class for the meaning of these values".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now that you were asking for less verbocity, since my two comment lines overlap quite a bit. I've deleted the second line, as I agree it is excessive.

@@ -497,6 +511,23 @@ private void combineRawInputForBytesRef(MethodSpec.Builder builder, String block
builder.addStatement("$T.combine(state, $L.getBytesRef(i, scratch))", declarationType, blockVariable);
}

private void startWarningsBlock(MethodSpec.Builder builder) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since any piece of code which starts with startWarningBlock should always end with endWarningsBlock, I think this should be enforced, e.g., by passing in a lambda to a template which wraps it with those two functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is possible, but note that this is not new code and the previous version of this did not do that either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is not enforced, and someone makes a mistake, the generated code will not compile, so we immediately get a failure. I think that is sufficient, and probably the reason this was not done before.


private String valueTypeString() {
String valueTypeString = TypeName.get(valueTypeMirror()).toString();
if (valuesIsArray) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using ternary expression instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -49,9 +50,20 @@
public class SpatialShapeBoundsExtraction extends ParameterizedOptimizerRule<AggregateExec, LocalPhysicalOptimizerContext> {
@Override
protected PhysicalPlan rule(AggregateExec aggregate, LocalPhysicalOptimizerContext ctx) {
var foundAttributes = new HashSet<Attribute>();
var foundAttributes = findSpatialShapeBoundsAttributes(aggregate, ctx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use var here, as the type isn't clear from the name or expression.

return foundAttributes;
}

private PhysicalPlan transformFieldExtractExec(FieldExtractExec fieldExtractExec, Set<Attribute> foundAttributes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be static.

return fieldExtractExec.withBoundsAttributes(boundsAttributes);
}

private PhysicalPlan transformAggregateExec(AggregateExec agg, Set<Attribute> foundAttributes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be static.


return aggregate.transformDown(UnaryExec.class, exec -> {
private Set<Attribute> findSpatialShapeBoundsAttributes(AggregateExec aggregate, LocalPhysicalOptimizerContext ctx) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be static.

for (int c = 0; c < docIndices.getPositionCount(); c++) {
int doc = docIndices.getInt(c);
int count = bytesRefBlock.getValueCount(doc);
int i = bytesRefBlock.getFirstValueIndex(doc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this nearer to where it is used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted it close to the previous line, so we get the key information we need from the bytesRefBlock in one place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a way to move it and make it look better.

@craigtaverner craigtaverner merged commit 40c34cd into elastic:main Jan 16, 2025
16 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 119889

rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Jan 16, 2025
)

Support for `ST_EXTENT_AGG` was added in elastic#118829, and then partially optimized in elastic#118829. This optimization worked only for cartesian_shape fields, and worked by extracting the Extent from the doc-values and re-encoding it as a WKB `BBOX` geometry. This does not work for geo_shape, where we need to retain all 6 integers stored in the doc-values, in order to perform the datelline choice only at reduce time during the final phase of the aggregation.

Since both geo_shape and cartesian_shape perform the aggregations using integers, and the original Extent values in the doc-values are integers, this PR expands the previous optimization by:
* Saving all Extent values into a multi-valued field in an IntBlock for both cartesian_shape and geo_shape
* Simplifying the logic around merging intermediate states for all cases (geo/cartesian and grouped and non-grouped aggs)
* Widening test cases for testing more combinations of aggregations and types, and fixing a few bugs found
* Enhancing cartesian extent to convert from 6 ints to 4 ints at block loading time (for efficiency)
* Fixing bugs in both cartesian and geo extents for generating intermediate state with missing groups (flaky tests in serverless)
* Moved the int order to always match Rectangle for 4-int and Extent for 6-int cases (improved internal consistency)

Since the PR already changed the meaning of the invalid/infinite values of the intermediate state integers, it was already not compatible with the previous cluster versions. We disabled mixed-cluster testing to prevent errors as a result of that. This leaves us the opportunity to make further changes that are mixed-cluster incompatible, hence the decision to perform this consistency update now.
craigtaverner added a commit to craigtaverner/elasticsearch that referenced this pull request Feb 11, 2025
)

Support for `ST_EXTENT_AGG` was added in elastic#118829, and then partially optimized in elastic#118829. This optimization worked only for cartesian_shape fields, and worked by extracting the Extent from the doc-values and re-encoding it as a WKB `BBOX` geometry. This does not work for geo_shape, where we need to retain all 6 integers stored in the doc-values, in order to perform the datelline choice only at reduce time during the final phase of the aggregation.

Since both geo_shape and cartesian_shape perform the aggregations using integers, and the original Extent values in the doc-values are integers, this PR expands the previous optimization by:
* Saving all Extent values into a multi-valued field in an IntBlock for both cartesian_shape and geo_shape
* Simplifying the logic around merging intermediate states for all cases (geo/cartesian and grouped and non-grouped aggs)
* Widening test cases for testing more combinations of aggregations and types, and fixing a few bugs found
* Enhancing cartesian extent to convert from 6 ints to 4 ints at block loading time (for efficiency)
* Fixing bugs in both cartesian and geo extents for generating intermediate state with missing groups (flaky tests in serverless)
* Moved the int order to always match Rectangle for 4-int and Extent for 6-int cases (improved internal consistency)

Since the PR already changed the meaning of the invalid/infinite values of the intermediate state integers, it was already not compatible with the previous cluster versions. We disabled mixed-cluster testing to prevent errors as a result of that. This leaves us the opportunity to make further changes that are mixed-cluster incompatible, hence the decision to perform this consistency update now.
elasticsearchmachine pushed a commit that referenced this pull request Feb 12, 2025
…) (#122276)

* Optimize ST_EXTENT_AGG for geo_shape and cartesian_shape (#119889)

Support for `ST_EXTENT_AGG` was added in #118829, and then partially optimized in #118829. This optimization worked only for cartesian_shape fields, and worked by extracting the Extent from the doc-values and re-encoding it as a WKB `BBOX` geometry. This does not work for geo_shape, where we need to retain all 6 integers stored in the doc-values, in order to perform the datelline choice only at reduce time during the final phase of the aggregation.

Since both geo_shape and cartesian_shape perform the aggregations using integers, and the original Extent values in the doc-values are integers, this PR expands the previous optimization by:
* Saving all Extent values into a multi-valued field in an IntBlock for both cartesian_shape and geo_shape
* Simplifying the logic around merging intermediate states for all cases (geo/cartesian and grouped and non-grouped aggs)
* Widening test cases for testing more combinations of aggregations and types, and fixing a few bugs found
* Enhancing cartesian extent to convert from 6 ints to 4 ints at block loading time (for efficiency)
* Fixing bugs in both cartesian and geo extents for generating intermediate state with missing groups (flaky tests in serverless)
* Moved the int order to always match Rectangle for 4-int and Extent for 6-int cases (improved internal consistency)

Since the PR already changed the meaning of the invalid/infinite values of the intermediate state integers, it was already not compatible with the previous cluster versions. We disabled mixed-cluster testing to prevent errors as a result of that. This leaves us the opportunity to make further changes that are mixed-cluster incompatible, hence the decision to perform this consistency update now.

* Regenerate generated files
craigtaverner added a commit to craigtaverner/elasticsearch that referenced this pull request Feb 12, 2025
…ic#119889) (elastic#122276)

* Optimize ST_EXTENT_AGG for geo_shape and cartesian_shape (elastic#119889)

Support for `ST_EXTENT_AGG` was added in elastic#118829, and then partially optimized in elastic#118829. This optimization worked only for cartesian_shape fields, and worked by extracting the Extent from the doc-values and re-encoding it as a WKB `BBOX` geometry. This does not work for geo_shape, where we need to retain all 6 integers stored in the doc-values, in order to perform the datelline choice only at reduce time during the final phase of the aggregation.

Since both geo_shape and cartesian_shape perform the aggregations using integers, and the original Extent values in the doc-values are integers, this PR expands the previous optimization by:
* Saving all Extent values into a multi-valued field in an IntBlock for both cartesian_shape and geo_shape
* Simplifying the logic around merging intermediate states for all cases (geo/cartesian and grouped and non-grouped aggs)
* Widening test cases for testing more combinations of aggregations and types, and fixing a few bugs found
* Enhancing cartesian extent to convert from 6 ints to 4 ints at block loading time (for efficiency)
* Fixing bugs in both cartesian and geo extents for generating intermediate state with missing groups (flaky tests in serverless)
* Moved the int order to always match Rectangle for 4-int and Extent for 6-int cases (improved internal consistency)

Since the PR already changed the meaning of the invalid/infinite values of the intermediate state integers, it was already not compatible with the previous cluster versions. We disabled mixed-cluster testing to prevent errors as a result of that. This leaves us the opportunity to make further changes that are mixed-cluster incompatible, hence the decision to perform this consistency update now.

* Regenerate generated files
@craigtaverner
Copy link
Contributor Author

elasticsearchmachine pushed a commit that referenced this pull request Feb 13, 2025
…) (#122276) (#122420)

* Optimize ST_EXTENT_AGG for geo_shape and cartesian_shape (#119889)

Support for `ST_EXTENT_AGG` was added in #118829, and then partially optimized in #118829. This optimization worked only for cartesian_shape fields, and worked by extracting the Extent from the doc-values and re-encoding it as a WKB `BBOX` geometry. This does not work for geo_shape, where we need to retain all 6 integers stored in the doc-values, in order to perform the datelline choice only at reduce time during the final phase of the aggregation.

Since both geo_shape and cartesian_shape perform the aggregations using integers, and the original Extent values in the doc-values are integers, this PR expands the previous optimization by:
* Saving all Extent values into a multi-valued field in an IntBlock for both cartesian_shape and geo_shape
* Simplifying the logic around merging intermediate states for all cases (geo/cartesian and grouped and non-grouped aggs)
* Widening test cases for testing more combinations of aggregations and types, and fixing a few bugs found
* Enhancing cartesian extent to convert from 6 ints to 4 ints at block loading time (for efficiency)
* Fixing bugs in both cartesian and geo extents for generating intermediate state with missing groups (flaky tests in serverless)
* Moved the int order to always match Rectangle for 4-int and Extent for 6-int cases (improved internal consistency)

Since the PR already changed the meaning of the invalid/infinite values of the intermediate state integers, it was already not compatible with the previous cluster versions. We disabled mixed-cluster testing to prevent errors as a result of that. This leaves us the opportunity to make further changes that are mixed-cluster incompatible, hence the decision to perform this consistency update now.

* Regenerate generated files
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL :Analytics/Geo Indexing, search aggregations of geo points and shapes auto-backport Automatically create backport pull requests when merged >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants