-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Optimize ST_EXTENT_AGG for geo_shape and cartesian_shape #119889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize ST_EXTENT_AGG for geo_shape and cartesian_shape #119889
Conversation
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Hi @craigtaverner, I've created a changelog YAML for you. |
This reverts commit 898d1ba.
…uster issues Currently the fact that this index is also used for ENRICH means the test infrastructure duplicates the index across the cluster, which leads to incorrect results in COUNT stats (double the rows). We avoid this by using airports index only if we want to get counts, and ENRICH against airport boundaries to get the geo_shapes.
These versions cannot use COUNT() due to the issue with indexes that are also used in enrich policies being duplicated across multi-cluster IT environments.
…hange in meaning Of some int values in the block, particularly those that mean unset or infinity.
We had many different parameter orderings, but want to stardize on just two: * For cartesian we use Rectangle class order (minX, maxX, maxY, minY) * For geo we use Extent class order (top, bottom, negLeft, negRight, posLeft, posRight) Since the PR already changed the meaning of the invalid/infinite values of the intermediate state integers, it was already not compatible with the previous cluster versions. We disabled mixed-cluster testing to prevent errors as a result of that. This leaves us the opportunity to make further changes that are mixed-cluster incompatible, hence the decision to perform this consistency update now.
@@ -116,6 +116,8 @@ public interface PointVisitor { | |||
boolean isValid(); | |||
|
|||
Rectangle getResult(); | |||
|
|||
void reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a line of documentation on what this is supposed to do.
} else if (Double.isInfinite(minNegX)) { | ||
return new Rectangle(minPosX, maxPosX, maxY, minY); | ||
assert Double.isFinite(top); | ||
// Due to this data coming through Extent (and aggs that use the same approach), which saves values as integers, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this comment. For one thing, why does the data coming in as integers mean we must use the below equation? For another, we don't actually do it! We just check for infinity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is meant to explain why the code works. We could change the code to be more like the comment, but I wanted to minimize code changes, while simultaneously increasing clarity by adding the comment. The key point here is when looking for infinities we have to look at posRight and negLeft.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll delete the comment and change the code to instead do that explicitly.
); | ||
builder.appendBytesRef(new BytesRef(WellKnownBinary.toWKB(rectangle, ByteOrder.LITTLE_ENDIAN))); | ||
// We store the 6 values as a single multi-valued field, in the same order as the fields in the Extent class | ||
// This requires that consumers also know the meaning of the values, which they can learn from the Extent class |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understood the second line here. Perhaps rephrase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot think of a clearer way of saying this. Can you make a suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it the word 'Consumer'? I consider this code the producer (produces an int[6] array). Any consumer that should read that array needs to know what each of the 6 values means. For example, it needs to know that a[1] means the bottom latitude (minimum y value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it could be rephrased as: "See that class for the meaning of these values".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now that you were asking for less verbocity, since my two comment lines overlap quite a bit. I've deleted the second line, as I agree it is excessive.
@@ -497,6 +511,23 @@ private void combineRawInputForBytesRef(MethodSpec.Builder builder, String block | |||
builder.addStatement("$T.combine(state, $L.getBytesRef(i, scratch))", declarationType, blockVariable); | |||
} | |||
|
|||
private void startWarningsBlock(MethodSpec.Builder builder) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since any piece of code which starts with startWarningBlock
should always end with endWarningsBlock
, I think this should be enforced, e.g., by passing in a lambda to a template which wraps it with those two functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is possible, but note that this is not new code and the previous version of this did not do that either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is not enforced, and someone makes a mistake, the generated code will not compile, so we immediately get a failure. I think that is sufficient, and probably the reason this was not done before.
|
||
private String valueTypeString() { | ||
String valueTypeString = TypeName.get(valueTypeMirror()).toString(); | ||
if (valuesIsArray) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using ternary expression instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -49,9 +50,20 @@ | |||
public class SpatialShapeBoundsExtraction extends ParameterizedOptimizerRule<AggregateExec, LocalPhysicalOptimizerContext> { | |||
@Override | |||
protected PhysicalPlan rule(AggregateExec aggregate, LocalPhysicalOptimizerContext ctx) { | |||
var foundAttributes = new HashSet<Attribute>(); | |||
var foundAttributes = findSpatialShapeBoundsAttributes(aggregate, ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use var
here, as the type isn't clear from the name or expression.
return foundAttributes; | ||
} | ||
|
||
private PhysicalPlan transformFieldExtractExec(FieldExtractExec fieldExtractExec, Set<Attribute> foundAttributes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be static
.
return fieldExtractExec.withBoundsAttributes(boundsAttributes); | ||
} | ||
|
||
private PhysicalPlan transformAggregateExec(AggregateExec agg, Set<Attribute> foundAttributes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be static
.
|
||
return aggregate.transformDown(UnaryExec.class, exec -> { | ||
private Set<Attribute> findSpatialShapeBoundsAttributes(AggregateExec aggregate, LocalPhysicalOptimizerContext ctx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be static
.
for (int c = 0; c < docIndices.getPositionCount(); c++) { | ||
int doc = docIndices.getInt(c); | ||
int count = bytesRefBlock.getValueCount(doc); | ||
int i = bytesRefBlock.getFirstValueIndex(doc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this nearer to where it is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted it close to the previous line, so we get the key information we need from the bytesRefBlock in one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a way to move it and make it look better.
…r/elasticsearch into st_extent_optimize_geoshape
We left the bug as a TODO for the next PR, so at least the primary optimization can get merged for tech-preview in 8.18.
…r/elasticsearch into st_extent_optimize_geoshape
…r/elasticsearch into st_extent_optimize_geoshape
This is to increase the chance that the same failure seen for geo_shape in serverless would also occur with cartesian_shape in serverless.
💔 Backport failed
You can use sqren/backport to manually backport by running |
) Support for `ST_EXTENT_AGG` was added in elastic#118829, and then partially optimized in elastic#118829. This optimization worked only for cartesian_shape fields, and worked by extracting the Extent from the doc-values and re-encoding it as a WKB `BBOX` geometry. This does not work for geo_shape, where we need to retain all 6 integers stored in the doc-values, in order to perform the datelline choice only at reduce time during the final phase of the aggregation. Since both geo_shape and cartesian_shape perform the aggregations using integers, and the original Extent values in the doc-values are integers, this PR expands the previous optimization by: * Saving all Extent values into a multi-valued field in an IntBlock for both cartesian_shape and geo_shape * Simplifying the logic around merging intermediate states for all cases (geo/cartesian and grouped and non-grouped aggs) * Widening test cases for testing more combinations of aggregations and types, and fixing a few bugs found * Enhancing cartesian extent to convert from 6 ints to 4 ints at block loading time (for efficiency) * Fixing bugs in both cartesian and geo extents for generating intermediate state with missing groups (flaky tests in serverless) * Moved the int order to always match Rectangle for 4-int and Extent for 6-int cases (improved internal consistency) Since the PR already changed the meaning of the invalid/infinite values of the intermediate state integers, it was already not compatible with the previous cluster versions. We disabled mixed-cluster testing to prevent errors as a result of that. This leaves us the opportunity to make further changes that are mixed-cluster incompatible, hence the decision to perform this consistency update now.
) Support for `ST_EXTENT_AGG` was added in elastic#118829, and then partially optimized in elastic#118829. This optimization worked only for cartesian_shape fields, and worked by extracting the Extent from the doc-values and re-encoding it as a WKB `BBOX` geometry. This does not work for geo_shape, where we need to retain all 6 integers stored in the doc-values, in order to perform the datelline choice only at reduce time during the final phase of the aggregation. Since both geo_shape and cartesian_shape perform the aggregations using integers, and the original Extent values in the doc-values are integers, this PR expands the previous optimization by: * Saving all Extent values into a multi-valued field in an IntBlock for both cartesian_shape and geo_shape * Simplifying the logic around merging intermediate states for all cases (geo/cartesian and grouped and non-grouped aggs) * Widening test cases for testing more combinations of aggregations and types, and fixing a few bugs found * Enhancing cartesian extent to convert from 6 ints to 4 ints at block loading time (for efficiency) * Fixing bugs in both cartesian and geo extents for generating intermediate state with missing groups (flaky tests in serverless) * Moved the int order to always match Rectangle for 4-int and Extent for 6-int cases (improved internal consistency) Since the PR already changed the meaning of the invalid/infinite values of the intermediate state integers, it was already not compatible with the previous cluster versions. We disabled mixed-cluster testing to prevent errors as a result of that. This leaves us the opportunity to make further changes that are mixed-cluster incompatible, hence the decision to perform this consistency update now.
…) (#122276) * Optimize ST_EXTENT_AGG for geo_shape and cartesian_shape (#119889) Support for `ST_EXTENT_AGG` was added in #118829, and then partially optimized in #118829. This optimization worked only for cartesian_shape fields, and worked by extracting the Extent from the doc-values and re-encoding it as a WKB `BBOX` geometry. This does not work for geo_shape, where we need to retain all 6 integers stored in the doc-values, in order to perform the datelline choice only at reduce time during the final phase of the aggregation. Since both geo_shape and cartesian_shape perform the aggregations using integers, and the original Extent values in the doc-values are integers, this PR expands the previous optimization by: * Saving all Extent values into a multi-valued field in an IntBlock for both cartesian_shape and geo_shape * Simplifying the logic around merging intermediate states for all cases (geo/cartesian and grouped and non-grouped aggs) * Widening test cases for testing more combinations of aggregations and types, and fixing a few bugs found * Enhancing cartesian extent to convert from 6 ints to 4 ints at block loading time (for efficiency) * Fixing bugs in both cartesian and geo extents for generating intermediate state with missing groups (flaky tests in serverless) * Moved the int order to always match Rectangle for 4-int and Extent for 6-int cases (improved internal consistency) Since the PR already changed the meaning of the invalid/infinite values of the intermediate state integers, it was already not compatible with the previous cluster versions. We disabled mixed-cluster testing to prevent errors as a result of that. This leaves us the opportunity to make further changes that are mixed-cluster incompatible, hence the decision to perform this consistency update now. * Regenerate generated files
…ic#119889) (elastic#122276) * Optimize ST_EXTENT_AGG for geo_shape and cartesian_shape (elastic#119889) Support for `ST_EXTENT_AGG` was added in elastic#118829, and then partially optimized in elastic#118829. This optimization worked only for cartesian_shape fields, and worked by extracting the Extent from the doc-values and re-encoding it as a WKB `BBOX` geometry. This does not work for geo_shape, where we need to retain all 6 integers stored in the doc-values, in order to perform the datelline choice only at reduce time during the final phase of the aggregation. Since both geo_shape and cartesian_shape perform the aggregations using integers, and the original Extent values in the doc-values are integers, this PR expands the previous optimization by: * Saving all Extent values into a multi-valued field in an IntBlock for both cartesian_shape and geo_shape * Simplifying the logic around merging intermediate states for all cases (geo/cartesian and grouped and non-grouped aggs) * Widening test cases for testing more combinations of aggregations and types, and fixing a few bugs found * Enhancing cartesian extent to convert from 6 ints to 4 ints at block loading time (for efficiency) * Fixing bugs in both cartesian and geo extents for generating intermediate state with missing groups (flaky tests in serverless) * Moved the int order to always match Rectangle for 4-int and Extent for 6-int cases (improved internal consistency) Since the PR already changed the meaning of the invalid/infinite values of the intermediate state integers, it was already not compatible with the previous cluster versions. We disabled mixed-cluster testing to prevent errors as a result of that. This leaves us the opportunity to make further changes that are mixed-cluster incompatible, hence the decision to perform this consistency update now. * Regenerate generated files
Backports done manually: |
…) (#122276) (#122420) * Optimize ST_EXTENT_AGG for geo_shape and cartesian_shape (#119889) Support for `ST_EXTENT_AGG` was added in #118829, and then partially optimized in #118829. This optimization worked only for cartesian_shape fields, and worked by extracting the Extent from the doc-values and re-encoding it as a WKB `BBOX` geometry. This does not work for geo_shape, where we need to retain all 6 integers stored in the doc-values, in order to perform the datelline choice only at reduce time during the final phase of the aggregation. Since both geo_shape and cartesian_shape perform the aggregations using integers, and the original Extent values in the doc-values are integers, this PR expands the previous optimization by: * Saving all Extent values into a multi-valued field in an IntBlock for both cartesian_shape and geo_shape * Simplifying the logic around merging intermediate states for all cases (geo/cartesian and grouped and non-grouped aggs) * Widening test cases for testing more combinations of aggregations and types, and fixing a few bugs found * Enhancing cartesian extent to convert from 6 ints to 4 ints at block loading time (for efficiency) * Fixing bugs in both cartesian and geo extents for generating intermediate state with missing groups (flaky tests in serverless) * Moved the int order to always match Rectangle for 4-int and Extent for 6-int cases (improved internal consistency) Since the PR already changed the meaning of the invalid/infinite values of the intermediate state integers, it was already not compatible with the previous cluster versions. We disabled mixed-cluster testing to prevent errors as a result of that. This leaves us the opportunity to make further changes that are mixed-cluster incompatible, hence the decision to perform this consistency update now. * Regenerate generated files
Support for
ST_EXTENT_AGG
was added in #118829, and then partially optimized in #118829. This optimization worked only for cartesian_shape fields, and worked by extracting the Extent from the doc-values and re-encoding it as a WKBBBOX
geometry. This does not work for geo_shape, where we need to retain all 6 integers stored in the doc-values, in order to perform the datelline choice only at reduce time during the final phase of the aggregation.Since both geo_shape and cartesian_shape perform the aggregations using integers, and the original Extent values in the doc-values are integers, this PR expands the previous optimization by: