Skip to content

Conversation

@imjalpreet
Copy link
Member

@imjalpreet imjalpreet commented Mar 11, 2021

This PR includes the base changes required for upgrading to Hive 3. This will help in supporting and bringing a number of interesting features in Hive 3 to Presto in the future.

depends on https://github.com/facebookexternal/presto-facebook/pull/1450

== RELEASE NOTES ==

Hive Changes
* Add Support for Hive Connector to work with Hive 3 metastore for basic tables (ACID Tables and tables using constraints are not supported yet) (:pr:`15805`)

@aweisberg aweisberg self-requested a review March 11, 2021 13:33
@aweisberg
Copy link
Contributor

Added you as a contributor. Not sure if that will let you restart tests yourself. I will need to force push to update just to get the FB integration going.

@aweisberg aweisberg force-pushed the HiveUpgrade branch 2 times, most recently from d7e93ae to 16957f4 Compare March 11, 2021 13:43
@aweisberg
Copy link
Contributor

@imjalpreet this is the same as the last PR just again modulo the one change I made to the Hive version (3.0.0-3 instead of 3.0.0)? We don't need to do a full review?

@aweisberg
Copy link
Contributor

aweisberg commented Mar 11, 2021

The collection delim fixes look correct to me. I added the "Cherry-pick of" to it. It would be nice to also have those for the "Update to HIve 3.0.0". Generally linking context a bit so people don't have to search is low effort and helpful.

I am going to test this now against the queries we had issues with and we can see if it works.

@imjalpreet
Copy link
Member Author

imjalpreet commented Mar 11, 2021

@aweisberg Yes this is the same PR as the earlier one. It just includes the fix for collection delimiter for Hive versions < 3.0.0.

It would be nice to also have those for the "Update to HIve 3.0.0"

Sure I can add that for reference. I did not add that before since this PR had some extra changes as well. So I had only added Co-authored tag.

imjalpreet and others added 2 commits March 11, 2021 20:04
Fix TestOrcBatchPageSourceMemoryTracking unit test failures

Fork BloomFilter class from org.apache.hive.common.util.BloomFilter

Fix presto-orc plugin unit test failures

Cherry-pick of trinodb/trino#57 (trinodb/trino#57) and trinodb/trino#203 (trinodb/trino#203)

Co-authored-by: David Phillips <[email protected]>
@aweisberg
Copy link
Contributor

I was able to test it on one of the failed queries and it works correctly now. The tables are gone for the other two, but I am confident that we fixed the underlying issue for each query.

@aweisberg
Copy link
Contributor

aweisberg commented Mar 11, 2021

Everything passed, but failed on a flaky test in presto-main.
I realized if you have multiple compute engines this won't work if it writes the correct version (for Hive 3) and the other engines don't support both.

@aweisberg
Copy link
Contributor

Trino doesn't appear to do anything special. I can't find colelction.delim in additional places in trino or trino-hive-apache.

It seems like a real concern to me, but maybe we just release note it? I imagine not many people are using TEXTFILE as an output format in Presto.

@pettyjamesm
Copy link
Contributor

Trino doesn't appear to do anything special. I can't find colelction.delim in additional places in trino or trino-hive-apache.

It seems like a real concern to me, but maybe we just release note it? I imagine not many people are using TEXTFILE as an output format in Presto.

I wonder how much work it would be to add a config option to emit both names or maybe just the "legacy" name if enabled? I suspect there is a relatively small number of users using text formats and also using other compute engines, but I'm sure it exists and they'll be upset if they have no recourse.

@aweisberg
Copy link
Contributor

I wonder how much work it would be to add a config option to emit both names or maybe just the "legacy" name if enabled? I suspect there is a relatively small number of users using text formats and also using other compute engines, but I'm sure it exists and they'll be upset if they have no recourse.

I was thinking about adding both, but is it the case that it will error out on unrecognized options if both are present?

@pettyjamesm
Copy link
Contributor

I was thinking about adding both, but is it the case that it will error out on unrecognized options if both are present?

Can't say that I know... seems like that would depend on the SerDe or other query engine in question, which makes it kind of a black box. It does seem kind of wonky to set both properties by default and keep propagating that silly typo into the future, so making the creation configurable but accepting either on when reading seems like the best compromise to me.

@aweisberg
Copy link
Contributor

I looked at the hive source and it seems like it does track what the list of supported properties are in an annotation, but not to error.

There is one case where it compares props from two different tables and this would break in an environment that mixes them.

TBH not sure what to do here, but the urge to ship it is strong.

@pettyjamesm
Copy link
Contributor

TBH not sure what to do here, but the urge to ship it is strong.

I agree, it's hard to know what the right thing to do here is in advance. It's probably the right call to just move forward, document the potential incompatibility, and see who comes forward with issues afterwards. Once we get a sense for who this broke we'll have a better answer for how to mitigate it- and it does seem like something that can be mitigated.

@rschlussel
Copy link
Contributor

TBH not sure what to do here, but the urge to ship it is strong.

I agree, it's hard to know what the right thing to do here is in advance. It's probably the right call to just move forward, document the potential incompatibility, and see who comes forward with issues afterwards. Once we get a sense for who this broke we'll have a better answer for how to mitigate it- and it does seem like something that can be mitigated.

I agree. If we know the use cases it broke for, we'll be able to know what solution is best. For now at least not breaking reads seems sufficient.

@rschlussel rschlussel merged commit a789c3c into prestodb:master Mar 11, 2021
@rschlussel
Copy link
Contributor

@aweisberg or @imjalpreet can you add the release note information to the PR description.

@imjalpreet
Copy link
Member Author

@rschlussel @aweisberg I have updated the description with the release note, please let me know in case anything else needs to be added.

@arhimondr arhimondr mentioned this pull request Mar 12, 2021
13 tasks
@shangxinli
Copy link
Collaborator

@imjalpreet, Can you explain why the dependency on parquet-xxx is removed in this change? By doing that, we are only using the Parquet via Hive but not directly using Parquet anymore. I am working on #14960, which is to upgrade Parquet version to 1.11.x, but surprisingly that doesn't work anymore because of this change. FYI @vkorukanti @zhenxiao @highker

@imjalpreet
Copy link
Member Author

@imjalpreet, Can you explain why the dependency on parquet-xxx is removed in this change? By doing that, we are only using the Parquet via Hive but not directly using Parquet anymore. I am working on #14960, which is to upgrade Parquet version to 1.11.x, but surprisingly that doesn't work anymore because of this change. FYI @vkorukanti @zhenxiao @highker

@shangxinli The parquet dependencies were only added here until the hive was upgraded. This could be done since the earlier version of the shaded version of Hive (we were using 1.2.2-2) had Parquet 1.6.0 and since Apache had changed the namespace of Parquet classes in the recent versions it allowed having both new and old parquet classes in the classpath.

With the upgrade to Hive 3, we don't need to separately include the parquet dependencies as we get them along with Hive itself.

Since we maintain shaded version of Hive for Presto, you can upgrade parquet to 1.11.x in presto-hive-apache and we can release that change and start using that. I think there is already a PR in progress to upgrade to 1.11.0.

Let me know in case you have any questions.

@shangxinli
Copy link
Collaborator

shangxinli commented Apr 11, 2021

Thanks @imjalpreet for the quick response! I can understand the reason and it is a good effort to bring into one version.

What is the release frequency of presto-hive-apache? It seems upgrading Parquet in Presto becomes more difficult as we have to 1) change presto-hive-apache 2) Wait till the release presto-hive-apache 3) change Presto. I already upgraded Iceberg to 1.11.1 and I also opened a ticket to upgrade to Parquet 1.12.0. When Iceberg connector catches up upgrading, that means we have to go through steps 1), 2), 3) very often.

Also, with this change when upgrading presto-hive-apache in Presto, that could implicitly upgrade Parquet to newer version. For PR #14960, several committers already have concerns and want us to wait till deployment in production. I am not sure if we still have this path of control if we implicitly upgrade Parquet in the future.

Are you open to add back the direct dependency on Parquet? If yes, I can add them back in my PR #14960.

@shangxinli
Copy link
Collaborator

Any thoughts @imjalpreet?

@imjalpreet
Copy link
Member Author

@shangxinli Sorry I missed your comment. Please find my response below.

What is the release frequency of presto-hive-apache? It seems upgrading Parquet in Presto becomes more difficult as we have to 1) change presto-hive-apache 2) Wait till the release presto-hive-apache 3) change Presto. I already upgraded Iceberg to 1.11.1 and I also opened a ticket to upgrade to Parquet 1.12.0. When Iceberg connector catches up upgrading, that means we have to go through steps 1), 2), 3) very often.

There isn't a set release frequency for presto-hive-apache, as far as I am aware it can be released whenever there is a change required or update required. @rschlussel can confirm.

Also, with this change when upgrading presto-hive-apache in Presto, that could implicitly upgrade Parquet to newer version. For PR #14960, several committers already have concerns and want us to wait till deployment in production. I am not sure if we still have this path of control if we implicitly upgrade Parquet in the future.

I assume those releases can be coordinated. Although @rschlussel @aweisberg would be better to comment on that.

Are you open to add back the direct dependency on Parquet? If yes, I can add them back in my PR #14960.

I don't think it is possible now to add the direct dependency to Parquet. It was possible earlier because previous version of Hive used to have a different classpath for Parquet dependencies, so it was possible to have different versions of Parquet present in the classpath.
But Hive > 3 has the new classpath for the parquet dependencies which Apache had modified in the recent versions of Parquet.

The point I am trying to make is even if you add a direct dependency to Parquet and exclude the Parquet dependencies coming from presto-hive-apache, on any upgrade of Parquet in Presto you will have to make some changes in presto-hive-apache since there can be breaking changes with an upgrade and hive read and write path would expect the older versions of Parquet classes since it had been build with that version.

Unfortunately there is no way to skip this.

Let me know in case you have any concerns.

@shangxinli
Copy link
Collaborator

Thanks @imjalpreet! If that is the only option, let's go with that. The only thing is we have a dependency on "upgrading to 1.11.x" to open source Parquet Column Index(PR #15454) and Column Encryption PRhttps://github.com//issues/12048). It would be great if you can help to expedite the PR.

@imjalpreet
Copy link
Member Author

@aweisberg @rschlussel would you be able to help on this?

@shangxinli
Copy link
Collaborator

@imjalpreet, I see PR46 has been merged. For the next steps, I guess we need to produce a new version of 'presto-hive-apache' and make a change in Presto. Is it something you will take care of, or you want me to do something? Let me know.

@imjalpreet
Copy link
Member Author

@imjalpreet, I see PR46 has been merged. For the next steps, I guess we need to produce a new version of 'presto-hive-apache' and make a change in Presto. Is it something you will take care of, or you want me to do something? Let me know.

@shangxinli First requirement will be to release the new version of presto-hive-apache and once that is done we can work on the changes to update the version here and if any other changes that arise due to that.

For releasing the new version of presto-hive-apache, @aweisberg @rschlussel can you please help?

@rschlussel
Copy link
Contributor

yeah, i can do the release

@rschlussel
Copy link
Contributor

presto-hive-apache 3.0.0-4 should now be released

@imjalpreet
Copy link
Member Author

@rschlussel thanks for the help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants