-
Notifications
You must be signed in to change notification settings - Fork 5.5k
Upgrade to Hive 3.0.0 #15805
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade to Hive 3.0.0 #15805
Conversation
|
Added you as a contributor. Not sure if that will let you restart tests yourself. I will need to force push to update just to get the FB integration going. |
d7e93ae to
16957f4
Compare
|
@imjalpreet this is the same as the last PR just again modulo the one change I made to the Hive version (3.0.0-3 instead of 3.0.0)? We don't need to do a full review? |
|
The collection delim fixes look correct to me. I added the "Cherry-pick of" to it. It would be nice to also have those for the "Update to HIve 3.0.0". Generally linking context a bit so people don't have to search is low effort and helpful. I am going to test this now against the queries we had issues with and we can see if it works. |
|
@aweisberg Yes this is the same PR as the earlier one. It just includes the fix for collection delimiter for Hive versions < 3.0.0.
Sure I can add that for reference. I did not add that before since this PR had some extra changes as well. So I had only added Co-authored tag. |
Fix TestOrcBatchPageSourceMemoryTracking unit test failures Fork BloomFilter class from org.apache.hive.common.util.BloomFilter Fix presto-orc plugin unit test failures Cherry-pick of trinodb/trino#57 (trinodb/trino#57) and trinodb/trino#203 (trinodb/trino#203) Co-authored-by: David Phillips <[email protected]>
Cherry-pick of https://github.com/trinodb/trino/pull/1733/files and https://github.com/trinodb/trino/pull/1321/files Co-authored-by: Anurag Sharma <[email protected]> Co-authored-by: wujianchao <[email protected]>
|
I was able to test it on one of the failed queries and it works correctly now. The tables are gone for the other two, but I am confident that we fixed the underlying issue for each query. |
|
Everything passed, but failed on a flaky test in presto-main. |
|
Trino doesn't appear to do anything special. I can't find It seems like a real concern to me, but maybe we just release note it? I imagine not many people are using TEXTFILE as an output format in Presto. |
I wonder how much work it would be to add a config option to emit both names or maybe just the "legacy" name if enabled? I suspect there is a relatively small number of users using text formats and also using other compute engines, but I'm sure it exists and they'll be upset if they have no recourse. |
I was thinking about adding both, but is it the case that it will error out on unrecognized options if both are present? |
Can't say that I know... seems like that would depend on the SerDe or other query engine in question, which makes it kind of a black box. It does seem kind of wonky to set both properties by default and keep propagating that silly typo into the future, so making the creation configurable but accepting either on when reading seems like the best compromise to me. |
|
I looked at the hive source and it seems like it does track what the list of supported properties are in an annotation, but not to error. There is one case where it compares props from two different tables and this would break in an environment that mixes them. TBH not sure what to do here, but the urge to ship it is strong. |
I agree, it's hard to know what the right thing to do here is in advance. It's probably the right call to just move forward, document the potential incompatibility, and see who comes forward with issues afterwards. Once we get a sense for who this broke we'll have a better answer for how to mitigate it- and it does seem like something that can be mitigated. |
I agree. If we know the use cases it broke for, we'll be able to know what solution is best. For now at least not breaking reads seems sufficient. |
|
@aweisberg or @imjalpreet can you add the release note information to the PR description. |
|
@rschlussel @aweisberg I have updated the description with the release note, please let me know in case anything else needs to be added. |
|
@imjalpreet, Can you explain why the dependency on parquet-xxx is removed in this change? By doing that, we are only using the Parquet via Hive but not directly using Parquet anymore. I am working on #14960, which is to upgrade Parquet version to 1.11.x, but surprisingly that doesn't work anymore because of this change. FYI @vkorukanti @zhenxiao @highker |
@shangxinli The parquet dependencies were only added here until the hive was upgraded. This could be done since the earlier version of the shaded version of Hive (we were using 1.2.2-2) had Parquet 1.6.0 and since Apache had changed the namespace of Parquet classes in the recent versions it allowed having both new and old parquet classes in the classpath. With the upgrade to Hive 3, we don't need to separately include the parquet dependencies as we get them along with Hive itself. Since we maintain shaded version of Hive for Presto, you can upgrade parquet to 1.11.x in presto-hive-apache and we can release that change and start using that. I think there is already a PR in progress to upgrade to 1.11.0. Let me know in case you have any questions. |
|
Thanks @imjalpreet for the quick response! I can understand the reason and it is a good effort to bring into one version. What is the release frequency of presto-hive-apache? It seems upgrading Parquet in Presto becomes more difficult as we have to 1) change presto-hive-apache 2) Wait till the release presto-hive-apache 3) change Presto. I already upgraded Iceberg to 1.11.1 and I also opened a ticket to upgrade to Parquet 1.12.0. When Iceberg connector catches up upgrading, that means we have to go through steps 1), 2), 3) very often. Also, with this change when upgrading presto-hive-apache in Presto, that could implicitly upgrade Parquet to newer version. For PR #14960, several committers already have concerns and want us to wait till deployment in production. I am not sure if we still have this path of control if we implicitly upgrade Parquet in the future. Are you open to add back the direct dependency on Parquet? If yes, I can add them back in my PR #14960. |
|
Any thoughts @imjalpreet? |
|
@shangxinli Sorry I missed your comment. Please find my response below.
There isn't a set release frequency for presto-hive-apache, as far as I am aware it can be released whenever there is a change required or update required. @rschlussel can confirm.
I assume those releases can be coordinated. Although @rschlussel @aweisberg would be better to comment on that.
I don't think it is possible now to add the direct dependency to Parquet. It was possible earlier because previous version of Hive used to have a different classpath for Parquet dependencies, so it was possible to have different versions of Parquet present in the classpath. The point I am trying to make is even if you add a direct dependency to Parquet and exclude the Parquet dependencies coming from presto-hive-apache, on any upgrade of Parquet in Presto you will have to make some changes in presto-hive-apache since there can be breaking changes with an upgrade and hive read and write path would expect the older versions of Parquet classes since it had been build with that version. Unfortunately there is no way to skip this. Let me know in case you have any concerns. |
|
Thanks @imjalpreet! If that is the only option, let's go with that. The only thing is we have a dependency on "upgrading to 1.11.x" to open source Parquet Column Index(PR #15454) and Column Encryption PRhttps://github.com//issues/12048). It would be great if you can help to expedite the PR. |
|
@aweisberg @rschlussel would you be able to help on this? |
|
@imjalpreet, I see PR46 has been merged. For the next steps, I guess we need to produce a new version of 'presto-hive-apache' and make a change in Presto. Is it something you will take care of, or you want me to do something? Let me know. |
@shangxinli First requirement will be to release the new version of presto-hive-apache and once that is done we can work on the changes to update the version here and if any other changes that arise due to that. For releasing the new version of presto-hive-apache, @aweisberg @rschlussel can you please help? |
|
yeah, i can do the release |
|
presto-hive-apache 3.0.0-4 should now be released |
|
@rschlussel thanks for the help. |
This PR includes the base changes required for upgrading to Hive 3. This will help in supporting and bringing a number of interesting features in Hive 3 to Presto in the future.
depends on https://github.com/facebookexternal/presto-facebook/pull/1450