HDDS-8211. S3G QoS #4421

Draft: mohan3d wants to merge 7 commits into master

Conversation

mohan3d (Contributor) commented Mar 18, 2023

What changes were proposed in this pull request?

Added a throttler package implementing a request scheduler and some supporting utilities, plus a new filter in the S3 Gateway that accepts or rejects user requests depending on the gateway's load and each user's consumption.
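
For illustration, a minimal sketch of the kind of filter described above, written against the JAX-RS API the S3 Gateway already uses. The names (QosFilter, RequestThrottler) are hypothetical and not the actual classes added by this patch; the real throttler package makes the accept/reject decision.

    import javax.ws.rs.container.ContainerRequestContext;
    import javax.ws.rs.container.ContainerRequestFilter;
    import javax.ws.rs.container.PreMatching;
    import javax.ws.rs.core.Response;
    import javax.ws.rs.ext.Provider;

    // Illustrative sketch only; not the code in this PR.
    @Provider
    @PreMatching
    public class QosFilter implements ContainerRequestFilter {

      // Hypothetical interface standing in for the throttler package's
      // scheduler: decides whether the gateway can take another request
      // from this user right now.
      public interface RequestThrottler {
        boolean tryAcquire(String user);
      }

      private final RequestThrottler throttler;

      public QosFilter(RequestThrottler throttler) {
        this.throttler = throttler;
      }

      @Override
      public void filter(ContainerRequestContext ctx) {
        // The S3 caller is identified by the access key in the Authorization
        // header; the raw header is used here to keep the sketch short.
        String user = ctx.getHeaderString("Authorization");
        if (!throttler.tryAcquire(user == null ? "anonymous" : user)) {
          // Reject up front instead of processing the request.
          ctx.abortWith(Response
              .status(Response.Status.SERVICE_UNAVAILABLE)
              .entity("Too many requests")
              .build());
        }
      }
    }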

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8211

How was this patch tested?

Unit tests and manual testing.

xBis7 (Contributor) left a comment

@mohan3d Thanks for working on this. Can you add more details about the manual test? Were you able to get it to drop requests manually?

FairCallQueue, if backoff is enabled, rejects requests when the queues are full. If a user has been submitting a lot of requests but the FCQ priority queues are not full, the system still processes these requests, just less frequently.

If I understand your approach correctly, you are rejecting requests based on how many the user has already submitted. If a user has exceeded the maximum number of requests, you start dropping any new ones they submit. This raises the concern that it might drop requests that it shouldn't.
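
To make the two policies being compared concrete, here is a toy sketch (illustrative only, not code from this patch or from Hadoop): a fixed per-user cap rejects a heavy user even when the server is otherwise idle, while a backoff-style check rejects only when no capacity is left.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Toy comparison of the two admission policies discussed above.
    public class AdmissionPolicies {

      private final Map<String, AtomicInteger> perUser = new ConcurrentHashMap<>();
      private final AtomicInteger queued = new AtomicInteger();
      private final int maxRequestsPerUser = 10_000;   // illustrative cap
      private final int totalCapacity = 4 * 1_000;     // e.g. 4 queues x 1000 slots

      // Policy A (per-user cap, as described for this PR): a user is rejected
      // once they pass the cap, even if the server is otherwise idle.
      boolean admitPerUserCap(String user) {
        int count = perUser
            .computeIfAbsent(user, k -> new AtomicInteger())
            .incrementAndGet();
        return count <= maxRequestsPerUser;
      }

      // Policy B (backoff-style, as described for FCQ with backoff enabled):
      // reject only when there is no capacity left, regardless of the user.
      boolean admitWhenNotFull() {
        if (queued.incrementAndGet() > totalCapacity) {
          queued.decrementAndGet();
          return false;   // all queues full -> back off
        }
        return true;
      }
    }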

mohan3d (Contributor, Author) commented Mar 20, 2023

@xBis7 Thanks for reviewing this.
Yes, exactly, you got it right: new requests from a user who has already made too many requests will be rejected at some point. If I understand correctly, with FairCallQueue the new requests will also be rejected at some point (if the queues are full)?

xBis7 (Contributor) commented Mar 20, 2023

@mohan3d FairCallQueue has multiple priority queues (4 by default), and requests are placed into those queues based on their priorities. DecayRpcScheduler calculates the priority of a call, and the call then goes into the queue with the corresponding priority (for instance, the highest-priority call goes into the highest-priority queue).

Your approach simulates the behavior of backoff, but with backoff, all queues must be full in order to reject a request. Check here.

Let's say we have only one user and the maximum number of requests is set to 10000. If the user exceeds that number of requests, your filter will start rejecting all new requests coming from that user, even though it shouldn't, since that user is the only one stressing the system.
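
For reference, a minimal sketch of how FairCallQueue with backoff is typically enabled on a Hadoop RPC server (this is the server-side RPC mechanism being described, not the S3 Gateway's HTTP layer). The port number and values are illustrative; the keys are the standard ipc.<port>.* settings.

    import org.apache.hadoop.conf.Configuration;

    // Sketch: standard ipc.<port>.* keys for FairCallQueue with backoff.
    public final class FcqConfigExample {

      public static Configuration fairCallQueueConf(int port) {
        Configuration conf = new Configuration();
        String prefix = "ipc." + port + ".";
        // Use FairCallQueue instead of the default FIFO call queue.
        conf.set(prefix + "callqueue.impl",
            "org.apache.hadoop.ipc.FairCallQueue");
        // DecayRpcScheduler assigns each call a priority based on the
        // caller's recent request volume.
        conf.set(prefix + "scheduler.impl",
            "org.apache.hadoop.ipc.DecayRpcScheduler");
        // Four priority queues (the default mentioned above).
        conf.setInt(prefix + "scheduler.priority.levels", 4);
        // With backoff enabled, a request is rejected only when it cannot
        // be queued, i.e. the queues are full.
        conf.setBoolean(prefix + "backoff.enable", true);
        return conf;
      }

      private FcqConfigExample() { }
    }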

mohan3d (Contributor, Author) commented Mar 20, 2023

@xBis7 Good example. What should happen to the user in that same example when they are making too many requests, assuming FairCallQueue is being used with backoff enabled?

I think the queues will end up full (if the request rate is high enough) and it will start rejecting requests from this single user?

xBis7 (Contributor) commented Mar 20, 2023

What should happen to the user in that same example when they are making too many requests, assuming FairCallQueue is being used with backoff enabled?

@mohan3d To be honest, I haven't tested such a scenario, but if requests fail with a single user, it means the system itself is failing. I would expect it to slow down processing rather than fail entirely.

Also, I can see that the number of handlers is updated while processing requests, but is the number of requests per user also getting decremented?

mohan3d (Contributor, Author) commented Mar 20, 2023

@xBis7 So the correct approach would be to slow down processing when the system is too busy, instead of rejecting requests? I am trying to understand the expected behavior better; I might be able to update the solution.

Also, I can see that the number of handlers is updated while processing requests, but is the number of requests per user also getting decremented?

Yes, both will be updated the way you mentioned. The requests per user will be reduced by a decay factor after each interval.
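
As a rough illustration of that decay (class names, the factor, and the interval are hypothetical, not the actual throttler code): a scheduled task periodically multiplies each user's counter by a decay factor, so older requests gradually stop counting against the user.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch of the per-user counter decay described above.
    public class DecayingCounters {

      private final ConcurrentHashMap<String, AtomicLong> counts =
          new ConcurrentHashMap<>();
      private final double decayFactor = 0.5;    // illustrative
      private final long decayPeriodMs = 5_000;  // illustrative

      private final ScheduledExecutorService scheduler =
          Executors.newSingleThreadScheduledExecutor();

      public DecayingCounters() {
        // Every interval, shrink each user's count and drop users whose
        // count has decayed to zero.
        scheduler.scheduleAtFixedRate(this::decay,
            decayPeriodMs, decayPeriodMs, TimeUnit.MILLISECONDS);
      }

      public long increment(String user) {
        return counts.computeIfAbsent(user, k -> new AtomicLong())
            .incrementAndGet();
      }

      private void decay() {
        // Not perfectly race-free (an increment can slip in between get and
        // set); kept simple for illustration.
        counts.forEach((user, counter) -> {
          long decayed = (long) (counter.get() * decayFactor);
          if (decayed == 0) {
            counts.remove(user, counter);
          } else {
            counter.set(decayed);
          }
        });
      }
    }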

xBis7 (Contributor) commented Mar 23, 2023

So the correct approach would be to slow down processing when the system is too busy, instead of rejecting requests?

@mohan3d Yes, you could also reject requests, but only at the point where the system is at full capacity. When you have only one user, FCQ doesn't slow them down or drop their requests.

adoroszlai (Contributor) commented

@mohan3d Thanks a lot for the patch. Please let us know if you plan to continue working on this. The tests will need to be migrated to use JUnit5 -- we can do that for you if the PR is not abandoned.
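
For reference, the usual shape of that migration (illustrative only, not the actual diff that was later pushed): JUnit 4 annotations and assertions are swapped for their JUnit 5 (Jupiter) equivalents.

    // JUnit 4 style:  org.junit.Test, org.junit.Before, org.junit.Assert
    // JUnit 5 style:  the Jupiter equivalents below.
    import static org.junit.jupiter.api.Assertions.assertTrue;

    import org.junit.jupiter.api.BeforeEach;
    import org.junit.jupiter.api.Test;

    class ExampleThrottlerTest {         // hypothetical test class name

      @BeforeEach                        // was @Before in JUnit 4
      void setUp() {
        // set up the object under test
      }

      @Test                              // now org.junit.jupiter.api.Test
      void rejectsWhenOverLimit() {
        assertTrue(true);                // placeholder assertion
      }
    }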

mohan3d (Contributor, Author) commented Jan 28, 2024

@adoroszlai Yeah, I would like to continue working on it. Aside from migrating the tests, what else do I need to update to get the PR merged?

adoroszlai (Contributor) commented

Thanks @mohan3d for your continued interest in this.

migrating the tests

Done, pushed to your fork.

what else do I need to update to get the PR merged?

I'm not familiar with FCQ, so I'll defer review to @duongkame, @kerneltime and @xBis7.

ivandika3 (Contributor) left a comment

I have left a minor comment.

Comment on lines 53 to 56
ctx.abortWith(Response
.status(Response.Status.SERVICE_UNAVAILABLE)
.entity("Too many requests")
.build());
ivandika3 (Contributor) commented

Just a minor suggestion: we can wrap this in an OS3Exception corresponding to the SlowDown error code. A new OS3Exception can be defined in S3ErrorTable (https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html).

The AuthorizationFilter#wrapOS3Exception can be reused by moving it to S3Utils.
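
A self-contained sketch of what the rejection would look like with the SlowDown error (code "SlowDown", HTTP 503, message "Reduce your request rate."). In the actual change this would go through S3ErrorTable / OS3Exception and the relocated wrapOS3Exception helper as suggested above, rather than building the XML body by hand; the class and method names below are illustrative.

    import javax.ws.rs.container.ContainerRequestContext;
    import javax.ws.rs.core.MediaType;
    import javax.ws.rs.core.Response;

    // Illustrative only: shows the shape of an S3 "SlowDown" rejection.
    public final class SlowDownResponse {

      static void abortWithSlowDown(ContainerRequestContext ctx, String resource) {
        String body =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<Error>\n"
            + "  <Code>SlowDown</Code>\n"
            + "  <Message>Reduce your request rate.</Message>\n"
            + "  <Resource>" + resource + "</Resource>\n"
            + "</Error>";
        ctx.abortWith(Response
            .status(Response.Status.SERVICE_UNAVAILABLE)
            .type(MediaType.APPLICATION_XML)
            .entity(body)
            .build());
      }

      private SlowDownResponse() { }
    }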

mohan3d (Contributor, Author) commented

Thanks a lot @ivandika3 for your review and the suggested improvement.

mohan3d (Contributor, Author) commented

@ivandika3 I pushed some updates, please let me know if it needs further changes.

ivandika3 (Contributor) commented

Thanks for the update. Looks good.

mohan3d requested a review from ivandika3 on February 29, 2024 at 13:35.