Investigate Merge Requests API Postgres connection limit issue on 50k
After updating to the Nightly Omnibus package 12.6.0-pre `db9a0421981`, one of the API performance tests, List project merge requests, started failing on the 50k environment (1000 RPS):
* Environment: 50k
* Version: 12.7.0-pre `f347c4bd9a4`
* Option: 60s_1000rps
* Date: 2020-01-20
* Run Time: 43m 12.23s (Start: 04:28:43 UTC, End: 05:11:55 UTC)
| NAME | RPS | RPS RESULT | RESPONSE P95 | REQUEST RESULTS | RESULT |
|------|-----|------------|--------------|-----------------|--------|
| api_v4_projects_merge_requests | 1000/s | 904.67/s (>640.00/s) | 1130.69ms | 87.44% (>95%) | Failed |
| api_v4_projects_merge_requests_merge_request | 1000/s | 949.42/s (>800.00/s) | 103.09ms | 99.99% (>95%) | Passed |
After some investigation it was found that PgBouncer was reporting connection limit issues on Postgres:
`ERROR S: login failed: FATAL: remaining connection slots are reserved for non-replication superuser connections`
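That FATAL error is raised by Postgres itself once all non-reserved connection slots (max_connections minus superuser_reserved_connections) are in use. A rough way to confirm where the limit is being hit, assuming the Omnibus-bundled PgBouncer and Postgres (console names and exact commands may differ per setup):

```sql
-- On the PgBouncer node, via the admin console (e.g. gitlab-ctl pgb-console):
SHOW POOLS;    -- cl_waiting > 0 means clients are queuing for a server connection
SHOW SERVERS;  -- server connections PgBouncer currently holds open to Postgres

-- On the Postgres node, via gitlab-psql:
SHOW max_connections;                   -- the configured limit (200 by default)
SELECT count(*) FROM pg_stat_activity;  -- connections actually in use
```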
Our metrics dashboard - https://snapshot.raintank.io/dashboard/snapshot/qqvHou0ZZ1ze1BtIgkNmFpgZvP6SwUTX?orgId=2 - shows that a queue does appear to form in PgBouncer:
It also shows that, notably, Postgres sat at 100% CPU usage throughout (though this is also seen on smaller environments tested at correspondingly smaller throughputs with no issue):
In addition, the OOTB Postgres dashboard can be found here - https://snapshot.raintank.io/dashboard/snapshot/kDWzL82cSDH1YjQyh9HbcGFcrG6gD1py?orgId=2
(based on our limited historical data, this endpoint appears to have always maxed out the Postgres CPU, but not for the whole duration of the test, only towards the end)
When this issue first started happening we hadn't changed any notable config, such as the Postgres connection limit on our environments (the default is 200) or the db_pool settings (we have since changed these by reducing Puma threads and letting db_pool follow the default dynamic setting - gitlab-org/gitlab#32562 (closed) - but the issue was happening both before and after; a sketch of that change is below).
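For reference, a minimal sketch of the kind of change described above in `/etc/gitlab/gitlab.rb` on the Rails nodes - the values here are illustrative, not the ones actually deployed:

```ruby
# /etc/gitlab/gitlab.rb (Rails/API nodes) - illustrative values only
puma['worker_processes'] = 16   # each worker process gets its own DB connection pool
puma['min_threads'] = 4         # fewer threads per worker = fewer connections needed
puma['max_threads'] = 4

# Leave db_pool unset so it follows the default dynamic sizing, which ties the
# pool size to the Puma thread count (gitlab-org/gitlab#32562):
# gitlab_rails['db_pool'] = 10
```

followed by `sudo gitlab-ctl reconfigure` and a Puma restart on each node.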
Through investigation we've found so far that increasing the Postgres connection limit eventually "solves" the problem: raising it to 300 made no difference, but going up to 1000 did. A sketch of that change follows below.
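The connection limit change we experimented with would look roughly like this on the Postgres node (again a sketch; 1000 was the value at which the failures stopped):

```ruby
# /etc/gitlab/gitlab.rb (Postgres node)
postgresql['max_connections'] = 1000   # default is 200; 300 made no difference in our tests
```

followed by `sudo gitlab-ctl reconfigure` and a Postgres restart, since max_connections cannot be changed with a reload.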
The task is to investigate why this is the case and to see whether the increase is expected or not. Changes have been made to this endpoint that have improved performance, but this may be a knock-on effect of those changes.
If this is a real issue, this task should be closed and an issue raised in the main product.