
[DocDB] Fix race condition with ConnectionContextWithCallId::calls_being_handled_ #20783


Open
mdbridge opened this issue Jan 25, 2024 · 4 comments
Labels: area/docdb (YugabyteDB core features), kind/bug (This issue is a bug), priority/low (Low priority)



mdbridge commented Jan 25, 2024

Jira Link: DB-9780

Description

[DocDB] Fix race condition with ConnectionContextWithCallId::calls_being_handled_

This variable holds information about RPCs received on the server that are currently being handled:

~/code/yugabyte-db/src/yb/rpc/rpc_with_call_id.h:55:
  // Calls which have been received on the server and are currently
  // being handled.
  std::unordered_map<uint64_t, InboundCall*> calls_being_handled_;

Unfortunately, it is not protected by any lock and can be read and written simultaneously, which is a data race. Worse, this race can crash the TServer with a segmentation fault: when the map is modified out from under a concurrent reader, the reader's iterator can be left pointing at garbage.

We saw a crash where the reading code:

~/code/yugabyte-db/src/yb/rpc/rpc_with_call_id.cc:30:
void ConnectionContextWithCallId::DumpPB(const DumpRunningRpcsRequestPB& req,
                                         RpcConnectionPB* resp) {
  for (const auto &entry : calls_being_handled_) {
    entry.second->DumpPB(req, resp->add_calls_in_flight());
  }
}

died because entry.second was -1 due to the data structure moving to a different part of the heap in order to grow.

In terms of urgency, we suspect the read path only runs when the /rpcz endpoint is hit. We have not seen a crash from two simultaneous writes to this data structure; it is unclear whether writes are serialized somewhere upstream or whether write/write races merely produce incorrect results instead of crashing.

The crash was with 2.16.6, so we probably want to backport the fix at least as far back as 2.18.

This bug was introduced by https://phorge.dev.yugabyte.com/D2353.

UPDATE:

Amit informs me that there is concurrency control here: although it is not documented, this field is supposed to be accessed only from the reactor thread corresponding to the connection containing the RPC calls in question. If true and consistently applied, that should have prevented the crash we saw.

Given that we saw a crash, presumably there is a bug somewhere that violates this concurrency control. It may have been fixed after 2.16.6.

To help narrow down where the problem is: in the crash, both YSQL and CQL calls were ongoing, and the crash occurred on a TServer.

At a minimum, this task should add comments where needed describing the concurrency preconditions. If possible, also add DCHECKs or the like to verify that they are being followed, as in the sketch below.
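
As an illustration of that suggestion, here is a minimal, self-contained sketch of enforcing the precondition at runtime. The names (ReactorBoundCalls, owner_, AssertOnOwnerThread) are illustrative, not the actual YugabyteDB code, and assert stands in for a DCHECK:

#include <cassert>
#include <cstdint>
#include <thread>
#include <unordered_map>

class ReactorBoundCalls {
 public:
  explicit ReactorBoundCalls(std::thread::id owner) : owner_(owner) {}

  // Mirrors the Store(...) path: must run on the owning reactor thread.
  void Store(uint64_t call_id, int* call) {
    AssertOnOwnerThread();
    calls_being_handled_[call_id] = call;
  }

  void Erase(uint64_t call_id) {
    AssertOnOwnerThread();
    calls_being_handled_.erase(call_id);
  }

 private:
  // Stand-in for something like DCHECK(reactor_->IsCurrentThread()):
  // crash loudly in debug builds the moment the single-thread contract
  // is violated, instead of silently corrupting the map.
  void AssertOnOwnerThread() const {
    assert(std::this_thread::get_id() == owner_ &&
           "calls_being_handled_ accessed off the owning reactor thread");
  }

  const std::thread::id owner_;
  std::unordered_map<uint64_t, int*> calls_being_handled_;
};

Since DCHECKs compile away in release builds, a check of this kind costs nothing on the hot path.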

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
mdbridge (Contributor, Author) commented:

Please add a test that catches this problem via TSAN or the like, and verify that it fails before the fix and passes afterwards.
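
For concreteness, a minimal sketch of the kind of test being asked for, making no assumptions about the actual test harness: two threads touch the same unordered_map with no synchronization, which ThreadSanitizer (-fsanitize=thread) flags as a data race, and which can segfault outright once the map rehashes, just like the crash above.

#include <cstdint>
#include <thread>
#include <unordered_map>

int main() {
  std::unordered_map<uint64_t, int> calls_being_handled;

  // Writer: repeated inserts force the map to grow and rehash,
  // moving its storage to a different part of the heap.
  std::thread writer([&] {
    for (uint64_t id = 0; id < 100000; ++id) {
      calls_being_handled[id] = 0;
    }
  });

  // Reader: mimics DumpPB iterating over calls_being_handled_ without a lock.
  std::thread reader([&] {
    long sum = 0;
    for (int i = 0; i < 1000; ++i) {
      for (const auto& entry : calls_being_handled) {
        sum += entry.second;
      }
    }
    (void)sum;
  });

  writer.join();
  reader.join();
  return 0;
}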


amitanandaiyer commented Jan 25, 2024

@mdbridge As I understand it, these calls are only supposed to run on the reactor thread.

  // Dump information about status of this connection context to protobuf.
  virtual void DumpPB(const DumpRunningRpcsRequestPB& req, RpcConnectionPB* resp)
      ON_REACTOR_THREAD = 0;

The reactor thread should be the only thread that accesses these.

  // Takes ownership of call_data content.
  Status HandleCall(const ConnectionPtr& connection, CallData* call_data)
      ON_REACTOR_THREAD override;

HandleCall is what calls Store(...).
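
For background, annotations like ON_REACTOR_THREAD are commonly built on Clang's thread-safety analysis. The following is only a sketch of that general pattern; the macro bodies and the capability token are assumptions for illustration, not YugabyteDB's actual definitions:

#if defined(__clang__)
#define CAPABILITY(x) __attribute__((capability(x)))
#define REQUIRES(...) __attribute__((requires_capability(__VA_ARGS__)))
#else
#define CAPABILITY(x)
#define REQUIRES(...)
#endif

// Zero-size token standing for "we are on the reactor thread".
struct CAPABILITY("reactor_thread") ReactorThreadToken {};
extern ReactorThreadToken reactor_thread_token;

// With -Wthread-safety, Clang warns when an annotated function is called
// from code that has not been shown to hold the capability.
#define ON_REACTOR_THREAD REQUIRES(reactor_thread_token)

void DumpPB() ON_REACTOR_THREAD;  // illustrative declaration

Compile-time checking of this kind complements the runtime DCHECK approach suggested above, but both are only as good as the annotations' coverage.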

amitanandaiyer commented:

If this is indeed the cause, though, the likelihood of running into it will increase drastically once we enable ASH (Active Session History), which relies on the DumpPB mechanism to pull the desired wait states.

That would be the equivalent of curling /rpcz every second or so.

mdbridge (Contributor, Author) commented:

As I posted elsewhere, there is more than one reactor thread in the system.
