
[DocDB] Fix race condition with ConnectionContextWithCallId::calls_being_handled_ #20783


Open
mdbridge opened this issue Jan 25, 2024 · 4 comments
Labels: area/docdb (YugabyteDB core features), kind/bug (This issue is a bug), priority/low (Low priority)



mdbridge commented Jan 25, 2024

Jira Link: DB-9780

Description

[DocDB] Fix race condition with ConnectionContextWithCallId::calls_being_handled_

This variable holds information about RPCs received on the server that are currently being handled:

~/code/yugabyte-db/src/yb/rpc/rpc_with_call_id.h:55:
  // Calls which have been received on the server and are currently
  // being handled.
  std::unordered_map<uint64_t, InboundCall*> calls_being_handled_;

Unfortunately, it is not protected by any lock and can be read and written simultaneously, which is a data race. Worse, this race can crash the TServer with a segmentation fault: when the map is modified out from under a concurrent reader, the reader's iterator can be left pointing at garbage.

We saw a crash where the reading code:

~/code/yugabyte-db/src/yb/rpc/rpc_with_call_id.cc:30:
void ConnectionContextWithCallId::DumpPB(const DumpRunningRpcsRequestPB& req,
                                         RpcConnectionPB* resp) {
  for (const auto &entry : calls_being_handled_) {
    entry.second->DumpPB(req, resp->add_calls_in_flight());
  }
}

died because entry.second was -1 due to the data structure moving to a different part of the heap in order to grow.

In terms of urgency, we suspect the read path only runs when the /rpcz endpoint is hit. We have not seen a crash from two simultaneous writes to this data structure; it is unclear whether writes are serialized somewhere upstream or whether write/write races merely produce incorrect results instead of crashing.

The crash was with 2.16.6, so we probably want to backport the fix at least as far back as 2.18.

This bug was introduced by https://phorge.dev.yugabyte.com/D2353.

UPDATE:

Amit informs me that there is concurrency control here: although it is not documented, this field is supposed to be accessed only from the reactor thread corresponding to the connection containing the RPC calls in question. If true and consistently applied, that should have prevented the crash we saw.

Given that we saw a crash, presumably there is a bug somewhere that violates this concurrency control. It may have been fixed after 2.16.6.

To help narrow down where the problem is: in the crash, both YSQL and CQL calls were ongoing, and the crash occurred on a TServer.

At a minimum, this task should add comments where needed describing the concurrency preconditions. If possible, also add DCHECKs or the like to verify that they are being followed, as in the sketch below.
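
As an illustration of that suggestion, here is a minimal, self-contained sketch of enforcing the precondition at runtime. The names (ReactorBoundCalls, owner_, AssertOnOwnerThread) are illustrative, not the actual YugabyteDB code, and assert stands in for a DCHECK:

#include <cassert>
#include <cstdint>
#include <thread>
#include <unordered_map>

class ReactorBoundCalls {
 public:
  explicit ReactorBoundCalls(std::thread::id owner) : owner_(owner) {}

  // Mirrors the Store(...) path: must run on the owning reactor thread.
  void Store(uint64_t call_id, int* call) {
    AssertOnOwnerThread();
    calls_being_handled_[call_id] = call;
  }

  void Erase(uint64_t call_id) {
    AssertOnOwnerThread();
    calls_being_handled_.erase(call_id);
  }

 private:
  // Stand-in for something like DCHECK(reactor_->IsCurrentThread()):
  // crash loudly in debug builds the moment the single-thread contract
  // is violated, instead of silently corrupting the map.
  void AssertOnOwnerThread() const {
    assert(std::this_thread::get_id() == owner_ &&
           "calls_being_handled_ accessed off the owning reactor thread");
  }

  const std::thread::id owner_;
  std::unordered_map<uint64_t, int*> calls_being_handled_;
};

Since DCHECKs compile away in release builds, a check of this kind costs nothing on the hot path.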

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
mdbridge (Contributor, Author) commented:

Please add a test that catches this problem via TSAN or the like, and verify that it fails before the fix and passes afterwards.
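
For concreteness, a minimal sketch of the kind of test being asked for, making no assumptions about the actual test harness: two threads touch the same unordered_map with no synchronization, which ThreadSanitizer (-fsanitize=thread) flags as a data race, and which can segfault outright once the map rehashes, just like the crash above.

#include <cstdint>
#include <thread>
#include <unordered_map>

int main() {
  std::unordered_map<uint64_t, int> calls_being_handled;

  // Writer: repeated inserts force the map to grow and rehash,
  // moving its storage to a different part of the heap.
  std::thread writer([&] {
    for (uint64_t id = 0; id < 100000; ++id) {
      calls_being_handled[id] = 0;
    }
  });

  // Reader: mimics DumpPB iterating over calls_being_handled_ without a lock.
  std::thread reader([&] {
    long sum = 0;
    for (int i = 0; i < 1000; ++i) {
      for (const auto& entry : calls_being_handled) {
        sum += entry.second;
      }
    }
    (void)sum;
  });

  writer.join();
  reader.join();
  return 0;
}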


amitanandaiyer commented Jan 25, 2024

@mdbridge As I understand it, these calls are only supposed to run on the reactor thread.

  // Dump information about status of this connection context to protobuf.
  virtual void DumpPB(const DumpRunningRpcsRequestPB& req, RpcConnectionPB* resp)
      ON_REACTOR_THREAD = 0;

The reactor thread should be the only thread that accesses these.

  // Takes ownership of call_data content.
  Status HandleCall(const ConnectionPtr& connection, CallData* call_data)
      ON_REACTOR_THREAD override;

HandleCall is what calls Store(...).
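
For background, annotations like ON_REACTOR_THREAD are commonly built on Clang's thread-safety analysis. The following is only a sketch of that general pattern; the macro bodies and the capability token are assumptions for illustration, not YugabyteDB's actual definitions:

#if defined(__clang__)
#define CAPABILITY(x) __attribute__((capability(x)))
#define REQUIRES(...) __attribute__((requires_capability(__VA_ARGS__)))
#else
#define CAPABILITY(x)
#define REQUIRES(...)
#endif

// Zero-size token standing for "we are on the reactor thread".
struct CAPABILITY("reactor_thread") ReactorThreadToken {};
extern ReactorThreadToken reactor_thread_token;

// With -Wthread-safety, Clang warns when an annotated function is called
// from code that has not been shown to hold the capability.
#define ON_REACTOR_THREAD REQUIRES(reactor_thread_token)

void DumpPB() ON_REACTOR_THREAD;  // illustrative declaration

Compile-time checking of this kind complements the runtime DCHECK approach suggested above, but both are only as good as the annotations' coverage.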

amitanandaiyer commented:

If this is indeed the cause, though, the likelihood of running into it will increase drastically once we enable ASH (Active Session History), which relies on the DumpPB mechanism to pull the desired wait states.

That would be the equivalent of curling /rpcz every second or so.

mdbridge (Contributor, Author) commented:

As I posted elsewhere, there is more than one reactor thread in the system.
