
Add HA support for the Azure storage backend #8464

Open · sding3 wants to merge 17 commits into main from sding3/add-ha-support-to-azuer-storage-backend

Conversation

@sding3 (Contributor) commented Mar 4, 2020

This implementation borrows heavily from the gcs HA implementation. The CAS mechanism is realized through Azure storage conditional access headers and ETag.

Partially addresses #4576.

@sding3 (Contributor, Author) commented Mar 25, 2020

@calvn, thanks for adding the azure/storage tag. Let me know if there's anything I can do to help move this PR along.

@jefferai, looking at the git commit history, it appears you were the reviewer for the gcs_ha.go commits where HA support for GCS was added. So I'm just curious whether you could review this PR, since it borrows heavily from gcs_ha.go?

@sding3 force-pushed the sding3/add-ha-support-to-azuer-storage-backend branch from 7de4bc8 to 3b61812 on March 12, 2021
@sding3 (Contributor, Author) commented Mar 13, 2021

@calvn - it has been a year, any suggestion on how to get this reviewed and merged? Thanks.

@sding3 force-pushed the sding3/add-ha-support-to-azuer-storage-backend branch from 3b61812 to 287dd05 on March 13, 2021
@mathcantin commented

Any news on this?

```go
	}
	conds.IfMatch = r.etag // CAS mechanism
} else {
	// IfNoneMatch: "*" = perform the operation iff the resource does not exist
```
Review comment:

Just spotted a small typo.

Suggested change:

```diff
-// IfNoneMatch: "*" = perform the operation iff the resource does not exist
+// IfNoneMatch: "*" = perform the operation if the resource does not exist
```

@sding3 (Contributor, Author) replied:

this is short for "if and only if"

```go
blob := l.backend.container.NewBlobURL(l.key)
bProp, err := blob.GetProperties(context.Background(), azblob.BlobAccessConditions{})
if err != nil {
	if bProp != nil && bProp.StatusCode() == http.StatusNotFound {
```
Review comment:

I did some testing with this PR and had some trouble at bootstrapping: the absence of a lock is treated as an error. The Vault instances always yielded messages like lock: attempt lock: write lock: failed to read blob properties for "core/lock":, followed by a 404 error.

It seems that this line is intended to treat 404s differently from other errors, since these 404s are part of normal operation: one Vault instance has to create the lock initially (I imagine).

When doing a bit of debugging, I found that bProp was nil for 404s as well, so this if clause is never satisfied, effectively causing a deadlock with the aforementioned log output.

Unfortunately, I cannot provide a solution as of now.

@sding3 (Contributor, Author) replied Jul 16, 2021:

fixed in 6d55c99

Thanks for testing this. As part of switching from azure-sdk-for-go to azure-storage-blob-go, I neglected to test the change; I fixed this along with something else in 8bb35cf.

@ncabatoff (Collaborator) commented:

@sding3 I just added support for testing in CI using Azurite in Docker (#12057); would you mind merging in main and adding a call to ExerciseHABackend? This won't obviate the need for real testing using Azure, since it's a different implementation, but it would at least give us some protection against regressions and some feedback in CI.

Merging in main will probably also fix those failing tests you're seeing.

sding3 added 3 commits July 16, 2021 21:16
Borrowed heavily from gcs_ha.go, and implemented using analogous
features from Azure blob storage.
@sding3 force-pushed the sding3/add-ha-support-to-azuer-storage-backend branch from 0f5e081 to 7d0f598 on July 17, 2021
I was previously pointing the two HA backends at two different Azurite backends; of course that didn't work. Fixed that by pointing them at the same Azurite backend.
```go
var conds azblob.BlobAccessConditions

// Read the record
r, err := l.get(ctx)
```
Review comment (Collaborator):

This seems like a race - someone could acquire a lock in between our get and our update, and then we'd blow it away. I don't know enough Azure to offer good advice here, but I think you should try to eliminate the get if possible, instead relying solely on the conds to protect the update.

@sding3 (Contributor, Author) replied:

The outcome of the get informs how the access condition should be built. The access condition influences the behavior of the subsequent write.

If the lock file doesn't already exist, the cond will be If-None-Match: *, which means that the file must not exist in order for the subsequent write to succeed.

If the lock file already exists, the cond will be If-Match: {eTag_of_lock_file}, which means that the file must exist and must not have been modified since we read it in order for the subsequent write to succeed.

Collaborator replied:

How about, instead of doing the get, you take note of the ETag in the BlockBlobUploadResponse you get back from the blob.Upload call? Then the renews can make use of that for their IfMatch condition, and if you don't have an ETag (because this is your first writeLock call, i.e. it's not a renewal) you use IfNoneMatch: * as before?

As to the timestamp check below, maybe instead of that, use IfUnmodifiedSince in your update?

These two changes would save you a roundtrip and also eliminate my race concern.

Collaborator replied:

Thinking about it a bit more I don't think my suggestion above quite works. Let me try again.

For the initial case, we want to grab the lock unless there's a lock that has been recently updated. So for that initial case, maybe the only condition needed is IfUnmodifiedSince = (now - ttl). But I don't know how IfUnmodifiedSince behaves if the blob doesn't exist, so maybe we also need If-None-Match: *. But the docs say that these conditions are ANDed together, whereas we really want an OR, and I don't really understand the table they give describing the combined behaviour here: https://docs.microsoft.com/en-us/rest/api/storageservices/specifying-conditional-headers-for-blob-service-operations.

One option in the initial case if we can't find a way to express the desired condition: first try to update with If-None-Match:*, and if that fails attempt it with IfUnmodifiedSince. Or the other way around, whichever.

For the renew case, we want to grab the lock if we still own it. So there the only condition needed is IfMatch = (etag returned by our Upload). We don't really care about the ttl in this case.

@sding3 (Contributor, Author) replied Jul 24, 2021:

Thanks for suggesting these alternate methods.

The current implementation does not have a race condition. Further, it is consistent with the way the gcs backend implements this, and GCS is an analogous storage technology to Azure blob storage.

@ncabatoff (Collaborator) commented:

I'm sorry that this has been ignored for so long. Thank you for hanging in there.

This isn't quite a review yet, just some initial thoughts. I'll have to enlist the help of someone who knows Azure better to validate the parts that actually talk to Azure.

I regret not having gotten around to writing some docs to guide storage backend authors yet, because I recognize the model you used for this, and I prefer a slightly different one. Back when HA support was being added to the PostgreSQL backend, someone submitted a PR much like this one, and I found it a bit hard to understand. I came up with a slightly simpler approach that I think could work here too. The idea is to get rid of the "watch" goroutine, since it's not really necessary: we can use renewal failure as our mechanism to detect when we've lost the lock, assuming the renewal operation is constrained not to update the lock record when someone else has a valid lock (which should already be the case). I also don't really like the held field, since it runs the risk of getting out of sync with the actual lock state.

Anyway, I don't want to impose this model on you. You've already done a lot of hard work here, and been extremely patient with us. But if you're amenable have a look at physical/postgresql and see what you think.

@sding3 (Contributor, Author) commented Jul 24, 2021

@ncabatoff - thanks for pointing me to the Azurite testing setup. It's well done. I like how I could simply run go test -v ./physical/azure -run=TestAzureHABackend and exercise my implementation against a container that emulates the Azure storage API.

As for your other suggestion about implementing HA support in a similar fashion to the PostgreSQL backend: my current inkling is to approach it the same way gcs_ha.go does, because GCS is very similar to Azure blob (it's basically the competing product from Google), so being consistent with that makes more sense. However, I did take a brief look at the PostgreSQL backend's HA implementation, and I find it cleaner than the GCS one. I'll find some time in the next few days to look at it more closely, and maybe I'll push some updates here.
