OM 2.0: OM protobuf future #296

Open
bwplotka opened this issue Mar 7, 2025 · 3 comments
@bwplotka
Member

bwplotka commented Mar 7, 2025

Problem Statement

OM proto is not currently adopted (the Prometheus libraries and the main binary are not aware of it).

The Prometheus ecosystem still uses and invests in Prometheus Proto, although there was an attempt to deprecate it in the past (proto3 version). It is currently on the way to being used as the default scrape configuration (it is already the default for native histograms and a bunch of other feature flags).

Given that, it's not clear whether, as part of the OM 2.0 WG, we should continue with OM proto, improve it, or remove it from OM completely and recommend the existing Prometheus proto. Note that this is a separate topic from the OM text format, which is the main focus of OM 2.0.

OM Proto vs Prometheus Proto

The protocols are pretty similar: both use a similar MetricFamily abstraction and have similar gauge, counter, histogram, and summary structures. They do differ a little, though:

OM proto:

Prometheus Proto (proto3)

  • Uses a non-repeated "implicit oneof" defined directly in Metric for each metric value.
  • Every value has to be double.
  • Supports native histogram.
  • Uses the delimited format, which allows sending each metric family in a separate message and so enables streaming parsers.
  • Misses Info and StateSet MetricTypes (both are interpreted as gauges in Prometheus as of now).
  • Has inconsistent timestamps. Some fields use google.protobuf.Timestamp timestamp = 3; // OpenMetrics-style., others use int64 timestamp_ms = 6;. The latter is easier (and faster) to use, but 0 means "not set", which blocks the use of an exact 0 millisecond timestamp (implicitly accepted in many places in Prometheus, e.g. Remote Write).
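
To illustrate the delimited format mentioned in the list above, here is a minimal sketch of length-prefixed framing: each message is preceded by its size as a base-128 varint, so a reader can hand over one metric family at a time without buffering the whole scrape body. The payloads are treated as opaque bytes, and the function names are made up for this illustration; this is not the Prometheus parser.

```python
from typing import BinaryIO, Iterator


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (protobuf style)."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def write_delimited(stream: BinaryIO, payload: bytes) -> None:
    """Write one message as <varint length><payload>."""
    stream.write(encode_varint(len(payload)))
    stream.write(payload)


def read_delimited(stream: BinaryIO) -> Iterator[bytes]:
    """Yield payloads one at a time until the stream ends cleanly."""
    while True:
        length, shift = 0, 0
        while True:
            b = stream.read(1)
            if not b:
                if shift == 0:
                    return  # end of stream exactly on a frame boundary
                raise EOFError("truncated length prefix")
            length |= (b[0] & 0x7F) << shift
            if not b[0] & 0x80:
                break
            shift += 7
        payload = stream.read(length)
        if len(payload) != length:
            raise EOFError("truncated message")
        yield payload
```

In a real exposer each payload would be one serialized MetricFamily; the reader above never needs to know the total number of families up front.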

To sum up, PrometheusProto is closer to what Prometheus implements now, including native histograms. It also unblocks a bit more efficient parsing. On the other hand OM Proto is consistent with OM 1.0 types and makes it a bit easier (?) to send historical samples for the same series. OM proto is also strictly versioned (read below why that's important).

Protobuf versioning

During WG discussions a point was made about protobuf versioning: that it does not need strict minor/patch versioning, as we can make a lot of changes without breaking users or user interaction.

I would argue that, in the world of data-heavy network protocols like OM or Remote Write, that's not practically true. Generally, we need to use the same versioning structure as for the text format.

Examples:

  • Say we add a schemaURL attribute to MetricFamily one day. Adding a field with this new information is not a breaking change. However, without a concrete minor version bump this change won't be well announced. The same applies if our text format makes skipping unknown lines a MUST.
  • The addition of Info and StateSet metric types to Prometheus Proto. One could say it's not a breaking change. Normally adding fields to protobuf is not breaking, and in terms of protocol correctness it's true that it will not crash encoding/decoding. However, such a change is practically, semantically breaking: when an SDK/client upgrades and starts to generate a MetricFamily for e.g. the Info type, it has to decide where to put it: (a) as the new Info type, (b) as the old Gauge type, now deprecated for info metrics, or (c) both. To not break users it would need to be (c), but that's not practically possible for complexity and efficiency reasons (duplicated data sent over the network that does not compress easily, and detecting duplicates on parse).

To sum up, some versioning and content negotiation might be needed for protobuf protocols as well.
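
As a rough sketch of what versioned content negotiation could look like, the snippet below parses an Accept header, picks the highest-priority offer the exposer supports, and then decides how to expose an Info metric based on the negotiated version. The version parameter on the protobuf media type, the version numbers, and the "Info type exists from 1.1" threshold are all assumptions made up for this illustration; none of them is an existing Prometheus or OM convention.

```python
from typing import Dict, List, Optional, Set, Tuple

Offer = Tuple[str, Dict[str, str], float]


def parse_accept(header: str) -> List[Offer]:
    """Parse an Accept header into (media_type, params, q), highest q first."""
    offers: List[Offer] = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        params: Dict[str, str] = {}
        q = 1.0
        for piece in pieces[1:]:
            if "=" not in piece:
                continue
            key, value = piece.split("=", 1)
            key, value = key.strip().lower(), value.strip().strip('"')
            if key == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable weight as "not acceptable"
            else:
                params[key] = value
        offers.append((media_type, params, q))
    # sorted() is stable, so offers with equal q keep their header order.
    return sorted(offers, key=lambda o: o[2], reverse=True)


def negotiate(header: str, supported: Set[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Return the first acceptable (media_type, version) the exposer supports."""
    for media_type, params, q in parse_accept(header):
        candidate = (media_type, params.get("version", ""))
        if q > 0 and candidate in supported:
            return candidate
    return None


def info_metric_type(proto_version: str) -> str:
    """Hypothetical rule: a native Info type exists from proto version 1.1 on."""
    major, minor = (int(x) for x in proto_version.split(".")[:2])
    return "INFO" if (major, minor) >= (1, 1) else "GAUGE"
```

With an agreed version the exposer can pick option (a) or (b) from the Info example above per scrape, instead of having to send both.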

Proposed solution

Implementing Protobuf support efficiently was a big task, and PrometheusProto unblocks streaming and is already adopted. There are also not many differences vs. OM Proto that would motivate the ecosystem to adopt OM proto.

Perhaps the best course of action would be:

  1. Deprecate the OM 1.0 Proto.
  2. Release the OM 2.0 without Protobuf schema.
  3. Release the official versioned spec (1.0/0.1?) document for PrometheusProto (on prometheus.io docs) and iterate on it (e.g. 1.0/1.1/2.0 with OM types at some point and a decision around timestamp 0s). Put the proto in one official place (prometheus/prometheus and buf registry) and remove the gogo parts (doable with the new custom parser now).

Pros:

  • Allowing separate versioning/lifetime for text vs proto (also a downside, maybe consistency is useful).
  • Iterating on the adopted protocol instead of iterating on not used one, risking less adoption in future.
  • No need to reimplement parsers.
  • The most efficient option and we know even existing proto parsing has a lot of overhead (until we fix magic suffixes).
  • Clear state of PrometheusProto.
  • Less work?

Cons:

  • Losing "OM" badge for protobuf protocol, although OM is Prometheus since last year.
  • Inconsistency between OM 1.0 and 2.0.
  • Impacting existing OM 1.0 Proto users (we don't know of any, but there might be some).

Alternatives considered

  • Iterate on OM Proto 1.0 in OM 2.0, deprecate PrometheusProto.

We could add native histograms in OM 2.0. For efficiency we could introduce delimited format. Then we kind of reimplement PrometheusProto though under OM umbrella, which is Prometheus umbrella now. Perhaps not worth it?

Iterating on adopted protocol feels better for the ecosystem too.

  • Develop a completely new OM Proto 2.0 in OM 2.0, deprecate PrometheusProto.

Interesting, but do we have the resources for this? The only benefit I see is the opportunity to rethink the "MetricFamily" concept, which does not exist (and does not make sense) in Prometheus. That would only be a readability improvement, nothing more 🤔

  • Deprecate all proto protocols

At some point that was an intention. However protobuf was useful for experiments (it's the only protocol that has practical native histograms for the last few years) and it's likely to be more efficient once Prometheus switches to complex types and we finalize the gogo/custom generator aspect.

@bwplotka bwplotka self-assigned this Mar 7, 2025
@beorn7
Member

beorn7 commented Mar 8, 2025

A few thoughts and remarks:

First of all, please take #256 into account. It fixes a few things in the existing OM proto spec and adds native histograms and also adds classic float histograms. While we might want to change things in detail, I think the basic ideas in this PR should be considered more or less part of OM already when deciding about OMv2 proto vs. Prometheus proto.

With that said, I think the most important next step is to check for payload differences between OM and classic Prometheus that have nothing to do with the proto version. OMv1 has enforced units in the metric name, and enforced addition of a _total suffix for counters and an _info suffix for infos. This implies that some expositions can be valid in classic Prometheus but invalid in OMv1 (i.e. impossible to express with correct OMv1 – infamously, this includes the standard Go metrics exposed by prometheus/client_golang). Imagine OMv2 has the same or similar restrictions, keeping Prometheus proto alive would mean that we would still have the situation where a scrape succeeds in protobuf but fails when switching to the (OMv2) text representation. In this (imaginary) case, it would probably be preferred to create a protobuf format (of whatever name) that has the exact same restrictions. However, I think the better approach would be to design OMv2 in a way that those restrictions do not exist anymore.
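
To make the "valid in classic Prometheus but invalid in OMv1" case concrete, here is a small sketch of a suffix check. It only covers the two suffix rules named in the paragraph above (_total for counters, _info for infos) and deliberately ignores everything else OMv1 specifies, such as unit handling and _created samples. The function and metric names are hypothetical.

```python
from typing import List

# Suffix rules as described in the comment above; other OMv1 rules are omitted.
REQUIRED_SUFFIX = {"counter": "_total", "info": "_info"}


def omv1_suffix_problems(metric_type: str, sample_name: str) -> List[str]:
    """Return a list of suffix violations for one value sample name."""
    suffix = REQUIRED_SUFFIX.get(metric_type)
    if suffix and not sample_name.endswith(suffix):
        return [
            f"{metric_type} sample {sample_name!r} lacks required suffix {suffix!r}"
        ]
    return []
```

A counter named http_requests (hypothetical) passes a classic Prometheus scrape but is flagged here, which is the kind of mismatch that would make a scrape succeed in protobuf and fail in a restricted text format.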

Once we are at that point, I think both approaches described above would converge towards something very similar, and whether we call it "OMv2.X proto" or "Prometheus vX.Y proto" is mostly cosmetic. With a versioned protobuf protocol, we could even do "formally breaking" changes in the Prometheus proto that we have shied away from. I don't think it would be hard for self-sustained generators and parsers to have parallel implementations. It is a bit harder for code that uses the proto spec as the internal data model, too, like prometheus/client_golang or prometheus/pushgateway, where the current proto spec is present throughout the code. But that shouldn't stop us if getting rid of legacy structures in the proto spec has a tangible benefit.

@beorn7
Member

beorn7 commented Mar 8, 2025

About a few specific features of OM proto vs. Prometheus proto:

  • As said above, Native histogram support in proto spec #256 should be taken into account for this comparison.
  • Double vs. int: It's probably good to have both in the final result. Note that it's not universally done in OM. OM does not support classic float histograms (supported by Prometheus text, but not by OM text; support in Prometheus proto was added recently).
  • I see the OM MetricSet just as an option to have multiple metric families in the same proto message. I would expect that even OM proto exposition uses the delimited format to send multiple proto messages in a single scrape unless they have a reason not to. (This expectation could be clarified in a doc comment.) Note that protobuf implementations might put a limit on the total message size, so large expositions might not even fit into a single message.
  • Native support of Info metrics in Prometheus is being discussed (cf. info function experiment). This signals that full support of Info metrics should be in the protocol, even if we keep calling it "Prometheus proto". There are no similar efforts about StateSet, but I'm not opposed to keeping it in the protocol.
  • Timestamps should really be made consistent. The current inconsistency in Prometheus proto happened because the timestamps added later were following the "modern" approach also taken in OM consistently.
  • We should just use proto3 everywhere in the end.
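
The timestamp point can be sketched as follows: a plain int64 field cannot distinguish "not set" from an exact 0 ms timestamp, while a field with explicit presence can. The two classes below are stand-ins written for this illustration, not generated protobuf code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleImplicit:
    """Mimics a plain int64 timestamp_ms field where 0 doubles as "not set"."""

    value: float
    timestamp_ms: int = 0

    def timestamp(self) -> Optional[int]:
        # 0 is indistinguishable from an unset field, so it is reported as absent.
        return self.timestamp_ms if self.timestamp_ms != 0 else None


@dataclass
class SampleExplicit:
    """Mimics a field with explicit presence (message-typed or optional)."""

    value: float
    timestamp_ms: Optional[int] = None

    def timestamp(self) -> Optional[int]:
        return self.timestamp_ms
```

Whichever representation is chosen, using it for every timestamp field would remove the need for readers to handle both conventions.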

@beorn7
Member

beorn7 commented Mar 8, 2025

I guess the desired technical outcome will be that we have an OMv2 text format with no adoption blockers (neither for classic Prometheus usage nor for "modern" OTel interoperation), where exposers can be negotiated to use protobuf instead in a transparent fashion (i.e. without any change of the exposed metrics).

In that case, I would call that protobuf "OMv2 proto format". This is a non-technical preference, to avoid confusion, increase consistency in terminology, and keep the "brand" of OM.
