# prometheus-http-exporter

Turn HTTP resources into Prometheus metrics.
> [!IMPORTANT]
> Excessive amounts of requests can lead to your being banned, and are generally regarded as a dick move. Use it responsibly.
## Concepts
The configuration file contains a list of targets. Each target represents one URL (one foreign endpoint) that the exporter will scrape.
Each target contains a set of rules, which transform the URL's response into metrics.
Finally, the metrics are exposed on a configurable port, to be scraped by Prometheus. See the Prometheus Configuration section below for how to set that up.
> [!NOTE]
> Unlike most exporters, this project does not generate fresh metrics when scraped by Prometheus. Instead, each target keeps its own schedule, as defined in the config.
> This is to support both low- and high-frequency metrics in the same exporter.
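For illustration, a config along these lines could mix a high- and a low-frequency target in one exporter (the URLs, metric names, and queries here are made-up placeholders):

```yaml
targets:
  # high-frequency: scraped every 10 seconds
  - name: fast health check
    url: https://api.example.org/health   # placeholder endpoint
    cron: "*/10 * * * * *"
    rules:
      - name: api_active_connections
        extract: .active_connections       # placeholder jq query
  # low-frequency: scraped once a day
  - name: slow daily stats
    url: https://api.example.org/stats    # placeholder endpoint
    cron: "@daily"
    rules:
      - name: daily_signups
        extract: .signups                  # placeholder jq query
```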
## Quickstart
Save the following as `config.yml`:
```yaml
scrape_on_startup: true
targets:
  - name: prometheus repository stats
    url: https://api.github.com/repos/prometheus/prometheus
    cron: "* 0 * * * *"
    rules:
      - name: prometheus_repo_watchers
        extract: .watchers
      - name: prometheus_repo_stars
        extract: .stargazers_count
      - name: prometheus_repo_forks
        extract: .forks
```
Now run the exporter:
```console
$ docker run -v "${PWD}/config.yml:/config.yml" -it ghcr.io/mcofficer/prometheus-http-exporter:latest
```
...and check `0.0.0.0:3000/metrics`:

```console
$ curl http://0.0.0.0:3000/metrics
################### prometheus repository stats ###################
# TYPE prometheus_repo_watchers gauge
prometheus_repo_watchers 59046 1750334436432
# TYPE prometheus_repo_stars gauge
prometheus_repo_stars 59046 1750334436433
# TYPE prometheus_repo_forks gauge
prometheus_repo_forks 9601 1750334436433
```
## Configuration
```yaml
# The address to bind to. Defaults to 0.0.0.0:3000
address: 0.0.0.0:8271
# Scrapes each target while starting up. Useful to test your config; don't use in production.
scrape_on_startup: true
# Log level, "info" by default
log_level: debug
targets:
  - name: crates.io summary
    url: https://crates.io/api/v1/summary
    cron: every 15 minutes
    extractor: jq # jq is the default, so this could be omitted
    headers:
      # crates.io requests that we identify ourselves & provide contact info
      User-Agent: "prometheus_http_exporter/0.1.0 (Hosted by John Doe <jd@example.org>)"
    rules:
      - name: crates_io_crates
        extract: ".num_crates"
```
Read on for an explanation of the most important parts, or see `config.yml` in the repository for a full example.
There is also an auto-generated JSON schema available (`config.schema.json`).
### cron
Specifies when the job should run. Two formats are supported:

- English expressions, such as `every 15 minutes` or `every day`. See the english-to-cron crate for a table of valid patterns.
- Classic cron syntax, e.g. `* */15 * * * *` or `@daily`. The only caveat here is that the seconds field is not optional. Refer to the croner documentation for specifics.
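For example, both targets below are scheduled every 15 minutes, once in each format (the names and URLs are placeholders, and rules are omitted for brevity):

```yaml
targets:
  - name: english expression
    url: https://example.org/a
    cron: every 15 minutes
  - name: classic cron syntax (note the mandatory seconds field)
    url: https://example.org/b
    cron: "0 */15 * * * *"
```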
### Extractor

See the Extractors section below.
### Headers

Custom headers can be included with the `headers` map.
The exporter automatically sets the User-Agent, but you may override it here.
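For example, a target that needs an auth header and a custom User-Agent might look like this (the header values and endpoint are placeholders):

```yaml
targets:
  - name: authenticated endpoint
    url: https://api.example.org/stats   # placeholder endpoint
    cron: every hour
    headers:
      Authorization: "Bearer my-secret-token"         # placeholder token
      User-Agent: "my-exporter/1.0 (me@example.org)"  # overrides the default
    rules:
      - name: example_stat
        extract: .stat
```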
## Extractors
Extractors are the heart of the exporter: they are responsible for turning a response into metrics.
The chosen extractor runs once for each rule, each time producing one (or several) metrics with the rule's name.
The `extract` key on each rule is used to configure the extractor.
Currently, two extractors are supported: jq (the default) and Regex.
### jq
jq describes itself as "a lightweight and flexible command-line JSON processor". In practice, its query language is Turing-complete, and people have even implemented jq in jq.
This project uses jaq, a Rust clone of jq. For 99% of cases, jaq and jq are interchangeable, and you can usually expect queries from the jq playground to work with jaq.
#### Extracting single values
The simplest use case for jq is to extract a single number from a JSON response. For instance, let's look at the Quickstart example, which uses the GitHub API to get information about the prometheus repository.
The GitHub API returns a response like this:
```json
{
  "full_name": "prometheus/prometheus",
  ...
  "stargazers_count": 59046
}
```
... from which the query `.stargazers_count` extracts the value `59046`. (jq Playground)
If the value is a number, a metric is emitted using the rule's name, the value and the timestamp of the extraction:
```
# TYPE prometheus_repo_stars gauge
prometheus_repo_stars 59046 1750334436433
```
#### Extracting multiple values
Sometimes, a response contains more than a few values of interest. Rather than creating a bespoke rule for each, we can have a query that returns several at once. Consider the following JSON response:
```json
{
  "yaks": {
    "shaved": 3,
    "total": 5
  }
}
```
Suppose we are interested in both `shaved` and `total`. We can use the query `.yaks` to return only the object containing both. (jq Playground)
The extractor will emit a metric for each key-value pair in the object (if the value is a number), with the key being preserved as a label:
```
# TYPE yaks gauge
yaks{key="shaved"} 3 1750338779649
yaks{key="total"} 5 1750338779649
```
Similarly, the extractor can also accommodate arrays:
```json
[
  {
    "value": 3,
    "shaved": true
  },
  {
    "value": 2,
    "shaved": false
  }
]
```
In this case, each object containing a numeric `value` is turned into a metric.
The object's other contents (except nested objects and arrays) are attached as labels:
```
# TYPE yaks gauge
yaks{shaved="true"} 3 1750340568904
yaks{shaved="false"} 2 1750340568904
```
> [!WARNING]
> Be careful when indiscriminately ingesting data. While every key/value is strictly sanitized, this does not protect you from large and unnecessary amounts of data. For example, one might end up with labels containing base64-encoded images.
### Regex
Regular expressions have become a programming mainstay, in part due to their flexibility to find things in virtually any human-readable input. However, they can be all but inscrutable to beginners. If you're struggling, try using a regex debugger such as regex101 and maybe have a look at the Learning section of awesome-regex.
This project uses the regex crate, which omits some of the more compute-intensive regex features (most notably, look-arounds). The "Rust" flavor on regex101 uses the same crate. By default, all Regex flags except Unicode support are disabled.
> [!NOTE]
> Regex is far from a perfect format, and scraping content meant for humans is generally frowned upon by website operators. Regex support in this project is intended as a fallback feature; if at all possible, you should prefer an extractor that parses structured data.
#### Example
Let's try to extract the statistics from Steam's About page. Their HTML contains this snippet:
```html
<div class="online_stat_label gamers_online">online</div>
36,426,658 </div>
```
A naive regex might look something like this: (regex101)

```
gamers_online.*?\s*?([\d,]+)
```
So let's plug that into a rule, enable `scrape_on_startup`, and-

```
called `Result::unwrap()` on an `Err` value: Regex matched, but the result could not be parsed as f64: '36,426,658'
```
Oh.
This is unfortunately a common problem with matching numbers made for humans: there are commas that the `f64` parser can't make sense of. It's not the parser's fault; formatting rules vary depending on locale, and making a "best guess" is both difficult and unreliable.
Fortunately, there is a solution. If our regex has multiple matching capture groups, they will first be concatenated, then parsed as a number. (This does not apply to non-capturing groups.) So let's adjust our regex: (regex101)

```
gamers_online.*\s*(\d+),?(\d+),?(\d+)
```
It's ugly, but it gives us 3 groups of comma-separated segments. (If Steam ever goes over a billion players, we'll need to add a fourth.) This finally gives us results:
```
# TYPE steam_players_online gauge
steam_players_online 36426658 1750350123528
```
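Put together, the whole target might look something like this (the URL and schedule are illustrative, and the lowercase `regex` extractor name is assumed here; check `config.schema.json` for the exact accepted values):

```yaml
targets:
  - name: steam stats
    url: https://store.steampowered.com/about/   # page containing the snippet above
    cron: every 30 minutes
    extractor: regex   # assumed selector value for the Regex extractor
    rules:
      - name: steam_players_online
        extract: 'gamers_online.*\s*(\d+),?(\d+),?(\d+)'
```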
#### Rules
The Regex extractor follows these rules:
- Only the first match is processed; all others are discarded.
- If there are no matching capture groups, or no groups at all, parse the entire match.
- If there are only unnamed capture groups, concatenate and then parse them.
- If there are named capture groups, parse the group called `value` and add all others as labels (see the sketch below).
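For instance, a hypothetical rule matching a line like `shaved yaks: 3` could use named groups to emit a labeled metric (the input and group names are made up for illustration):

```yaml
rules:
  - name: yaks
    # 'value' becomes the metric value, 'kind' becomes a label,
    # producing: yaks{kind="shaved"} 3
    extract: '(?P<kind>\w+) yaks: (?P<value>\d+)'
```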
## Prometheus Configuration
Add this to your `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: prometheus-http-exporter
    honor_timestamps: true
    scrape_interval: 5m
    static_configs:
      - targets: [ 0.0.0.0:8271 ]
```
`honor_timestamps` is true by default and may be omitted. Setting it to false is not recommended: in that case, Prometheus gives the metrics a fresh timestamp on every scrape, even if the exporter hasn't updated a metric in hours. That makes for better-looking dashboards, but at the cost of polluting Prometheus with misleading data.
`scrape_interval` may be set as low as you like, but should be at least as low as the interval of your most frequently scheduled target.
Remember to replace `0.0.0.0:8271` with the address defined in the prometheus-http-exporter config.