Arroyo is joining Cloudflare to bring stream processing to everyone

Cloud-native stream processing

Transform, filter, aggregate, and join data streams by writing SQL, with sub-second results. Scale from zero to millions of events per second. No ops team required.

Trusted by teams from

Zomato
Flowdesk
Nextroll
Commure

Built by streaming experts from

Lyft
Splunk
Sift
Quantcast

Get started

Arroyo ships as a single, compact binary. Run it locally on macOS or Linux for development, and deploy to production with Docker or Kubernetes.

$ curl -LsSf https://arroyo.dev/install.sh | sh
$ arroyo cluster
Release 0.14.0 · March 26, 2025 · Apache 2.0 License

Arroyo is a new kind of stream processing engine, built from the ground up to make real-time easier than batch.

Analytical SQL that just works

Arroyo was designed from the start so that anyone with SQL experience can build reliable, efficient, and correct streaming pipelines.

Data scientists and engineers can build end-to-end real-time applications... without a separate team of streaming experts.

=> SQL docs
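As an illustrative sketch of what that looks like in practice: a windowed aggregation is just a SQL query. The `nexmark` demo source and its `event_rate` option below are assumptions based on Arroyo's built-in demo connector; check the connector docs for exact names.

```sql
-- Hypothetical demo pipeline; the 'nexmark' connector and its
-- 'event_rate' option are assumptions, shown only to illustrate the shape.
CREATE TABLE nexmark WITH (
    connector = 'nexmark',
    event_rate = '10'
);

-- Count bid events in non-overlapping 10-second windows
SELECT
    count(*) AS bids,
    tumble(interval '10 seconds') AS window
FROM nexmark
WHERE bid IS NOT NULL
GROUP BY window;
```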

Designed for the modern cloud

Your streaming pipelines shouldn't page someone just because Kubernetes decided to reschedule your pods. Arroyo is built to run in modern, elastic cloud environments, from simple container runtimes like Fargate to large, distributed deployments on Kubernetes.

In short: Arroyo is a stateful stream processing engine that behaves like a stateless one.

=> Deployment docs

Scales easily to any workload

Arroyo is for everyone who needs to process data in real-time. Small use cases can run with just a few MB of RAM and a fractional vCPU.

For larger streams, Arroyo can rescale vertically and horizontally to process tens of millions of events per second while maintaining exactly-once semantics.

Incredible performance

Arroyo is fast. Really, really fast. Written in Rust, a high-performance systems language, and built around the Apache Arrow in-memory analytics format, it outperforms similar systems like Apache Flink by 5x or more.

Features

Well connected

Arroyo ships with tons of connectors, making it easy to integrate into your data stack.
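As one hedged example of what a connector definition looks like: results can be written out as files by declaring a sink table with `WITH` options. The `filesystem` connector and the `path`/`format` option names below are assumptions modeled on Arroyo's connector docs, not verified here.

```sql
-- Hypothetical sink definition; the 'filesystem' connector and its
-- 'path'/'format' options are assumptions, shown to illustrate the pattern.
CREATE TABLE tag_counts_sink (
    tag TEXT,
    count BIGINT
) WITH (
    connector = 'filesystem',
    type = 'sink',
    path = 's3://my-bucket/tag-counts',
    format = 'parquet'
);
```

Any query's output can then be routed to the sink with an `INSERT INTO tag_counts_sink SELECT ...` statement, following the same pattern as the Kafka example below.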

Real-time with Arroyo

With Arroyo, you can build streaming pipelines by writing the same analytical SQL queries you are already running in your data warehouse.

Mastodon is a federated microblogging platform, similar to Twitter. This query operates over the stream of all Mastodon posts via its Server-Sent Events API and finds the top 5 hashtags in each 15-minute window.

See the full tutorial.

This query finds potentially fraudulent users by detecting accounts that appear in more than four states within a single day.

CREATE TABLE mastodon (
    value TEXT
) WITH (
    connector = 'sse',
    format = 'raw_string',
    endpoint = 'http://mastodon.arroyo.dev/api/v1/streaming/public',
    events = 'update'
);

CREATE VIEW tags AS (
    SELECT btrim(unnest(tags), '"') as tag FROM (
        SELECT extract_json(value, '$.tags[*].name') AS tags
        FROM mastodon)
);

SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY window
        ORDER BY count DESC) as row_num
    FROM (SELECT count(*) as count,
        tag,
        hop(interval '5 seconds', interval '15 minutes') as window
            FROM tags
        GROUP BY tag, window)) WHERE row_num <= 5;

CREATE TABLE page_views (
    userId INT,
    state TEXT
)  WITH (
    connector = 'kafka',
    type = 'source',
    bootstrap_servers = 'localhost:9092',
    topic = 'page_view_events',
    format = 'json'
);

CREATE TABLE suspicious (
    user_id INT
) WITH (
    connector = 'kafka',
    type = 'sink',
    bootstrap_servers = 'localhost:9092',
    topic = 'suspicious',
    format = 'json'
);

INSERT INTO suspicious
    SELECT "userId" as user_id
        FROM (
            SELECT "userId",
                tumble(interval '1 day') as window,
                COUNT(distinct state) as states
            FROM page_views
            GROUP BY 1, 2)
            WHERE states > 4;