Arroyo ships as a single, compact binary. Run it locally on macOS or Linux for development, and deploy to production with Docker or Kubernetes.
$ curl -LsSf https://arroyo.dev/install.sh | sh
$ arroyo cluster
Arroyo is a new kind of stream processing engine, built from the ground up to make real-time easier than batch.
Arroyo was designed from the start so that anyone with SQL experience can build reliable, efficient, and correct streaming pipelines.
Data scientists and engineers can build end-to-end real-time applications... without a separate team of streaming experts.
=> SQL docs
Your streaming pipelines shouldn't page someone just because Kubernetes decided to reschedule your pods. Arroyo is built to run in modern, elastic cloud environments, from simple container runtimes like Fargate to large, distributed deployments on Kubernetes.
In short: Arroyo is a stateful stream processing engine that behaves like a stateless one.
=> Deployment docs
Arroyo is for everyone who needs to process data in real time. Small use cases can run with just a few MBs of RAM and a fractional vCPU.
For larger streams, Arroyo can rescale vertically and horizontally to process tens of millions of events per second while maintaining exactly-once semantics.
Arroyo is fast. Really, really fast. Written in Rust, a high-performance systems language, and built around the Arrow in-memory analytics format, it outperforms similar systems like Apache Flink by 5x or more.
Process data using sliding, tumbling, and session windows with watermark processing to determine when all data for a window has arrived.
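For instance, a minimal tumbling-window sketch, reusing the page_views source defined in the fraud example later on this page:

-- count page views per user in one-minute tumbling windows
SELECT "userId",
       tumble(interval '1 minute') as window,
       count(*) as views
FROM page_views
GROUP BY 1, 2;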
Arroyo SQL covers a full set of streaming joins, including left, outer, inner, and full, which can be windowed or operate over updating data.
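As an example, a sketch of a left join over updating data; the users lookup table and its columns are hypothetical, not defined elsewhere on this page:

-- enrich the page_views stream with a hypothetical users table
SELECT pv."userId", u.plan, pv.state
FROM page_views pv
LEFT JOIN users u ON pv."userId" = u.id;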
Arroyo ships with over 300 SQL window, aggregate, and scalar functions, covering math, arrays, regex, json, and more.
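A small sample, run against the mastodon source defined below:

-- scalar, regex, and json functions on the raw post payload
SELECT upper(btrim(value)) as shouted,
       regexp_match(value, '#\w+') as hashtag,           -- first regex match, if any
       extract_json(value, '$.account.id') as account_id -- json path matches, as an array
FROM mastodon;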
Exactly-once processing means no duplicated or dropped events, even with out-of-order data and machine failures.
Arroyo can natively read and write JSON, Avro, Parquet, and raw text and binary. Custom formats can be implemented with UDFs.
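For instance, a sketch of an Avro-encoded Kafka source; the orders topic and its columns are placeholders:

CREATE TABLE orders (
    order_id TEXT,
    amount DOUBLE
) WITH (
    connector = 'kafka',
    type = 'source',
    bootstrap_servers = 'localhost:9092',
    topic = 'orders',
    format = 'avro'
);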
Extend the built-in SQL by writing Rust user-defined scalar, aggregate, and async functions, with Python coming soon.
Manage connections, develop and test SQL queries, and monitor pipelines from the powerful Arroyo Web UI.
Pipelines can be created, operated, and managed with the REST API, offering declarative orchestration at scale.
Arroyo ships with tons of connectors, making it easy to integrate into your data stack.
With Arroyo, you can build streaming pipelines by writing the same analytical SQL queries you are already running in your data warehouse.
Mastodon is a federated microblogging platform, similar to Twitter. This query operates over the stream of all Mastodon posts via its Server-Sent Events API and finds the top 5 hashtags in each 15-minute window.
See the full tutorial.
This query finds potentially fraudulent users by detecting accounts that appear in more than four states within a single day.
CREATE TABLE mastodon (
value TEXT
) WITH (
connector = 'sse',
format = 'raw_string',
endpoint = 'http://mastodon.arroyo.dev/api/v1/streaming/public',
events = 'update'
);
CREATE VIEW tags AS (
SELECT btrim(unnest(tags), '"') as tag FROM (
SELECT extract_json(value, '$.tags[*].name') AS tags
FROM mastodon)
);
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (
PARTITION BY window
ORDER BY count DESC) as row_num
FROM (SELECT count(*) as count,
tag,
hop(interval '5 seconds', interval '15 minutes') as window
FROM tags
GROUP BY tag, window)) WHERE row_num <= 5;
CREATE TABLE page_views (
userId INT,
state TEXT
) WITH (
connector = 'kafka',
type = 'source',
bootstrap_servers = 'localhost:9092',
topic = 'page_view_events',
format = 'json'
);
CREATE TABLE suspicious (
user_id INT
) WITH (
connector = 'kafka',
type = 'sink',
bootstrap_servers = 'localhost:9092',
topic = 'suspicious',
format = 'json'
);
INSERT INTO suspicious
SELECT "userId" as suspicious_id
FROM (
SELECT "userId",
tumble(interval '1 day') as window,
COUNT(distinct state) as states
FROM page_views
GROUP BY 1, 2)
WHERE states > 4;