Streamkap

Data Infrastructure and Analytics

San Francisco, CA 2,121 followers

The streaming alternative to batch ETL. Latest: Apache Iceberg guide -> https://streamkap.com/blog/apache-iceberg-guide

About us

Replace batch ETL with streaming for lower latency and cost.

Website
streamkap.com
Industry
Data Infrastructure and Analytics
Company size
11-50 employees
Headquarters
San Francisco, CA
Type
Privately Held
Founded
2022

Updates

  • 😎 Zero-Ops Data Streaming:
    * Sub-second latency with high-performance CDC: minimal impact on source databases, continuous updates.
    * Streaming pipelines with no-code connectors and automatic schema-drift handling.
    * Python/SQL/JavaScript transformations: mask, hash, join, aggregate, unnest JSON, route by rules, enrich on the fly, even generate embeddings for RAG (see the sketch below).
    * Event-driven without the headaches: use our fully managed Kafka (read/write directly), scale automatically, and keep eyes on everything with built-in monitoring & alerts.
    * Deploy your way: Streamkap Cloud for the fastest time-to-value, or BYOC to run inside your VPC/perimeter.
    Under the hood: built on Kafka, Flink, and Debezium, with enterprise security (SOC 2 Type 2, HIPAA, GDPR/CCPA), IaC/Terraform, environments, and topic visibility.
    Use cases our customers light up fast:
    * Real-time customer 360 & product analytics
    * GenAI that stays fresh (RAG + streaming updates)
    * Fraud/risk scoring, IoT, and ops dashboards
    * Event-driven microservices with durable, replayable streams
    👉 See it in action | Read docs | Start streaming free
    #ZeroOps #DataStreaming #CDC #ApacheKafka #ApacheFlink #EventDriven #RealTimeAnalytics #DataEngineering #GenAI #RAG #DataOps #Security #BYOC
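A minimal sketch of the kind of Python transformation the post describes: masking and hashing fields in flight. The transform(record) entry point and the field names are illustrative assumptions for this sketch, not Streamkap's documented API.

import hashlib

# Illustrative per-record transform: hash an email for privacy-safe joins,
# mask a card number. Hook name and record shape are assumptions.
def transform(record: dict) -> dict:
    out = dict(record)
    if out.get("email"):
        # A deterministic hash keeps the field joinable without exposing the raw value.
        out["email_hash"] = hashlib.sha256(out["email"].lower().encode()).hexdigest()
        del out["email"]
    if out.get("card_number"):
        # Keep only the last four digits visible.
        out["card_number"] = "*" * 12 + out["card_number"][-4:]
    return out

if __name__ == "__main__":
    print(transform({"id": 1, "email": "Ada@example.com", "card_number": "4111111111111111"}))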

  • 🚨 Streamkap Tutorial Series: AWS MySQL → Google BigQuery (Real Time • Warehouse-Native).
    📘 What’s in this guide
    Spin up a sub-second CDC pipeline from AWS RDS MySQL to BigQuery: no batch jobs, no staging ETL. Streamkap captures changes; BigQuery serves analytics.
    💡 You’ll learn how to
    * Tune RDS MySQL for CDC (GTID on, row binlogging, full row image); a quick settings-check sketch follows below.
    * Set up BigQuery (service account, dataset, region).
    * Configure Streamkap connectors (MySQL source → BigQuery destination).
    * Enable snapshot + live tail; verify low-latency ingest.
    🧭 Architecture at a glance
    RDS MySQL binlogs ➜ Streamkap CDC ➜ BigQuery dataset ➜ BI/ML/SQL.
    🔐 Security & ops nuggets
    * Restrict RDS to allowlisted IPs/VPN; use TLS.
    * Grant least-privilege GCP roles scoped to the target dataset.
    * Monitor binlog growth/IOPS; right-size storage and scheduling.
    🧪 Validation (code-free)
    Change a row in MySQL → watch it in Streamkap → confirm it in BigQuery via the metadata fields.
    💸 Cost snapshot
    RDS instance + binlogs • Streamkap CDC • BigQuery storage + queries.
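A minimal pre-flight sketch for the settings the tutorial calls out, assuming the pymysql and google-cloud-bigquery client libraries; the endpoint, credentials, dataset, and table names are placeholders.

import pymysql
from google.cloud import bigquery

MYSQL = dict(host="your-rds-endpoint", user="streamkap_user", password="***", port=3306)
EXPECTED = {"binlog_format": "ROW", "binlog_row_image": "FULL", "gtid_mode": "ON"}

def check_mysql_cdc_settings() -> None:
    # Verify row binlogging, full row image, and GTID mode on the source database.
    conn = pymysql.connect(**MYSQL)
    try:
        with conn.cursor() as cur:
            for name, want in EXPECTED.items():
                cur.execute("SHOW VARIABLES LIKE %s", (name,))
                row = cur.fetchone()
                got = row[1] if row else "<missing>"
                status = "OK" if str(got).upper() == want else f"EXPECTED {want}"
                print(f"{name} = {got} ({status})")
    finally:
        conn.close()

def check_bigquery_target(dataset: str = "analytics", table: str = "orders") -> None:
    # Quick sanity check that rows are landing in the destination table.
    client = bigquery.Client()  # uses application-default credentials
    query = f"SELECT COUNT(*) AS n FROM `{client.project}.{dataset}.{table}`"
    for row in client.query(query).result():
        print(f"{dataset}.{table} row count: {row.n}")

if __name__ == "__main__":
    check_mysql_cdc_settings()
    check_bigquery_target()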

  • ❗️❗️❗️ Why Real-Time Data Matters for Clinical Research. In multinational trials, “tonight’s refresh” is too late. When a site logs a safety event, teams need it on sponsor dashboards in seconds: for patient safety, regulatory readiness, and faster decisions.
    Industry context (why real-time matters):
    * Regulated by design: compliance can’t wait.
    * Studies last years; reviews happen daily, often minute-to-minute.
    * Data entry is truly global, across regions and time zones.
    * Anomalies and safety signals demand immediate action.
    Typical modern stack (and where batch creeps in):
    * NoSQL transactions: DynamoDB for speed and flexibility.
    * Search & filtering: Elasticsearch/OpenSearch for keywords and wildcards.
    * Analytics at scale: Snowflake or ClickHouse for heavy aggregation.
    Most teams glue this together with batch ETL. It works, until it doesn’t.
    Why batch breaks in clinical settings:
    * Latency: safety visibility slips from seconds to hours.
    * Schema drift: evolving NoSQL models break brittle jobs.
    * Global inconsistency: staggered windows erode a single source of truth.
    * Ops overhead: backfills, retries, surprise failures.
    How Streamkap helps:
    * Native DynamoDB → any sink, cross-region (PostgreSQL, warehouses, lakes, queues; pick one). A change-event sketch follows below.
    * Real-time freshness for safety and ops dashboards.
    * Automatic schema evolution: no manual scripts when models change.
    * BYOC.
    Start a trial in minutes; move to prod in days, ahead of your deadlines.
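To make the change-events-not-batch-files point concrete, here is a hedged boto3 sketch that tails one shard of a DynamoDB table's stream and prints each insert/modify/remove. It illustrates the events a CDC pipeline consumes, not Streamkap's implementation; the table name is a placeholder and streams must already be enabled on it.

import time
import boto3

streams = boto3.client("dynamodbstreams")

def tail_first_shard(stream_arn: str) -> None:
    # Read change records from the first shard, newest events first.
    shard = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"][0]
    it = streams.get_shard_iterator(
        StreamArn=stream_arn,
        ShardId=shard["ShardId"],
        ShardIteratorType="LATEST",
    )["ShardIterator"]
    while it:
        resp = streams.get_records(ShardIterator=it, Limit=100)
        for rec in resp["Records"]:
            # eventName is INSERT, MODIFY, or REMOVE; NewImage holds the item after the change.
            print(rec["eventName"], rec["dynamodb"].get("NewImage"))
        it = resp.get("NextShardIterator")
        time.sleep(1)

if __name__ == "__main__":
    table = boto3.client("dynamodb").describe_table(TableName="safety_events")["Table"]
    tail_first_shard(table["LatestStreamArn"])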

  • We start by enabling MongoDB change streams/oplog, then create a least-privilege Streamkap user with a scoped IAM role, set up private networking (IP allowlist / SSH tunnel / TLS), and configure the MongoDB → Apache Iceberg on S3 (Glue catalog) connectors. We demo snapshot + live tail, show how deletes/updates land in Iceberg, and cover maintenance tips (compaction, snapshot retention). You’ll also see Athena/Trino/Spark queries, lag dashboards, and a quick cost breakdown. ▶️ Watch the full tutorial on our YouTube channel, live now.
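For context on the snapshot + live tail step, a minimal pymongo sketch of what watching a collection's change stream looks like; the connection string and names are placeholders, and the actual delivery into Iceberg is handled by the Streamkap connector.

from pymongo import MongoClient

client = MongoClient("mongodb://streamkap_user:***@your-host:27017/?tls=true")
orders = client["shop"]["orders"]

# 1) Snapshot: read the current state once.
print(f"snapshot rows: {orders.count_documents({})}")

# 2) Live tail: stream inserts/updates/deletes as they happen.
with orders.watch(full_document="updateLookup") as stream:
    for change in stream:
        # operationType is insert/update/delete/replace; fullDocument is the post-image.
        print(change["operationType"], change.get("fullDocument"))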

  • Streamkap reposted this

    Paul Dudley

    Co-Founder @ Streamkap

    How can CDC make your RAG rad? If you want LLM features that feel live, you can’t rely on yesterday’s data. Put CDC at the center and keep your vector index current.
    A minimal, production-ready example:
    * Source DB: MongoDB (Postgres/MySQL, etc.; choose your poison)
    * Change stream: Streamkap (snapshot once -> durable CDC, schema-safe evolution, idempotent backfills, retries/DLQs, selective sync)
    * Feature/index store: a vector DB of your choice; let’s pick Redis Vector for this post for fast upserts
    Setup that won’t eat your week:
    * MongoDB: grab the connection string; create a Streamkap user (readAnyDatabase + read on local); create streamkap_signal and grant readWrite on it; allowlist Streamkap IPs if needed.
    * Streamkap: add MongoDB as a Source, pick DB/collections, save. Snapshot starts, then CDC keeps you current.
    * Redis Vector: create the DB, role, and keyspace; point your upsert code at Redis; enjoy live retrieval (an upsert sketch follows below).
    Operator checklist (use on any vendor, including us):
    -> Cost you can forecast from volume
    -> Idempotent backfills (redo history, no dupes)
    -> Schema drift without a 2 a.m. Slack thread
    -> End-to-end observability
    #streamkap #rag #llm #cdc #streaming
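A hedged sketch of that loop: a MongoDB change stream feeding upserts into a Redis vector index so retrieval stays current. In practice Streamkap handles the CDC leg; the connection strings, index schema, and embed() function here are placeholders.

import numpy as np
import redis
from pymongo import MongoClient
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

DIM = 384
r = redis.Redis(host="localhost", port=6379)

def ensure_index() -> None:
    # Create the vector index once; FT.INFO raises if it does not exist yet.
    try:
        r.ft("docs_idx").info()
    except redis.ResponseError:
        r.ft("docs_idx").create_index(
            fields=[
                TextField("text"),
                VectorField("embedding", "FLAT",
                            {"TYPE": "FLOAT32", "DIM": DIM, "DISTANCE_METRIC": "COSINE"}),
            ],
            definition=IndexDefinition(prefix=["doc:"], index_type=IndexType.HASH),
        )

def embed(text: str) -> np.ndarray:
    # Placeholder embedding; swap in your real model call.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(DIM, dtype=np.float32)

def upsert(doc_id: str, text: str) -> None:
    # Idempotent by key: re-delivering the same change overwrites the same hash.
    r.hset(f"doc:{doc_id}", mapping={"text": text, "embedding": embed(text).tobytes()})

if __name__ == "__main__":
    ensure_index()
    coll = MongoClient("mongodb://localhost:27017")["kb"]["articles"]
    with coll.watch(full_document="updateLookup") as stream:
        for change in stream:
            if change["operationType"] in ("insert", "update", "replace"):
                doc = change["fullDocument"]
                upsert(str(doc["_id"]), doc.get("body", ""))
            elif change["operationType"] == "delete":
                r.delete(f"doc:{change['documentKey']['_id']}")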

  • 🚀 MongoDB -> Apache Iceberg on S3 (real-time, lakehouse-native) with Streamkap.
    Build: near-real-time CDC from MongoDB to Iceberg on S3 + Glue. No batch windows. No staging tables.
    Setup must-haves
    • Enable MongoDB change streams/oplog; set retention to 12–24h.
    • Add a stable business key (besides _id) for joins/dedup.
    • Create least-privilege access: one S3 prefix + one Glue DB.
    • Use TLS, IP allowlisting (or PrivateLink), and SSE-KMS at rest.
    Connector config
    • Snapshot + live tail from MongoDB.
    • Sink to Iceberg v2 on S3/Glue.
    • Upsert semantics with op metadata (insert/update/delete).
    • Partition by access patterns (e.g., order_date, tenant_id).
    Performance hygiene (a maintenance sketch follows below)
    • Target file size 256 MB (128–512 MB OK).
    • Commit interval 1–5 min per table.
    • Compact every 1–6h (quiet hours); expire snapshots daily, keep 48–72h.
    • Enable merge-on-read; schedule manifest rewrites nightly.
    Validate (no code)
    • Make an insert/update/delete in MongoDB.
    • Watch Streamkap: snapshot → live.
    • Query via Athena/Trino/Spark; confirm latest state + history/time travel.
    Cost controls
    • Tight S3 prefixes; lifecycle cold data after 30–90 days.
    • Prune partitions; avoid tiny files (tune commit + compaction).
    • No separate ingestion cluster; pay for CDC + storage only.
    Architecture
    • MongoDB change events → Streamkap CDC → S3 (Iceberg data) + Glue (catalog) → query with Athena/Trino/Spark.
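A hedged PySpark sketch of the maintenance items above, using Apache Iceberg's Spark procedures (rewrite_data_files, expire_snapshots, rewrite_manifests). It assumes a Spark session built with the Iceberg runtime and AWS bundle on the classpath and a Glue catalog registered as "glue"; the warehouse path, table name, and retention window are placeholders.

from datetime import datetime, timedelta

from pyspark.sql import SparkSession

CATALOG = "glue"
TABLE = "analytics.orders"  # placeholder database.table inside the Glue catalog

spark = (
    SparkSession.builder.appName("iceberg-maintenance")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config(f"spark.sql.catalog.{CATALOG}", "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{CATALOG}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config(f"spark.sql.catalog.{CATALOG}.warehouse", "s3://your-bucket/iceberg/")
    .getOrCreate()
)

# Compaction: merge small CDC commit files toward the ~256 MB target.
spark.sql(f"""
    CALL {CATALOG}.system.rewrite_data_files(
        table => '{TABLE}',
        options => map('target-file-size-bytes', '268435456')
    )
""")

# Snapshot expiry: keep roughly the last 72h of history for time travel.
cutoff = (datetime.utcnow() - timedelta(hours=72)).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"""
    CALL {CATALOG}.system.expire_snapshots(
        table => '{TABLE}',
        older_than => TIMESTAMP '{cutoff}',
        retain_last => 10
    )
""")

# Manifest rewrite: nightly metadata cleanup for faster query planning.
spark.sql(f"CALL {CATALOG}.system.rewrite_manifests('{TABLE}')")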


Funding

Streamkap: 2 total rounds
Last round: Seed, US$2.7M
