5/8/24, 2:30 PM (7) Open source tools for Data Engineering | LinkedIn

Open source tools for Data Engineering

Midhun Pottammal
Data Engineer and Full Stack Expert | Hadoop, Spark, Kafka, Python, and NoSQL (Hive, Hbas…

February 14, 2024

Data Integration

1. Apache NiFi: A powerful and easy-to-use tool for moving data between systems.

2. Airbyte: An open-source data integration platform that helps you replicate your data in your warehouses, lakes, and databases.

3. Meltano: An open-source data integration tool that simplifies the process of extracting, loading, and transforming data.

4. Apache InLong: A platform for real-time data ingestion and complex event processing.

5. Apache SeaTunnel: A data transfer tool for efficiently moving large volumes of data.

Storage

1. HDFS: The Hadoop Distributed File System, designed for storing large files across multiple machines.

2. Apache Ozone: A scalable, redundant, and distributed object store for Hadoop.

3. Ceph: A distributed object, block, and file storage platform.

https://www.linkedin.com/pulse/open-source-tools-data-engineering-midhun-pottammal-cxvtf/ 1/5
4. MinIO: A high-performance, distributed object storage server.

Data Lake Platform

1. Apache Hudi: A data lake solution for managing large analytical datasets.

2. Apache Iceberg: A table format for storing huge, slow-moving tabular data.

3. Delta: An open-source storage layer that brings ACID transactions to Apache Spark.

4. Paimon: A data lake platform for managing and analyzing data at scale.

Event Processing

1. Kafka: A distributed event streaming platform capable of handling trillions of events a day.

2. Redpanda: A Kafka-compatible event streaming platform with a focus on performance and scalability.

3. Pulsar: A cloud-native, distributed messaging and streaming platform.
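
The platforms above share a common core idea: producers append events to a log, and each consumer group tracks its own read position (offset) in that log. The sketch below illustrates that idea only, using an in-memory list. It is not the client API of Kafka, Redpanda, or Pulsar, and the TopicLog class and its method names are hypothetical.

```python
from collections import defaultdict


class TopicLog:
    """Toy single-partition, append-only event log (hypothetical, in-memory)."""

    def __init__(self):
        self._events = []                 # events are only ever appended
        self._offsets = defaultdict(int)  # consumer group -> next offset to read

    def produce(self, event):
        """Append an event and return the offset it was stored at."""
        self._events.append(event)
        return len(self._events) - 1

    def consume(self, group, max_events=10):
        """Return the next batch for a consumer group and advance its offset."""
        start = self._offsets[group]
        batch = self._events[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for i in range(5):
    log.produce({"order_id": i})

# Two reads by the same group continue where the previous one stopped.
first_batch = log.consume("billing", max_events=3)
second_batch = log.consume("billing", max_events=3)

# A different group has its own offset and sees the log from the start.
audit_batch = log.consume("audit", max_events=10)
```

Because events are never removed on read, any number of independent consumer groups can replay the same stream at their own pace.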

Data Processing & Computation

1. Apache Spark: An open-source, distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

2. Apache Flink: A framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

3. Vaex: A Python library for lazy, out-of-core DataFrames.

4. Ray: A fast and simple framework for building and running distributed applications.

5. Dask: A flexible parallel computing library for analytic computing.

6. Polars: A blazingly fast DataFrame library implemented in Rust and using Apache Arrow.
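
Several of the engines above rely on the same pattern for data parallelism: split the data into partitions, run the same function on each partition, then merge the partial results. The sketch below shows the shape of that pattern using only the Python standard library. It does not use the API of Spark, Dask, or Ray, all function names are hypothetical, and Python threads are used purely for illustration (they do not give true CPU parallelism).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split records into n_parts slices (round-robin)."""
    return [records[i::n_parts] for i in range(n_parts)]


def count_words(lines):
    """Per-partition work: count words in one slice of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, n_parts=4):
    """Run count_words on each partition, then merge the partial results."""
    parts = partition(lines, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_words, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = ["spark flink dask", "ray dask polars", "spark spark vaex"]
totals = parallel_word_count(lines, n_parts=2)
```

The merge step works because word counts from separate partitions can be added in any order; aggregations without that property need a different combine strategy.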

Database

OLTP:
SQL: RDBMS (MySQL, Postgres), In-Memory (Apache Ignite)
NoSQL: KV (Aerospike), Document (MongoDB), Graph (Neo4j), Multimodel (ArangoDB)

HTAP:
NewSQL: StoneDB, TiDB

OLAP:
Offline: Columnar (Databend), Time Series (TimescaleDB)
Realtime: Realtime OLAP (Druid, Pinot, ClickHouse, StarRocks), Search Engine, Streaming Database (Materialize, RisingWave)

Visualization

1. Superset: A modern, enterprise-ready business intelligence web application.

2. Rath

3. Redash: A visualization and dashboarding tool.

4. Metabase: An easy way to generate charts and dashboards, ask simple ad hoc queries without using SQL, and see detailed information about rows in your database.

Data Infrastructure

Kubernetes: An open-source container orchestration platform.

Ambari: A software project designed to enable system administrators to provision, manage, and monitor a Hadoop cluster.

Workflow Management & DataOps

1. Airflow: A platform to programmatically author, schedule, and monitor workflows.

2. Dagster: A data orchestrator for machine learning, analytics, and ETL.

3. Kestra: A workflow orchestrator for data pipeline management.

4. Temporal: An open-source, stateful microservices orchestration platform.

5. Mage: A workflow engine for orchestrating data pipelines.

6. Windmill: A platform for building and running data pipelines.

7. DolphinScheduler: A distributed, easily extensible visual DAG workflow scheduler, built to handle complex dependencies in data processing and to work out of the box for data pipelines.
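
The orchestrators above generally model a pipeline as a directed acyclic graph (DAG) and run a task only once everything it depends on has finished. The sketch below shows that scheduling rule with Python's standard-library graphlib module (Python 3.9+). It is not the API of Airflow, Dagster, or DolphinScheduler; the task names and the run_pipeline helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join": {"extract_orders", "extract_customers"},
    "aggregate": {"join"},
    "publish_report": {"aggregate"},
}


def run_pipeline(dag, run_task):
    """Run every task after its dependencies; return the execution order."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    executed = []
    while sorter.is_active():
        # Tasks whose dependencies are all done; sorted for a stable order.
        for task in sorted(sorter.get_ready()):
            run_task(task)
            executed.append(task)
            sorter.done(task)
    return executed


order = run_pipeline(dag, run_task=lambda name: None)
```

Tasks returned together by get_ready() have no dependencies on each other, which is what lets a real scheduler hand them to separate workers at the same time.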

Monitoring

Prometheus + Mimir & Grafana + Loki

EFK (Elasticsearch, Fluentd, Kibana)

Metadata Management

1. DataHub: An open-source metadata platform for the modern data stack.

2. Amundsen: A data discovery and metadata platform.

3. Marquez: An open-source metadata service for the collection, aggregation, and visualization of a data ecosystem's metadata.

🌟 Exciting Tools in the World of Data Engineering! 🌟

#DataEngineering #OpenSource #TechTrends #DataIntegration #Storage #DataLake #DataProcessing #Database #EventProcessing #Visualization #DataInfrastructure #WorkflowManagement #DataOps #Monitoring #MetadataManagement

Midhun Pottammal
Data Engineer and Full Stack Expert | Hadoop, Spark, Kafka, Python, and NoSQL (Hive,
Hbase, Iceberg) | Specialised in Informatica, Nifi, Cloudera CDP, and Databricks
