𝗦𝗰𝗮𝗹𝗮𝗯𝗹𝗲 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀 𝗣𝗿𝗼𝗷𝗲𝗰𝘁𝘀: 𝗠𝗼𝗻𝗼𝗿𝗲𝗽𝗼 𝗼𝗿 𝗠𝗶𝗰𝗿𝗼-𝗕𝘂𝗻𝗱𝗹𝗲𝘀? 𝗔𝗻 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗮𝗹 𝗖𝗼𝗺𝗽𝗮𝗿𝗶𝘀𝗼𝗻

Hey! 👋

Your first project with Databricks Asset Bundles (DABs) is a success, and the CI/CD pipeline is running. But now the team is growing, new projects are being added, and the big question arises: how do we structure all of this to keep it maintainable, scalable, and efficient?

At the core, there are two opposing architectural patterns: the monorepo and the micro-bundle approach. Choosing the right pattern is one of the most important strategic decisions for your Databricks development.

Pattern 1: The Monorepo (one central bundle for everything)

In this approach, all the code for multiple projects or teams resides in a single large Git repository, often managed by a central databricks.yml file with many targets.

Advantages:
- Easy Code-Sharing: Common libraries and utilities can be imported directly; a change is immediately available to everyone.
- Atomic Commits: Changes to a library and the jobs that use it land in a single, traceable commit.
- Easy Discoverability: All code is in one place, which makes things simple to find.

Disadvantages:
- Slow CI/CD: A small change can trigger a massive pipeline that tests and validates everything.
- Tight Coupling: A failing test in an insignificant project can block deployment for everyone else.
- Complex Access Management: It's difficult to restrict access to individual project folders within the repo.

Ideal for: Smaller teams, tightly coupled projects, or when you want to get started quickly.

Pattern 2: Micro-Bundles (a separate bundle per team/project)

Here, each team or each major, independent data product gets its own Git repository with its own databricks.yml.

Advantages:
- Independent Deployments: Teams can deploy their changes whenever they want without impacting others. This significantly increases velocity.
- Clear Ownership: Responsibility for code, quality, and deployments is clearly assigned to the respective team.
- Focused CI/CD Pipelines: Pipelines are fast, lean, and test only the relevant code.

Disadvantages:
- Shared Code Is More Complex: Common utilities must be versioned as standalone Python packages and published to a package repository.
- Potential Code Duplication: Without clean package management, there's a risk of teams reinventing the wheel.

Ideal for: Larger organizations, multiple independent teams, and Data Mesh-style architectures.

There is no universal answer, but a good rule of thumb is: "Start with a monolith, refactor to microservices."

What strategy do you follow, and why? Share your lessons learned in the comments! 👇

#Databricks #AssetBundles #DevOps #DataEngineering #CICD #DataInsightConsulting
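To make the monorepo pattern concrete, here is a minimal sketch of a central databricks.yml with multiple targets. All names here (the bundle name, the project folders, the workspace hosts) are hypothetical placeholders for illustration, not something from a real project:

```yaml
# Hypothetical monorepo bundle: one databricks.yml at the repo root,
# pulling in resources from several project folders.
bundle:
  name: data-platform-monorepo

include:
  - resources/project_a/*.yml   # jobs/pipelines owned by team A
  - resources/project_b/*.yml   # jobs/pipelines owned by team B

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-dev.example.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://adb-prod.example.azuredatabricks.net
```

The trade-off described above is visible right in this file: every team's resources flow through the same bundle, so `databricks bundle deploy -t prod` validates and ships everything at once.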
With micro-bundles, you can also end up with multiple dependency versions to manage across repos. So I'm with you: start with a monorepo and decide later when you need to break things down.
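To illustrate the dependency-management cost the comment points at: in a micro-bundle, shared code no longer lives next door, so each bundle pins its own version of an internally published package. This is a sketch with made-up names (the job, notebook path, cluster settings, and the `shared-utils` package are all hypothetical):

```yaml
# Hypothetical resource file inside one team's micro-bundle.
resources:
  jobs:
    daily_ingest:
      name: team-a-daily-ingest
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ../src/ingest_notebook
          job_cluster_key: main
          libraries:
            # Shared utilities arrive as a versioned package from an
            # internal index; every bundle pins (and upgrades) its own copy.
            - pypi:
                package: shared-utils==1.4.2
      job_clusters:
        - job_cluster_key: main
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 2
```

The pin is the point: team A can stay on 1.4.2 while team B moves to 2.0, which is exactly the freedom (and the coordination overhead) that micro-bundles buy you.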
👉 http://eepurl.com/jjb5jU