From the course: DataOps with Apache Iceberg using Spark, Nessie, and Dremio

DataOps tooling

So now let's talk about DataOps tooling. We've learned what DataOps is, so what are the specific tools used to accomplish the goals of DataOps that we talked about?

Version control: Iceberg and Nessie are both tools for this. As you'll see later on, Nessie lets us capture versions not just of a single Iceberg table but of your entire catalog, while Iceberg tracks versions of the individual tables themselves. I'd also add Git here, because if you're using a tool like dbt, which I'll mention a little later, you can version the code you write for dbt using Git, which lets you version your modeling. We'll talk more about version control and the idea that you can track multiple versions, so you can see how things looked in the past and even roll back if something goes wrong.

CI/CD: here we'll talk about GitHub Actions. GitHub Actions is something we can tie to our Git repo, so that when we make changes to our code, we can trigger the automations or CI/CD pipelines we want. For example, someone makes a change to your dbt code and opens a pull request. What we'd like to do is trigger a workflow that validates those changes and makes sure they don't break anything. Then, when that pull request gets accepted, maybe trigger a run of that code. The idea is that you want things to happen without someone having to manually kick off the whole process, and GitHub Actions is a great tool for that, as we'll discuss.

Automated testing: you can use tools like dbt. dbt not only orchestrates your different SQL workloads but also has some testing features.
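To make the idea of automated data testing concrete, here is a minimal sketch in plain Python. It does not use dbt or any other library's API; the rule helpers (`not_null`, `unique`), the `run_rules` function, and the sample `orders` rows are all hypothetical names invented for illustration. It only shows the general shape: declare rules, run them against the data, and collect the failures.

```python
from typing import Callable

Row = dict[str, object]
# A rule takes all rows and returns a list of human-readable failure messages.
Rule = Callable[[list[Row]], list[str]]


def not_null(column: str) -> Rule:
    """Hypothetical rule: every row must have a non-null value in `column`."""
    def check(rows: list[Row]) -> list[str]:
        return [
            f"row {i}: {column} is null"
            for i, row in enumerate(rows)
            if row.get(column) is None
        ]
    return check


def unique(column: str) -> Rule:
    """Hypothetical rule: values in `column` must not repeat."""
    def check(rows: list[Row]) -> list[str]:
        seen = set()
        failures = []
        for i, row in enumerate(rows):
            value = row.get(column)
            if value in seen:
                failures.append(f"row {i}: duplicate {column}={value!r}")
            seen.add(value)
        return failures
    return check


def run_rules(rows: list[Row], rules: list[Rule]) -> list[str]:
    """Run every rule and gather all failures in rule order."""
    failures: list[str] = []
    for rule in rules:
        failures.extend(rule(rows))
    return failures


# Made-up sample data with one null amount and one repeated order_id.
orders = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 15.5},
]

failures = run_rules(orders, [not_null("amount"), unique("order_id")])
for message in failures:
    print(message)
```

In a CI/CD setup, a check like this would typically run on each pull request and fail the workflow when `failures` is non-empty.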
There are libraries like Great Expectations that let you create rules that your data gets tested against.

Monitoring: tools like Monte Carlo let you establish the things you're watching for in your data to make sure it's correct, and trigger alerts when those rules aren't adhered to.

Containerization: we'll talk specifically about Docker in this course and how we can containerize individual applications into individual containers. You can then orchestrate lots of containers in a production deployment using a tool like Kubernetes. The way I always like to think about it is that with Docker, I can create containers, which are like albums in a jukebox, and Kubernetes is the jukebox that automates switching and moving those containers around.

And then there's orchestration. This is going to be a tool like Airflow, which lets you orchestrate multiple tasks across any tool that you can connect to, particularly from Python. And there's dbt, a library that lets you orchestrate SQL being sent to a particular tool, whether that tool is Spark, Dremio, or something else. You can say, "Hey, I have all this SQL, and it needs to run in this order," and express that order declaratively. Airflow, on the other hand, will say, "Hey, you know what? I need to do this job in Spark, and after that job in Spark is done, I need to do this in Dremio, and then I need to do this somewhere else." So each tool has its role in orchestrating the different layers of your data architecture.

These are just some of the many tools in the DataOps world. We'll be going over a few of them in this course, but there's a lot to learn in this space. And, again, this course is serving as an introduction, a tiptoe into the pool of the world of DataOps.
That way, you can understand the environment, get hands-on, and then go learn more from there. So with that, I'll see you in the next video.
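As a companion to the orchestration discussion above, the sketch below shows the core idea of declaring task order and letting a scheduler work out the sequence. It uses only Python's standard-library `graphlib` and is not Airflow's or dbt's API; the task names (`spark_ingest`, `dremio_transform`, `publish_report`) and the `run_pipeline` helper are hypothetical and stand in for real jobs.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
# The names are made up to mirror "do this in Spark, then this in Dremio,
# then this somewhere else".
dependencies = {
    "spark_ingest": set(),
    "dremio_transform": {"spark_ingest"},
    "publish_report": {"dremio_transform"},
}


def run_pipeline(dependencies, actions):
    """Run each task's action in an order that respects its dependencies."""
    completed = []
    for task in TopologicalSorter(dependencies).static_order():
        actions[task]()  # a real orchestrator would submit a job here
        completed.append(task)
    return completed


# Stand-in actions that only record that they ran.
log = []
actions = {
    name: (lambda name=name: log.append(f"ran {name}"))
    for name in dependencies
}

order = run_pipeline(dependencies, actions)
print(order)
```

The point of the declarative form is that you state only which task depends on which, and the ordering is derived from that. Production orchestrators add scheduling, retries, and parallelism on top of this same idea.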
