
Big Data on Kubernetes

A practical guide to building efficient and scalable data solutions

Neylson Crepalde

    Copyright © 2024 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Apeksha Shetty

    Publishing Product Manager: Apeksha Shetty

    Book Project Manager: Aparna Ravikumar Nair

    Senior Editor: Sushma Reddy

    Technical Editor: Kavyashree K S

    Copy Editor: Safis Editing

    Proofreader: Sushma Reddy

    Indexer: Subalakshmi Govindhan

    Production Designer: Gokul Raj S T

    DevRel Marketing Executive: Nivedita Singh

    First published: July 2024

    Production reference: 1210624

    Published by Packt Publishing Ltd.

    Grosvenor House

    11 St Paul’s Square

    Birmingham

    B3 1RB, UK

    ISBN: 978-1-83546-214-0

    www.packtpub.com

    To my wife, Sarah, and my son, Joao Pedro, for their love and support. To Silvio Salej Higgins for being a great mentor.

    – Neylson Crepalde

    Contributors

    About the author

Neylson Crepalde is a generative AI strategist at Amazon Web Services (AWS). Before this, Neylson was chief technology officer at A3Data, a consulting business focused on data, analytics, and artificial intelligence. In his time as CTO, he worked with the company’s tech team to build a Big Data architecture on top of Kubernetes that inspired the writing of this book. Neylson holds a PhD in economic sociology, and he was a visiting scholar at the Centre de Sociologie des Organisations at Sciences Po, Paris. Neylson is also a frequent guest speaker at conferences and has taught in MBA programs for more than 10 years.

    I want to thank all the people who have worked with me in the development of this great architecture, especially Mayla Teixeira and Marcus Oliveira for their outstanding contributions.

    About the reviewer

Thariq Mahmood has 16 years of experience in data technology, with a strong skill set in Kubernetes, big data, data engineering, and DevOps across public cloud, private cloud, and on-premises environments. He has expertise in data warehousing, data modeling, and data security. He actively contributes to open source projects and has set up batch and streaming pipelines for various production environments using Databricks, Hadoop, Spark, Flink, and other cloud-native tools from AWS, Azure, and GCP. He has also implemented MLOps and DevSecOps in numerous projects. He currently helps organizations optimize their big data infrastructure costs and implement data lake and one-lake architectures on Kubernetes.

    Table of Contents

    Preface

    Part 1: Docker and Kubernetes

Chapter 1: Getting Started with Containers

    Technical requirements

    Container architecture

    Installing Docker

    Windows

    macOS

    Linux

    Getting started with Docker images

    hello-world

    NGINX

    Julia

    Building your own image

    Batch processing job

    API service

    Summary

Chapter 2: Kubernetes Architecture

    Technical requirements

    Kubernetes architecture

    Control plane

    Node components

    Pods

    Deployments

    StatefulSets

    Jobs

    Services

    ClusterIP Service

    NodePort Service

    LoadBalancer Service

    Ingress and Ingress Controller

    Gateway

    Persistent Volumes

    StorageClasses

    ConfigMaps and Secrets

    ConfigMaps

    Secrets

    Summary

Chapter 3: Getting Hands-On with Kubernetes

    Technical requirements

    Installing kubectl

Deploying a local cluster using kind

    Installing kind

    Deploying the cluster

    Deploying an AWS EKS cluster

    Deploying a Google Cloud GKE cluster

    Deploying an Azure AKS cluster

    Running your API on Kubernetes

    Creating the deployment

    Creating a service

    Using an ingress to access the API

    Running a data processing job in Kubernetes

    Summary

    Part 2: Big Data Stack

Chapter 4: The Modern Data Stack

    Data architectures

    The Lambda architecture

    The Kappa architecture

    Comparing Lambda and Kappa

    Data lake design for big data

    Data warehouses

    The rise of big data and data lakes

    The rise of the data lakehouse

    Implementing the lakehouse architecture

    Batch ingestion

    Storage

    Batch processing

    Orchestration

    Batch serving

    Data visualization

    Real-time ingestion

    Real-time processing

    Real-time serving

    Real-time data visualization

    Summary

Chapter 5: Big Data Processing with Apache Spark

    Technical requirements

    Getting started with Spark

    Installing Spark locally

    Spark architecture

    Spark executors

    Components of execution

    Starting a Spark program

    The DataFrame API and the Spark SQL API

    Transformations

    Actions

    Lazy evaluation

    Data partitioning

    Narrow versus wide transformations

Analyzing the Titanic dataset

    Working with real data

    How Spark performs joins

    Joining IMDb tables

    Summary

Chapter 6: Building Pipelines with Apache Airflow

    Technical requirements

    Getting started with Airflow

    Installing Airflow with Astro

    Airflow architecture

    Airflow’s distributed architecture

    Building a data pipeline

    Airflow integration with other tools

    Summary

Chapter 7: Apache Kafka for Real-Time Events and Data Ingestion

    Technical requirements

    Getting started with Kafka

    Exploring the Kafka architecture

    The PubSub design

    How Kafka delivers exactly-once semantics

    First producer and consumer

    Streaming from a database with Kafka Connect

    Real-time data processing with Kafka and Spark

    Summary

    Part 3: Connecting It All Together

Chapter 8: Deploying the Big Data Stack on Kubernetes

    Technical requirements

    Deploying Spark on Kubernetes

    Deploying Airflow on Kubernetes

    Deploying Kafka on Kubernetes

    Summary

Chapter 9: Data Consumption Layer

    Technical requirements

    Getting started with SQL query engines

    The limitations of traditional data warehouses

    The rise of SQL query engines

    The architecture of SQL query engines

    Deploying Trino in Kubernetes

    Connecting DBeaver with Trino

    Deploying Elasticsearch in Kubernetes

How Elasticsearch stores, indexes, and manages data

    Elasticsearch deployment

    Summary

Chapter 10: Building a Big Data Pipeline on Kubernetes

    Technical requirements

    Checking the deployed tools

    Building a batch pipeline

    Building the Airflow DAG

    Creating SparkApplication jobs

    Creating a Glue crawler

    Building a real-time pipeline

    Deploying Kafka Connect and Elasticsearch

    Real-time processing with Spark

    Deploying the Elasticsearch sink connector

    Summary

Chapter 11: Generative AI on Kubernetes

    Technical requirements

    What generative AI is and what it is not

    The power of large neural networks

    Challenges and limitations

    Using Amazon Bedrock to work with foundational models

    Building a generative AI application on Kubernetes

    Deploying the Streamlit app

    Building RAG with Knowledge Bases for Amazon Bedrock

    Adjusting the code for RAG retrieval

    Building action models with agents

    Creating a DynamoDB table

    Configuring the agent

    Deploying the application on Kubernetes

    Summary

Chapter 12: Where to Go from Here

    Important topics for big data in Kubernetes

    Kubernetes monitoring and application monitoring

    Building a service mesh

    Security considerations

    Automated scalability

    GitOps and CI/CD for Kubernetes

    Kubernetes cost control

    What about team skills?

    Key skills for monitoring

    Building a service mesh

    Security considerations

    Automated scalability

    Skills for GitOps and CI/CD

    Cost control skills

    Summary

    Index

    Other Books You May Enjoy

    Preface

    In today’s data-driven world, the ability to process and analyze vast amounts of data has become a critical competitive advantage for businesses across industries. Big data technologies have emerged as powerful tools to handle the ever-increasing volume, velocity, and variety of data, enabling organizations to extract valuable insights and drive informed decision-making. However, managing and scaling these technologies can be a daunting task, often requiring significant infrastructure and operational overhead.

    Enter Kubernetes, the open source container orchestration platform that has revolutionized the way we deploy and manage applications. By providing a standardized and automated approach to container management, Kubernetes has simplified the deployment and scaling of complex applications, including big data workloads. This book aims to bridge the gap between these two powerful technologies, guiding you through the process of implementing a robust and scalable big data architecture on Kubernetes.

    Throughout the chapters, you will embark on a comprehensive journey, starting with the fundamentals of containers and Kubernetes architecture. You will learn how to build and deploy Docker images, understand the core components of Kubernetes, and gain hands-on experience in setting up local and cloud-based Kubernetes clusters. This solid foundation will prepare you for the subsequent chapters, where you will dive into the world of the modern data stack.

    The book will introduce you to the most widely adopted tools in the big data ecosystem, such as Apache Spark for data processing, Apache Airflow for pipeline orchestration, and Apache Kafka for real-time data ingestion. You will not only learn the theoretical concepts behind these technologies but also gain practical experience in implementing them on Kubernetes. Through a series of hands-on exercises and projects, you will develop a deep understanding of how to build and deploy data pipelines, process large datasets, and orchestrate complex workflows on a Kubernetes cluster.

    As the book progresses, you will explore advanced topics such as deploying a data consumption layer with tools such as Trino and Elasticsearch and integrating generative AI workloads using Amazon Bedrock. These topics will equip you with the knowledge and skills necessary to build and maintain a robust and scalable big data architecture on Kubernetes, ensuring efficient data processing, analysis, and analytics application deployment.

    By the end of this book, you will have gained a comprehensive understanding of the synergy between big data and Kubernetes, enabling you to leverage the power of these technologies to drive innovation and business growth. Whether you are a data engineer, a DevOps professional, or a technology enthusiast, this book will provide you with the practical knowledge and hands-on experience needed to successfully implement and manage big data workloads on Kubernetes.

    Who this book is for

If you are a data engineer, a cloud architect, a DevOps professional, a data or data science manager, or a technology enthusiast, this book is for you. You should have a basic background in Python and SQL programming, and basic knowledge of Apache Spark, Apache Kafka, and Apache Airflow. A basic understanding of Docker and Git will also be helpful.

    What this book covers

Chapter 1, Getting Started with Containers, embarks on a journey to understand containers and Docker, the foundational technologies for modern application deployment. You’ll learn how to install Docker and run your first container image, experiencing the power of containerization firsthand. Additionally, you’ll dive into the intricacies of Dockerfiles, mastering the art of crafting concise and functional container images. Through practical examples, including the construction of a simple API and a data processing job with Python, you’ll grasp the nuances of containerizing services and jobs. By the end of this chapter, you’ll have the opportunity to solidify your newfound knowledge by building your own job and API, laying the groundwork for a portfolio of practical container-based applications.
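To give a flavor of the kind of API service the chapter containerizes, here is a minimal Python sketch; the use of FastAPI, the endpoint, and the filename are illustrative assumptions rather than the book’s exact example:

# app.py - a minimal API sketch (illustrative only)
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    # A simple liveness endpoint, useful once the container runs on Kubernetes
    return {"status": "ok"}

# Run it locally (or inside a container) with, for example:
# uvicorn app:app --host 0.0.0.0 --port 8000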

Chapter 2, Kubernetes Architecture, introduces you to the core components that make up the Kubernetes architecture. You will learn about the control plane components such as the API server, etcd, scheduler, and controller manager, as well as the worker node components such as kubelet, kube-proxy, and container runtime. The chapter will explain the roles and responsibilities of each component, and how they interact with each other to ensure the smooth operation of a Kubernetes cluster. Additionally, you will gain an understanding of the key concepts in Kubernetes, including pods, deployments, services, jobs, stateful sets, persistent volumes, ConfigMaps, and secrets. By the end of this chapter, you will have a solid foundation in the architecture and core concepts of Kubernetes, preparing you for hands-on experience in the subsequent chapters.

Chapter 3, Getting Hands-On with Kubernetes, guides you through the process of deploying a local Kubernetes cluster using kind, and a cloud-based cluster on AWS using Amazon EKS. You will learn the minimal AWS account configuration required to successfully deploy an EKS cluster. After setting up the clusters, you will have the opportunity to choose between deploying your applications on the local or cloud environment. Regardless of your choice, you will revisit the API and data processing jobs developed in Chapter 1 and deploy them to Kubernetes. This hands-on experience will solidify your understanding of Kubernetes concepts and prepare you for more advanced topics in the following chapters.
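The chapter itself works with kubectl and YAML manifests; purely as an illustration of interacting with such a cluster from Python, here is a quick sketch that assumes the official kubernetes client library and an existing kubeconfig:

# list_pods.py - illustrative sketch; the book's exercises use kubectl, not this client
from kubernetes import client, config

# Load credentials from the local kubeconfig (works with kind, EKS, GKE, or AKS contexts)
config.load_kube_config()

v1 = client.CoreV1Api()
for pod in v1.list_pod_for_all_namespaces().items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)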

Chapter 4, The Modern Data Stack, introduces you to the most well-known data architecture designs, with a focus on the lambda architecture. You will learn about the tools that make up the modern data stack, which is a set of technologies used to implement a data lake(house) architecture. Among these tools are Apache Spark for data processing, Apache Airflow for data pipeline orchestration, and Apache Kafka for real-time event streaming and data ingestion. This chapter will provide a conceptual introduction to these tools and how they work together to build the core technology assets of a data lake(house) architecture.

Chapter 5, Big Data Processing with Apache Spark, introduces you to Apache Spark, one of the most popular tools for big data processing. You will understand the core components of a Spark program, how it scales and handles distributed processing, and best practices for working with Spark. You will implement simple data processing tasks using both the DataFrame API and the Spark SQL API, leveraging Python to interact with Spark. The chapter will guide you through installing Spark locally for testing purposes, enabling you to gain hands-on experience with this powerful tool before deploying it on a larger scale.
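As a taste of the two APIs mentioned here, a minimal PySpark sketch might look like the following; the file path and column names are placeholders, not data used in the book:

# spark_sketch.py - the DataFrame API and Spark SQL side by side (illustrative)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("preview-sketch").getOrCreate()

# DataFrame API: read a CSV and aggregate
df = spark.read.csv("data/example.csv", header=True, inferSchema=True)
df.groupBy("category").agg(F.avg("value").alias("avg_value")).show()

# Spark SQL API: the same aggregation expressed as SQL over a temporary view
df.createOrReplaceTempView("example")
spark.sql("SELECT category, AVG(value) AS avg_value FROM example GROUP BY category").show()

spark.stop()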

Chapter 6, Building Pipelines with Apache Airflow, introduces you to Apache Airflow, a widely adopted open source tool for data pipeline orchestration. You will learn how to install Airflow using Docker and the Astro CLI, making the setup process straightforward. The chapter will familiarize you with Airflow’s core features and the most commonly used operators for data engineering tasks. Additionally, you will gain insights into best practices for building resilient and efficient data pipelines that leverage Airflow’s capabilities to the fullest. By the end of this chapter, you will have a solid understanding of how to orchestrate complex data workflows using Airflow, a crucial skill for any data engineer or data architect working with big data on Kubernetes.
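To illustrate what such orchestration looks like, here is a minimal DAG sketch; the task logic, schedule, and DAG name are invented for illustration and assume a recent Airflow 2.x release:

# example_dag.py - a minimal two-task DAG (illustrative only)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from a source system")

def transform():
    print("cleaning and enriching the data")

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task  # run transform only after extract succeeds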

Chapter 7, Apache Kafka for Real-Time Events and Data Ingestion, introduces you to Apache Kafka, a distributed event streaming platform that is widely used for building real-time data pipelines and streaming applications. You will understand Kafka’s architecture and how it scales while being resilient, enabling it to handle high volumes of real-time data with low latency. You will learn about Kafka’s distributed topics design, which underpins its robust performance for real-time events. The chapter will guide you through running Kafka locally with Docker and implementing basic reading and writing operations on topics. Additionally, you will explore different strategies for data replication and topic distribution, ensuring you can design and implement efficient and reliable Kafka clusters.
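As a glimpse of those basic read and write operations, here is a sketch using the kafka-python client; the client library, broker address, and topic name are illustrative assumptions:

# kafka_sketch.py - write to and read from a topic (illustrative only)
from kafka import KafkaProducer, KafkaConsumer

# Assumes a broker reachable at localhost:9092, for example Kafka running locally in Docker
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"hello, kafka")
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 seconds with no new messages
)
for message in consumer:
    print(message.value.decode("utf-8"))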

Chapter 8, Deploying the Big Data Stack on Kubernetes, guides you through the process of deploying the big data tools you learned about in the previous chapters on a Kubernetes cluster. You will start by building bash scripts to deploy the Spark operator and run SparkApplications on Kubernetes. Next, you will deploy Apache Airflow to Kubernetes, enabling you to orchestrate data pipelines within the cluster. Additionally, you will deploy Apache Kafka on Kubernetes using both the ephemeral cluster and JBOD techniques. The Kafka Connect cluster will also be deployed, along with connectors to migrate data from SQL databases to persistent object storage. By the end of this chapter, you will have a fully functional big data stack running on Kubernetes, ready for further exploration and development.

Chapter 9, Data Consumption Layer, guides you through the process of securely making data available to business analysts in a big data architecture deployed on Kubernetes. You will start with an overview of a modern approach that uses a data lake engine instead of a traditional data warehouse. In this chapter, you will become familiar with Trino for consuming data directly from a data lake on Kubernetes. You will understand how a data lake engine works, deploy it into Kubernetes, and monitor query execution and history. Additionally, for real-time data, you will get familiar with Elasticsearch and Kibana for data consumption. You will deploy these tools, learn how to index data in them, and build a simple data visualization with Kibana.
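The chapter connects to Trino through DBeaver; as a complementary sketch, the same kind of query can be issued from Python with the trino client library (the host, catalog, and table names below are placeholders):

# trino_query.py - querying the lake through Trino (illustrative only)
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT category, COUNT(*) AS total FROM example_table GROUP BY category")
for row in cur.fetchall():
    print(row)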

Chapter 10, Building a Big Data Pipeline on Kubernetes, guides you through the process of deploying and orchestrating two complete data pipelines, one for batch processing and another for real-time processing, on a Kubernetes cluster. You will connect all the tools you’ve learned about throughout the book, such as Apache Spark, Apache Airflow, Apache Kafka, and Trino, to build a single, complex solution. You will deploy these tools on Kubernetes, write code for data processing and orchestration, and make the data available for querying through a SQL engine. By the end of this chapter, you will have hands-on experience in building and managing a comprehensive big data pipeline on Kubernetes, integrating various components and technologies into a cohesive and scalable architecture.

Chapter 11, Generative AI on Kubernetes, guides you through the process of deploying a generative AI application on Kubernetes using Amazon Bedrock as a service suite for foundational models. You will learn how to connect your application to a knowledge base serving as a Retrieval-Augmented Generation (RAG) layer, which enhances the AI model’s capabilities by providing access to external information sources. Additionally, you will discover how to automate task execution by the AI models with agents, enabling seamless integration of generative AI into your workflows. By the end of this chapter, you will have a solid understanding of how to leverage the power of generative AI on Kubernetes, unlocking new possibilities for personalized customer experiences, intelligent assistants, and automated business analytics.
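To illustrate the kind of call such an application makes, here is a minimal boto3 sketch against the Bedrock runtime; the model ID, region, and request payload follow Anthropic's Bedrock message format but are illustrative assumptions, not the book's exact code:

# bedrock_sketch.py - invoking a model through Amazon Bedrock (illustrative only)
import json

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Summarize what a data lakehouse is."}],
}
response = client.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps(body),
)
result = json.loads(response["body"].read())
print(result["content"][0]["text"])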

Chapter 12, Where to Go from Here, guides you through the next steps in your journey toward mastering big data and Kubernetes. You will explore crucial concepts and technologies that are essential for building robust and scalable solutions on Kubernetes. This includes monitoring both Kubernetes and your applications, implementing a service mesh for efficient communication, securing your cluster and applications, enabling automated scalability, embracing GitOps and CI/CD practices for streamlined deployment and management, and controlling Kubernetes costs. For each topic, you’ll receive an overview and recommendations on the technologies to explore further, empowering you to deepen your knowledge and skills in these areas.

    To get the most out of this book

Basic Python programming knowledge and some experience with Spark, Docker, Airflow, Kafka, and Git will help you get the most out of this book.

    All guidance needed for software installation will be provided in each chapter.

    If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Bigdata-on-Kubernetes. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: This command will pull the hello-world image from the Docker Hub public repository and run the application in it.

    A block of code is set as follows:

import pandas as pd

# Load the dataset directly from the URL into a DataFrame
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
df = pd.read_csv(url, header=None)

# Create a new column by doubling the values in column 5
df['newcolumn'] = df[5].apply(lambda x: x*2)
print(df.columns)
print(df.head())
print(df.shape)

    Any command-line input or output is written as follows:

    $ sudo apt install docker.io

    This is how the filename above the code snippet will look:

    Cjava.py

    Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: You should ensure that the Use WSL 2 instead of Hyper-V option is selected on the Configuration page.

    Tips or important notes

    Appear like this.

    Get in touch

    Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

    Piracy: If you come across any illegal copies of our works in any
