Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
Big Data on Kubernetes
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Apeksha Shetty
Publishing Product Manager: Apeksha Shetty
Book Project Manager: Aparna Ravikumar Nair
Senior Editor: Sushma Reddy
Technical Editor: Kavyashree K S
Copy Editor: Safis Editing
Proofreader: Sushma Reddy
Indexer: Subalakshmi Govindhan
Production Designer: Gokul Raj S T
DevRel Marketing Executive: Nivedita Singh
First published: July 2024
Production reference: 1210624
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK
ISBN: 978-1-83546-214-0
www.packtpub.com
To my wife, Sarah, and my son, Joao Pedro, for their love and support. To Silvio Salej Higgins for being a great mentor.
– Neylson Crepalde
Contributors
About the author
Neylson Crepalde is a generative AI strategist at Amazon Web Services (AWS). Before this, Neylson was chief technology officer at A3Data, a consulting business focused on data, analytics, and artificial intelligence. In his time as CTO, he worked with the company’s tech team to build a big data architecture on top of Kubernetes that inspired the writing of this book. Neylson holds a PhD in economic sociology, and he was a visiting scholar at the Centre de Sociologie des Organisations at Sciences Po, Paris. Neylson is also a frequent guest speaker at conferences and has taught in MBA programs for more than 10 years.
I want to thank all the people who have worked with me in the development of this great architecture, especially Mayla Teixeira and Marcus Oliveira for their outstanding contributions.
About the reviewer
Thariq Mahmood has 16 years of experience in data technology and possesses a strong skill set in Kubernetes, big data, data engineering, and DevOps across public cloud, private cloud, and on-premises environments. He has expertise in data warehousing, data modeling, and data security. He actively contributes to projects on GitHub and has experience setting up batch and streaming pipelines for various production environments using Databricks, Hadoop, Spark, Flink, and other cloud-native tools from AWS, Azure, and GCP. He has also implemented MLOps and DevSecOps in numerous projects. He currently helps organizations optimize their big data infrastructure costs and implement data lake and one-lake architectures on Kubernetes.
Table of Contents
Preface
Part 1: Docker and Kubernetes
Chapter 1: Getting Started with Containers
Technical requirements
Container architecture
Installing Docker
Windows
macOS
Linux
Getting started with Docker images
hello-world
NGINX
Julia
Building your own image
Batch processing job
API service
Summary
Chapter 2: Kubernetes Architecture
Technical requirements
Kubernetes architecture
Control plane
Node components
Pods
Deployments
StatefulSets
Jobs
Services
ClusterIP Service
NodePort Service
LoadBalancer Service
Ingress and Ingress Controller
Gateway
Persistent Volumes
StorageClasses
ConfigMaps and Secrets
ConfigMaps
Secrets
Summary
Chapter 3: Getting Hands-On with Kubernetes
Technical requirements
Installing kubectl
Deploying a local cluster using Kind
Installing kind
Deploying the cluster
Deploying an AWS EKS cluster
Deploying a Google Cloud GKE cluster
Deploying an Azure AKS cluster
Running your API on Kubernetes
Creating the deployment
Creating a service
Using an ingress to access the API
Running a data processing job in Kubernetes
Summary
Part 2: Big Data Stack
Chapter 4: The Modern Data Stack
Data architectures
The Lambda architecture
The Kappa architecture
Comparing Lambda and Kappa
Data lake design for big data
Data warehouses
The rise of big data and data lakes
The rise of the data lakehouse
Implementing the lakehouse architecture
Batch ingestion
Storage
Batch processing
Orchestration
Batch serving
Data visualization
Real-time ingestion
Real-time processing
Real-time serving
Real-time data visualization
Summary
Chapter 5: Big Data Processing with Apache Spark
Technical requirements
Getting started with Spark
Installing Spark locally
Spark architecture
Spark executors
Components of execution
Starting a Spark program
The DataFrame API and the Spark SQL API
Transformations
Actions
Lazy evaluation
Data partitioning
Narrow versus wide transformations
Analyzing the Titanic dataset
Working with real data
How Spark performs joins
Joining IMDb tables
Summary
Chapter 6: Building Pipelines with Apache Airflow
Technical requirements
Getting started with Airflow
Installing Airflow with Astro
Airflow architecture
Airflow’s distributed architecture
Building a data pipeline
Airflow integration with other tools
Summary
Chapter 7: Apache Kafka for Real-Time Events and Data Ingestion
Technical requirements
Getting started with Kafka
Exploring the Kafka architecture
The PubSub design
How Kafka delivers exactly-once semantics
First producer and consumer
Streaming from a database with Kafka Connect
Real-time data processing with Kafka and Spark
Summary
Part 3: Connecting It All Together
Chapter 8: Deploying the Big Data Stack on Kubernetes
Technical requirements
Deploying Spark on Kubernetes
Deploying Airflow on Kubernetes
Deploying Kafka on Kubernetes
Summary
Chapter 9: Data Consumption Layer
Technical requirements
Getting started with SQL query engines
The limitations of traditional data warehouses
The rise of SQL query engines
The architecture of SQL query engines
Deploying Trino in Kubernetes
Connecting DBeaver with Trino
Deploying Elasticsearch in Kubernetes
How Elasticsearch stores, indexes, and manages data
Elasticsearch deployment
Summary
Chapter 10: Building a Big Data Pipeline on Kubernetes
Technical requirements
Checking the deployed tools
Building a batch pipeline
Building the Airflow DAG
Creating SparkApplication jobs
Creating a Glue crawler
Building a real-time pipeline
Deploying Kafka Connect and Elasticsearch
Real-time processing with Spark
Deploying the Elasticsearch sink connector
Summary
Chapter 11: Generative AI on Kubernetes
Technical requirements
What generative AI is and what it is not
The power of large neural networks
Challenges and limitations
Using Amazon Bedrock to work with foundational models
Building a generative AI application on Kubernetes
Deploying the Streamlit app
Building RAG with Knowledge Bases for Amazon Bedrock
Adjusting the code for RAG retrieval
Building action models with agents
Creating a DynamoDB table
Configuring the agent
Deploying the application on Kubernetes
Summary
Chapter 12: Where to Go from Here
Important topics for big data in Kubernetes
Kubernetes monitoring and application monitoring
Building a service mesh
Security considerations
Automated scalability
GitOps and CI/CD for Kubernetes
Kubernetes cost control
What about team skills?
Key skills for monitoring
Building a service mesh
Security considerations
Automated scalability
Skills for GitOps and CI/CD
Cost control skills
Summary
Index
Other Books You May Enjoy
Preface
In today’s data-driven world, the ability to process and analyze vast amounts of data has become a critical competitive advantage for businesses across industries. Big data technologies have emerged as powerful tools to handle the ever-increasing volume, velocity, and variety of data, enabling organizations to extract valuable insights and drive informed decision-making. However, managing and scaling these technologies can be a daunting task, often requiring significant infrastructure and operational overhead.
Enter Kubernetes, the open source container orchestration platform that has revolutionized the way we deploy and manage applications. By providing a standardized and automated approach to container management, Kubernetes has simplified the deployment and scaling of complex applications, including big data workloads. This book aims to bridge the gap between these two powerful technologies, guiding you through the process of implementing a robust and scalable big data architecture on Kubernetes.
Throughout the chapters, you will embark on a comprehensive journey, starting with the fundamentals of containers and Kubernetes architecture. You will learn how to build and deploy Docker images, understand the core components of Kubernetes, and gain hands-on experience in setting up local and cloud-based Kubernetes clusters. This solid foundation will prepare you for the subsequent chapters, where you will dive into the world of the modern data stack.
The book will introduce you to the most widely adopted tools in the big data ecosystem, such as Apache Spark for data processing, Apache Airflow for pipeline orchestration, and Apache Kafka for real-time data ingestion. You will not only learn the theoretical concepts behind these technologies but also gain practical experience in implementing them on Kubernetes. Through a series of hands-on exercises and projects, you will develop a deep understanding of how to build and deploy data pipelines, process large datasets, and orchestrate complex workflows on a Kubernetes cluster.
As the book progresses, you will explore advanced topics such as deploying a data consumption layer with tools such as Trino and Elasticsearch and integrating generative AI workloads using Amazon Bedrock. These topics will equip you with the knowledge and skills necessary to build and maintain a robust and scalable big data architecture on Kubernetes, ensuring efficient data processing, analysis, and analytics application deployment.
By the end of this book, you will have gained a comprehensive understanding of the synergy between big data and Kubernetes, enabling you to leverage the power of these technologies to drive innovation and business growth. Whether you are a data engineer, a DevOps professional, or a technology enthusiast, this book will provide you with the practical knowledge and hands-on experience needed to successfully implement and manage big data workloads on Kubernetes.
Who this book is for
If you are a data engineer, a cloud architect, a DevOps professional, a data science manager, or a technology enthusiast, this book is for you. You should have a basic background in Python and SQL programming, and basic knowledge of Apache Spark, Apache Kafka, and Apache Airflow. A basic understanding of Docker and Git will also be helpful.
What this book covers
Chapter 1, Getting Started with Containers, embarks on a journey to understand containers and Docker, the foundational technologies for modern application deployment. You’ll learn how to install Docker and run your first container image, experiencing the power of containerization firsthand. Additionally, you’ll dive into the intricacies of Dockerfiles, mastering the art of crafting concise and functional container images. Through practical examples, including the construction of a simple API and a data processing job with Python, you’ll grasp the nuances of containerizing services and jobs. By the end of this chapter, you’ll have the opportunity to solidify your newfound knowledge by building your own job and API, laying the groundwork for a portfolio of practical container-based applications.
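To give you a flavor of what is ahead, here is a minimal sketch of the kind of Dockerfile the chapter teaches you to write. The file and script names are illustrative, not the book’s exact example:
# start from a slim Python base image
FROM python:3.11-slim
WORKDIR /app
# copy and install dependencies first so Docker can cache this layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY process.py .
# run the batch job when the container starts
CMD ["python", "process.py"]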
Chapter 2, Kubernetes Architecture, introduces you to the core components that make up the Kubernetes architecture. You will learn about the control plane components such as the API server, etcd, scheduler, and controller manager, as well as the worker node components such as kubelet, kube-proxy, and container runtime. The chapter will explain the roles and responsibilities of each component, and how they interact with each other to ensure the smooth operation of a Kubernetes cluster. Additionally, you will gain an understanding of the key concepts in Kubernetes, including pods, deployments, services, jobs, stateful sets, persistent volumes, ConfigMaps, and secrets. By the end of this chapter, you will have a solid foundation in the architecture and core concepts of Kubernetes, preparing you for hands-on experience in the subsequent chapters.
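For orientation, the following is a minimal sketch of a Deployment manifest of the kind discussed in the chapter. The names and image are illustrative only:
# a Deployment that keeps two replicas of a container running
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
      - name: my-api
        image: nginx:1.25   # illustrative image
        ports:
        - containerPort: 80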
Chapter 3, Getting Hands-On with Kubernetes, guides you through the process of deploying a local Kubernetes cluster using kind, and a cloud-based cluster on AWS using Amazon EKS. You will learn the minimal AWS account configuration required to successfully deploy an EKS cluster. After setting up the clusters, you will have the opportunity to choose between deploying your applications on the local or cloud environment. Whichever you choose, you will revisit the API and data processing jobs developed in Chapter 1 and deploy them to Kubernetes. This hands-on experience will solidify your understanding of Kubernetes concepts and prepare you for more advanced topics in the following chapters.
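As a quick preview, spinning up a local cluster with kind and applying a manifest looks like this (the cluster name and manifest filename are illustrative):
$ kind create cluster --name bigdata
$ kubectl get nodes
$ kubectl apply -f api-deployment.yaml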
Chapter 4, The Modern Data Stack, introduces you to the most well-known data architecture designs, with a focus on the Lambda architecture. You will learn about the tools that make up the modern data stack, which is a set of technologies used to implement a data lake(house) architecture. Among these tools are Apache Spark for data processing, Apache Airflow for data pipeline orchestration, and Apache Kafka for real-time event streaming and data ingestion. This chapter will provide a conceptual introduction to these tools and how they work together to build the core technology assets of a data lake(house) architecture.
Chapter 5, Big Data Processing with Apache Spark, introduces you to Apache Spark, one of the most popular tools for big data processing. You will understand the core components of a Spark program, how it scales and handles distributed processing, and best practices for working with Spark. You will implement simple data processing tasks using both the DataFrame API and the Spark SQL API, leveraging Python to interact with Spark. The chapter will guide you through installing Spark locally for testing purposes, enabling you to gain hands-on experience with this powerful tool before deploying it on a larger scale.
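As a taste of what the chapter covers, here is a minimal PySpark sketch; the file path is illustrative (the chapter itself works with the Titanic dataset, among others):
from pyspark.sql import SparkSession

# start a local Spark session
spark = SparkSession.builder.appName("titanic-example").getOrCreate()

# read a CSV file into a DataFrame (file path is illustrative)
df = spark.read.csv("titanic.csv", header=True, inferSchema=True)

# groupBy is a lazy transformation; show() is the action that triggers execution
df.groupBy("Pclass").count().show()
spark.stop()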
Chapter 6, Building Pipelines with Apache Airflow, introduces you to Apache Airflow, a widely adopted open source tool for data pipeline orchestration. You will learn how to install Airflow using Docker and the Astro CLI, making the setup process straightforward. The chapter will familiarize you with Airflow’s core features and the most commonly used operators for data engineering tasks. Additionally, you will gain insights into best practices for building resilient and efficient data pipelines that leverage Airflow’s capabilities to the fullest. By the end of this chapter, you will have a solid understanding of how to orchestrate complex data workflows using Airflow, a crucial skill for any data engineer or data architect working with big data on Kubernetes.
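For reference, a minimal Airflow DAG looks like the following sketch (assuming Airflow 2.4 or later; the DAG and task names are illustrative):
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # placeholder task logic
    print("extracting data...")

with DAG(
    dag_id="example_pipeline",   # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)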
Chapter 7, Apache Kafka for Real-Time Events and Data Ingestion, introduces you to Apache Kafka, a distributed event streaming platform that is widely used for building real-time data pipelines and streaming applications. You will understand Kafka’s architecture and how it scales while being resilient, enabling it to handle high volumes of real-time data with low latency. You will learn about Kafka’s distributed topics design, which underpins its robust performance for real-time events. The chapter will guide you through running Kafka locally with Docker and implementing basic reading and writing operations on topics. Additionally, you will explore different strategies for data replication and topic distribution, ensuring you can design and implement efficient and reliable Kafka clusters.
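As a preview, producing a message with the kafka-python client looks like this sketch (the broker address and topic name are illustrative):
from kafka import KafkaProducer

# connect to a local broker
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# send() is asynchronous; flush() blocks until the message is delivered
producer.send("user-events", b'{"user_id": 1, "action": "click"}')
producer.flush()
producer.close()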
Chapter 8, Deploying the Big Data Stack on Kubernetes, guides you through the process of deploying the big data tools you learned about in the previous chapters on a Kubernetes cluster. You will start by building bash scripts to deploy the Spark operator and run SparkApplications on Kubernetes. Next, you will deploy Apache Airflow to Kubernetes, enabling you to orchestrate data pipelines within the cluster. Additionally, you will deploy Apache Kafka on Kubernetes using both the ephemeral cluster and JBOD techniques. The Kafka Connect cluster will also be deployed, along with connectors to migrate data from SQL databases to persistent object storage. By the end of this chapter, you will have a fully functional big data stack running on Kubernetes, ready for further exploration and development.
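In spirit, deploying an operator with Helm follows the pattern below. The chart name and repository URL are illustrative; consult each project’s documentation for the current values:
$ helm repo add spark-operator https://kubeflow.github.io/spark-operator
$ helm repo update
$ helm install spark-operator spark-operator/spark-operator --namespace spark-operator --create-namespace
$ kubectl get pods -n spark-operator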
Chapter 9, Data Consumption Layer, guides you through the process of securely making data available to business analysts in a big data architecture deployed on Kubernetes. You will start with an overview of a modern approach that uses a data lake engine instead of a data warehouse. You will become familiar with Trino for consuming data directly from a data lake through Kubernetes: you will understand how a data lake engine works, deploy one into Kubernetes, and monitor query execution and history. Additionally, for real-time data, you will get familiar with Elasticsearch and Kibana, deploying both tools, learning how to index data in them, and building a simple data visualization with Kibana.
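To illustrate the consumption side, here is a minimal sketch using the Trino Python client; the host, catalog, and table names are all illustrative:
import trino  # the Trino Python client (pip install trino)

conn = trino.dbapi.connect(
    host="trino.example.internal",  # illustrative host
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT count(*) FROM orders")  # illustrative table
print(cur.fetchall())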
Chapter 10, Building a Big Data Pipeline on Kubernetes, guides you through the process of deploying and orchestrating two complete data pipelines, one for batch processing and another for real-time processing, on a Kubernetes cluster. You will connect all the tools you’ve learned about throughout the book, such as Apache Spark, Apache Airflow, Apache Kafka, and Trino, to build a single, complex solution. You will deploy these tools on Kubernetes, write code for data processing and orchestration, and make the data available for querying through a SQL engine. By the end of this chapter, you will have hands-on experience in building and managing a comprehensive big data pipeline on Kubernetes, integrating various components and technologies into a cohesive and scalable architecture.
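For a sense of how Spark jobs are expressed declaratively, here is a minimal sketch of a SparkApplication manifest for the Spark operator; the image, file path, and resource sizes are illustrative:
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: batch-job          # illustrative name
  namespace: spark
spec:
  type: Python
  mode: cluster
  image: my-registry/spark-app:latest   # illustrative image
  mainApplicationFile: local:///app/process.py
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 1g
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
    memory: 1g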
Chapter 11, Generative AI on Kubernetes, guides you through the process of deploying a generative AI application on Kubernetes using Amazon Bedrock as a service suite for foundational models. You will learn how to connect your application to a knowledge base serving as a Retrieval-Augmented Generation (RAG) layer, which enhances the AI model’s capabilities by providing access to external information sources. Additionally, you will discover how to automate task execution by the AI models with agents, enabling seamless integration of generative AI into your workflows. By the end of this chapter, you will have a solid understanding of how to leverage the power of generative AI on Kubernetes, unlocking new possibilities for personalized customer experiences, intelligent assistants, and automated business analytics.
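As a preview, invoking a foundation model through Bedrock with boto3 looks like this sketch; the region, model ID, and prompt are illustrative, and the request body follows Anthropic’s Claude-on-Bedrock schema:
import json
import boto3

# create a Bedrock runtime client (region is illustrative)
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # illustrative model ID
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Summarize this quarter's sales."}],
    }),
)
print(json.loads(response["body"].read()))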
Chapter 12, Where to Go from Here, guides you through the next steps in your journey toward mastering big data and Kubernetes. You will explore crucial concepts and technologies that are essential for building robust and scalable solutions on Kubernetes. This includes monitoring strategies for both Kubernetes and your applications, implementing a service mesh for efficient communication, securing your cluster and applications, enabling automated scalability, embracing GitOps and CI/CD practices for streamlined deployment and management, and Kubernetes cost control. For each topic, you’ll receive an overview and recommendations on the technologies to explore further, empowering you to deepen your knowledge and skills in these areas.
To get the most out of this book
Basic Python programming knowledge and some experience with Spark, Docker, Airflow, Kafka, and Git will help you get the most out of this book.
All guidance needed for software installation will be provided in each chapter.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Bigdata-on-Kubernetes. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: This command will pull the hello-world image from the Docker Hub public repository and run the application in it.
A block of code is set as follows:
import pandas as pd
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
df = pd.read_csv(url, header=None)
df['newcolumn'] = df[5].apply(lambda x: x*2)
print(df.columns)
print(df.head())
print(df.shape)
Any command-line input or output is written as follows:
$ sudo apt install docker.io
This is how the filename above the code snippet will look:
Cjava.py
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: You should ensure that the Use WSL 2 instead of Hyper-V option is selected on the Configuration page.
Tips or important notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.