
Big Data on Kubernetes

A practical guide to building efficient and scalable data solutions

Neylson Crepalde

    Copyright © 2024 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Group Product Manager: Apeksha Shetty

    Publishing Product Manager: Apeksha Shetty

    Book Project Manager: Aparna Ravikumar Nair

    Senior Editor: Sushma Reddy

    Technical Editor: Kavyashree K S

    Copy Editor: Safis Editing

    Proofreader: Sushma Reddy

    Indexer: Subalakshmi Govindhan

    Production Designer: Gokul Raj S T

    DevRel Marketing Executive: Nivedita Singh

    First published: July 2024

    Production reference: 1210624

    Published by Packt Publishing Ltd.

    Grosvenor House

    11 St Paul’s Square

    Birmingham

    B3 1RB, UK

    ISBN: 978-1-83546-214-0

    www.packtpub.com

    To my wife, Sarah, and my son, Joao Pedro, for their love and support. To Silvio Salej Higgins for being a great mentor.

    – Neylson Crepalde

    Contributors

    About the author

Neylson Crepalde is a generative AI strategist at Amazon Web Services (AWS). Before this, Neylson was chief technology officer at A3Data, a consulting business focused on data, analytics, and artificial intelligence. In his time as CTO, he worked with the company’s tech team to build a Big Data architecture on top of Kubernetes that inspired the writing of this book. Neylson holds a PhD in economic sociology, and he was a visiting scholar at the Centre de Sociologie des Organisations at Sciences Po, Paris. Neylson is also a frequent guest speaker at conferences and has taught in MBA programs for more than 10 years.

    I want to thank all the people who have worked with me in the development of this great architecture, especially Mayla Teixeira and Marcus Oliveira for their outstanding contributions.

    About the reviewer

Thariq Mahmood has 16 years of experience in data technology, with a strong skill set in Kubernetes, big data, data engineering, and DevOps across public cloud, private cloud, and on-premises environments. He has expertise in data warehousing, data modeling, and data security. He actively contributes to open source projects and has set up batch and streaming pipelines for various production environments using Databricks, Hadoop, Spark, Flink, and other cloud-native tools from AWS, Azure, and GCP. He has also implemented MLOps and DevSecOps in numerous projects. He currently helps organizations optimize their big data infrastructure costs and implement data lake and one-lake architectures on Kubernetes.

    Table of Contents

    Preface

    Part 1: Docker and Kubernetes

Chapter 1: Getting Started with Containers

    Technical requirements

    Container architecture

    Installing Docker

    Windows

    macOS

    Linux

    Getting started with Docker images

    hello-world

    NGINX

    Julia

    Building your own image

    Batch processing job

    API service

    Summary

Chapter 2: Kubernetes Architecture

    Technical requirements

    Kubernetes architecture

    Control plane

    Node components

    Pods

    Deployments

    StatefulSets

    Jobs

    Services

    ClusterIP Service

    NodePort Service

    LoadBalancer Service

    Ingress and Ingress Controller

    Gateway

    Persistent Volumes

    StorageClasses

    ConfigMaps and Secrets

    ConfigMaps

    Secrets

    Summary

Chapter 3: Getting Hands-On with Kubernetes

    Technical requirements

    Installing kubectl

Deploying a local cluster using kind

    Installing kind

    Deploying the cluster

    Deploying an AWS EKS cluster

    Deploying a Google Cloud GKE cluster

    Deploying an Azure AKS cluster

    Running your API on Kubernetes

    Creating the deployment

    Creating a service

    Using an ingress to access the API

    Running a data processing job in Kubernetes

    Summary

    Part 2: Big Data Stack

Chapter 4: The Modern Data Stack

    Data architectures

    The Lambda architecture

    The Kappa architecture

    Comparing Lambda and Kappa

    Data lake design for big data

    Data warehouses

    The rise of big data and data lakes

    The rise of the data lakehouse

    Implementing the lakehouse architecture

    Batch ingestion

    Storage

    Batch processing

    Orchestration

    Batch serving

    Data visualization

    Real-time ingestion

    Real-time processing

    Real-time serving

    Real-time data visualization

    Summary

Chapter 5: Big Data Processing with Apache Spark

    Technical requirements

    Getting started with Spark

    Installing Spark locally

    Spark architecture

    Spark executors

    Components of execution

    Starting a Spark program

    The DataFrame API and the Spark SQL API

    Transformations

    Actions

    Lazy evaluation

    Data partitioning

    Narrow versus wide transformations

Analyzing the Titanic dataset

    Working with real data

    How Spark performs joins

    Joining IMDb tables

    Summary

Chapter 6: Building Pipelines with Apache Airflow

    Technical requirements

    Getting started with Airflow

    Installing Airflow with Astro

    Airflow architecture

    Airflow’s distributed architecture

    Building a data pipeline

    Airflow integration with other tools

    Summary

Chapter 7: Apache Kafka for Real-Time Events and Data Ingestion

    Technical requirements

    Getting started with Kafka

    Exploring the Kafka architecture

    The PubSub design

    How Kafka delivers exactly-once semantics

    First producer and consumer

    Streaming from a database with Kafka Connect

    Real-time data processing with Kafka and Spark

    Summary

    Part 3: Connecting It All Together

Chapter 8: Deploying the Big Data Stack on Kubernetes

    Technical requirements

    Deploying Spark on Kubernetes

    Deploying Airflow on Kubernetes

    Deploying Kafka on Kubernetes

    Summary

Chapter 9: Data Consumption Layer

    Technical requirements

    Getting started with SQL query engines

    The limitations of traditional data warehouses

    The rise of SQL query engines

    The architecture of SQL query engines

    Deploying Trino in Kubernetes

    Connecting DBeaver with Trino

    Deploying Elasticsearch in Kubernetes

How Elasticsearch stores, indexes, and manages data

    Elasticsearch deployment

    Summary

Chapter 10: Building a Big Data Pipeline on Kubernetes

    Technical requirements

    Checking the deployed tools

    Building a batch pipeline

    Building the Airflow DAG

    Creating SparkApplication jobs

    Creating a Glue crawler

    Building a real-time pipeline

    Deploying Kafka Connect and Elasticsearch

    Real-time processing with Spark

    Deploying the Elasticsearch sink connector

    Summary

Chapter 11: Generative AI on Kubernetes

    Technical requirements

    What generative AI is and what it is not

    The power of large neural networks

    Challenges and limitations

    Using Amazon Bedrock to work with foundational models

    Building a generative AI application on Kubernetes

    Deploying the Streamlit app

    Building RAG with Knowledge Bases for Amazon Bedrock

    Adjusting the code for RAG retrieval

    Building action models with agents

    Creating a DynamoDB table

    Configuring the agent

    Deploying the application on Kubernetes

    Summary

Chapter 12: Where to Go from Here

    Important topics for big data in Kubernetes

    Kubernetes monitoring and application monitoring

    Building a service mesh

    Security considerations

    Automated scalability

    GitOps and CI/CD for Kubernetes

    Kubernetes cost control

    What about team skills?

    Key skills for monitoring

    Building a service mesh

    Security considerations

    Automated scalability

    Skills for GitOps and CI/CD

    Cost control skills

    Summary

    Index

    Other Books You May Enjoy

    Preface

    In today’s data-driven world, the ability to process and analyze vast amounts of data has become a critical competitive advantage for businesses across industries. Big data technologies have emerged as powerful tools to handle the ever-increasing volume, velocity, and variety of data, enabling organizations to extract valuable insights and drive informed decision-making. However, managing and scaling these technologies can be a daunting task, often requiring significant infrastructure and operational overhead.

    Enter Kubernetes, the open source container orchestration platform that has revolutionized the way we deploy and manage applications. By providing a standardized and automated approach to container management, Kubernetes has simplified the deployment and scaling of complex applications, including big data workloads. This book aims to bridge the gap between these two powerful technologies, guiding you through the process of implementing a robust and scalable big data architecture on Kubernetes.

    Throughout the chapters, you will embark on a comprehensive journey, starting with the fundamentals of containers and Kubernetes architecture. You will learn how to build and deploy Docker images, understand the core components of Kubernetes, and gain hands-on experience in setting up local and cloud-based Kubernetes clusters. This solid foundation will prepare you for the subsequent chapters, where you will dive into the world of the modern data stack.

    The book will introduce you to the most widely adopted tools in the big data ecosystem, such as Apache Spark for data processing, Apache Airflow for pipeline orchestration, and Apache Kafka for real-time data ingestion. You will not only learn the theoretical concepts behind these technologies but also gain practical experience in implementing them on Kubernetes. Through a series of hands-on exercises and projects, you will develop a deep understanding of how to build and deploy data pipelines, process large datasets, and orchestrate complex workflows on a Kubernetes cluster.

    As the book progresses, you will explore advanced topics such as deploying a data consumption layer with tools such as Trino and Elasticsearch and integrating generative AI workloads using Amazon Bedrock. These topics will equip you with the knowledge and skills necessary to build and maintain a robust and scalable big data architecture on Kubernetes, ensuring efficient data processing, analysis, and analytics application deployment.

    By the end of this book, you will have gained a comprehensive understanding of the synergy between big data and Kubernetes, enabling you to leverage the power of these technologies to drive innovation and business growth. Whether you are a data engineer, a DevOps professional, or a technology enthusiast, this book will provide you with the practical knowledge and hands-on experience needed to successfully implement and manage big data workloads on Kubernetes.

    Who this book is for

If you are a data engineer, a cloud architect, a DevOps professional, a data or data science manager, or a technology enthusiast, this book is for you. You should have a basic background in Python and SQL programming, and basic knowledge of Apache Spark, Apache Kafka, and Apache Airflow. A basic understanding of Docker and Git will also be helpful.

    What this book covers

Chapter 1, Getting Started with Containers, embarks on a journey to understand containers and Docker, the foundational technologies for modern application deployment. You’ll learn how to install Docker and run your first container image, experiencing the power of containerization firsthand. Additionally, you’ll dive into the intricacies of Dockerfiles, mastering the art of crafting concise and functional container images. Through practical examples, including the construction of a simple API and a data processing job with Python, you’ll grasp the nuances of containerizing services and jobs. By the end of this chapter, you’ll have the opportunity to solidify your newfound knowledge by building your own job and API, laying the groundwork for a portfolio of practical container-based applications.
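To give a flavor of the kind of API service the chapter containerizes, here is a minimal Python sketch; the use of FastAPI, the endpoint, and the filename are illustrative assumptions rather than the book’s exact example:

# app.py - a minimal API sketch (illustrative only)
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    # A simple liveness endpoint, useful once the container runs on Kubernetes
    return {"status": "ok"}

# Run it locally (or inside a container) with, for example:
# uvicorn app:app --host 0.0.0.0 --port 8000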

Chapter 2, Kubernetes Architecture, introduces you to the core components that make up the Kubernetes architecture. You will learn about the control plane components such as the API server, etcd, scheduler, and controller manager, as well as the worker node components such as kubelet, kube-proxy, and container runtime. The chapter will explain the roles and responsibilities of each component, and how they interact with each other to ensure the smooth operation of a Kubernetes cluster. Additionally, you will gain an understanding of the key concepts in Kubernetes, including pods, deployments, services, jobs, stateful sets, persistent volumes, ConfigMaps, and secrets. By the end of this chapter, you will have a solid foundation in the architecture and core concepts of Kubernetes, preparing you for hands-on experience in the subsequent chapters.

Chapter 3, Getting Hands-On with Kubernetes, guides you through the process of deploying a local Kubernetes cluster using kind, and a cloud-based cluster on AWS using Amazon EKS. You will learn the minimal AWS account configuration required to successfully deploy an EKS cluster. After setting up the clusters, you will have the opportunity to choose between deploying your applications on the local or cloud environment. Regardless of your choice, you will revisit the API and data processing jobs developed in Chapter 1 and deploy them to Kubernetes. This hands-on experience will solidify your understanding of Kubernetes concepts and prepare you for more advanced topics in the following chapters.
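The chapter itself works with kubectl and YAML manifests; purely as an illustration of interacting with such a cluster from Python, here is a quick sketch that assumes the official kubernetes client library and an existing kubeconfig:

# list_pods.py - illustrative sketch; the book's exercises use kubectl, not this client
from kubernetes import client, config

# Load credentials from the local kubeconfig (works with kind, EKS, GKE, or AKS contexts)
config.load_kube_config()

v1 = client.CoreV1Api()
for pod in v1.list_pod_for_all_namespaces().items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)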

Chapter 4, The Modern Data Stack, introduces you to the most well-known data architecture designs, with a focus on the lambda architecture. You will learn about the tools that make up the modern data stack, which is a set of technologies used to implement a data lake(house) architecture. Among these tools are Apache Spark for data processing, Apache Airflow for data pipeline orchestration, and Apache Kafka for real-time event streaming and data ingestion. This chapter will provide a conceptual introduction to these tools and how they work together to build the core technology assets of a data lake(house) architecture.

Chapter 5, Big Data Processing with Apache Spark, introduces you to Apache Spark, one of the most popular tools for big data processing. You will understand the core components of a Spark program, how it scales and handles distributed processing, and best practices for working with Spark. You will implement simple data processing tasks using both the DataFrame API and the Spark SQL API, leveraging Python to interact with Spark. The chapter will guide you through installing Spark locally for testing purposes, enabling you to gain hands-on experience with this powerful tool before deploying it on a larger scale.
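As a taste of the two APIs mentioned here, a minimal PySpark sketch might look like the following; the file path and column names are placeholders, not data used in the book:

# spark_sketch.py - the DataFrame API and Spark SQL side by side (illustrative)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("preview-sketch").getOrCreate()

# DataFrame API: read a CSV and aggregate
df = spark.read.csv("data/example.csv", header=True, inferSchema=True)
df.groupBy("category").agg(F.avg("value").alias("avg_value")).show()

# Spark SQL API: the same aggregation expressed as SQL over a temporary view
df.createOrReplaceTempView("example")
spark.sql("SELECT category, AVG(value) AS avg_value FROM example GROUP BY category").show()

spark.stop()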

Chapter 6, Building Pipelines with Apache Airflow, introduces you to Apache Airflow, a widely adopted open source tool for data pipeline orchestration. You will learn how to install Airflow using Docker and the Astro CLI, making the setup process straightforward. The chapter will familiarize you with Airflow’s core features and the most commonly used operators for data engineering tasks. Additionally, you will gain insights into best practices for building resilient and efficient data pipelines that leverage Airflow’s capabilities to the fullest. By the end of this chapter, you will have a solid understanding of how to orchestrate complex data workflows using Airflow, a crucial skill for any data engineer or data architect working with big data on Kubernetes.
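To illustrate what such orchestration looks like, here is a minimal DAG sketch; the task logic, schedule, and DAG name are invented for illustration and assume a recent Airflow 2.x release:

# example_dag.py - a minimal two-task DAG (illustrative only)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from a source system")

def transform():
    print("cleaning and enriching the data")

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task  # run transform only after extract succeeds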

Chapter 7, Apache Kafka for Real-Time Events and Data Ingestion, introduces you to Apache Kafka, a distributed event streaming platform that is widely used for building real-time data pipelines and streaming applications. You will understand Kafka’s architecture and how it scales while being resilient, enabling it to handle high volumes of real-time data with low latency. You will learn about Kafka’s distributed topics design, which underpins its robust performance for real-time events. The chapter will guide you through running Kafka locally with Docker and implementing basic reading and writing operations on topics. Additionally, you will explore different strategies for data replication and topic distribution, ensuring you can design and implement efficient and reliable Kafka clusters.
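As a glimpse of those basic read and write operations, here is a sketch using the kafka-python client; the client library, broker address, and topic name are illustrative assumptions:

# kafka_sketch.py - write to and read from a topic (illustrative only)
from kafka import KafkaProducer, KafkaConsumer

# Assumes a broker reachable at localhost:9092, for example Kafka running locally in Docker
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"hello, kafka")
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 seconds with no new messages
)
for message in consumer:
    print(message.value.decode("utf-8"))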

Chapter 8, Deploying the Big Data Stack on Kubernetes, guides you through the process of deploying the big data tools you learned about in the previous chapters on a Kubernetes cluster. You will start by building bash scripts to deploy the Spark operator and run SparkApplications on Kubernetes. Next, you will deploy Apache Airflow to Kubernetes, enabling you to orchestrate data pipelines within the cluster. Additionally, you will deploy Apache Kafka on Kubernetes using both the ephemeral cluster and JBOD techniques. The Kafka Connect cluster will also be deployed, along with connectors to migrate data from SQL databases to persistent object storage. By the end of this chapter, you will have a fully functional big data stack running on Kubernetes, ready for further exploration and development.

Chapter 9, Data Consumption Layer, guides you through the process of securely making data available to business analysts in a big data architecture deployed on Kubernetes. You will start with an overview of a modern approach that uses a data lake engine instead of a traditional data warehouse. In this chapter, you will become familiar with Trino for consuming data directly from a data lake on Kubernetes. You will understand how a data lake engine works, deploy it into Kubernetes, and monitor query execution and history. Additionally, for real-time data, you will get familiar with Elasticsearch and Kibana for data consumption. You will deploy these tools, learn how to index data in them, and build a simple data visualization with Kibana.
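The chapter connects to Trino through DBeaver; as a complementary sketch, the same kind of query can be issued from Python with the trino client library (the host, catalog, and table names below are placeholders):

# trino_query.py - querying the lake through Trino (illustrative only)
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT category, COUNT(*) AS total FROM example_table GROUP BY category")
for row in cur.fetchall():
    print(row)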

Chapter 10, Building a Big Data Pipeline on Kubernetes, guides you through the process of deploying and orchestrating two complete data pipelines, one for batch processing and another for real-time processing, on a Kubernetes cluster. You will connect all the tools you’ve learned about throughout the book, such as Apache Spark, Apache Airflow, Apache Kafka, and Trino, to build a single, complex solution. You will deploy these tools on Kubernetes, write code for data processing and orchestration, and make the data available for querying through a SQL engine. By the end of this chapter, you will have hands-on experience in building and managing a comprehensive big data pipeline on Kubernetes, integrating various components and technologies into a cohesive and scalable architecture.

Chapter 11, Generative AI on Kubernetes, guides you through the process of deploying a generative AI application on Kubernetes using Amazon Bedrock as a service suite for foundational models. You will learn how to connect your application to a knowledge base serving as a Retrieval-Augmented Generation (RAG) layer, which enhances the AI model’s capabilities by providing access to external information sources. Additionally, you will discover how to automate task execution by the AI models with agents, enabling seamless integration of generative AI into your workflows. By the end of this chapter, you will have a solid understanding of how to leverage the power of generative AI on Kubernetes, unlocking new possibilities for personalized customer experiences, intelligent assistants, and automated business analytics.
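To illustrate the kind of call such an application makes, here is a minimal boto3 sketch against the Bedrock runtime; the model ID, region, and request payload follow Anthropic's Bedrock message format but are illustrative assumptions, not the book's exact code:

# bedrock_sketch.py - invoking a model through Amazon Bedrock (illustrative only)
import json

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Summarize what a data lakehouse is."}],
}
response = client.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps(body),
)
result = json.loads(response["body"].read())
print(result["content"][0]["text"])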

Chapter 12, Where to Go from Here, guides you through the next steps in your journey toward mastering big data and Kubernetes. You will explore crucial concepts and technologies that are essential for building robust and scalable solutions on Kubernetes. This includes monitoring both Kubernetes and your applications, implementing a service mesh for efficient communication, securing your cluster and applications, enabling automated scalability, embracing GitOps and CI/CD practices for streamlined deployment and management, and controlling Kubernetes costs. For each topic, you’ll receive an overview and recommendations on the technologies to explore further, empowering you to deepen your knowledge and skills in these areas.

    To get the most out of this book

Basic Python programming knowledge and some experience with Spark, Docker, Airflow, Kafka, and Git will help you get the most out of this book.

    All guidance needed for software installation will be provided in each chapter.

    If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

    Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Bigdata-on-Kubernetes. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: This command will pull the hello-world image from the Docker Hub public repository and run the application in it.

    A block of code is set as follows:

import pandas as pd

# Load the dataset directly from the URL into a DataFrame
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
df = pd.read_csv(url, header=None)

# Create a new column by doubling the values in column 5
df['newcolumn'] = df[5].apply(lambda x: x*2)
print(df.columns)
print(df.head())
print(df.shape)

    Any command-line input or output is written as follows:

    $ sudo apt install docker.io

    This is how the filename above the code snippet will look:

    Cjava.py

    Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: You should ensure that the Use WSL 2 instead of Hyper-V option is selected on the Configuration page.

    Tips or important notes

    Appear like this.

    Get in touch

    Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

    Piracy: If you come across any illegal copies of our works in any
