Hadoop, Pig, and Twitter (NoSQL East 2009), by Kevin Weil
A talk on the use of Hadoop and Pig inside Twitter, focusing on the flexibility and simplicity of Pig and how they help solve real-world big data problems.
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
The document provides an introduction to NoSQL and HBase. It discusses what NoSQL is, the different types of NoSQL databases, and compares NoSQL to SQL databases. It then focuses on HBase, describing its architecture and components such as the HMaster, RegionServers, and ZooKeeper. It explains how HBase stores and retrieves data, and the write process involving memstores and compaction. It also covers HBase shell commands for creating, inserting, querying, and deleting data.
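As a hedged illustration of the create/insert/query/delete operations described above, here is a small Python sketch using the happybase Thrift client rather than the HBase shell itself; the table name, column family, and row keys are hypothetical, and a Thrift server on localhost is assumed.

```python
# Minimal HBase create/put/get/scan/delete sketch using the happybase client.
# Assumes an HBase Thrift server on localhost:9090; all names are hypothetical.
import happybase

connection = happybase.Connection('localhost', port=9090)

# Create a table with one column family ('info'), analogous to the shell's `create`.
connection.create_table('users', {'info': dict()})
table = connection.table('users')

# Insert a row (shell: put 'users', 'row1', 'info:name', 'Alice').
table.put(b'row1', {b'info:name': b'Alice', b'info:city': b'Austin'})

# Query a single row (shell: get 'users', 'row1').
print(table.row(b'row1'))

# Scan rows by key prefix (shell: scan 'users').
for key, data in table.scan(row_prefix=b'row'):
    print(key, data)

# Delete the row (shell: deleteall 'users', 'row1').
table.delete(b'row1')

connection.close()
```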
Operating System, Topic: Memory Management, for B.Tech/B.Sc (C.S.)/BCA...
Memory management is the functionality of an operating system that handles or manages primary memory. Memory management keeps track of each and every memory location, whether it is allocated to some process or free. It determines how much memory is to be allocated to each process and decides which process will get memory at what time. It also tracks whenever memory is freed or unallocated and updates the status accordingly.
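To make that bookkeeping concrete, here is a purely illustrative Python sketch of a first-fit allocator that tracks which regions of a fixed primary memory belong to which process and which are free; real operating systems use far more sophisticated structures.

```python
# Illustrative first-fit memory manager: tracks allocated and free regions of a
# fixed-size primary memory. A teaching sketch, not an OS implementation.

class FirstFitAllocator:
    def __init__(self, total_size):
        # Each block is (start, size, owner); owner is None for free blocks.
        self.blocks = [(0, total_size, None)]

    def allocate(self, pid, size):
        """Give `size` units to process `pid` from the first free block that fits."""
        for i, (start, blk_size, owner) in enumerate(self.blocks):
            if owner is None and blk_size >= size:
                allocated = (start, size, pid)
                leftover = [(start + size, blk_size - size, None)] if blk_size > size else []
                self.blocks[i:i + 1] = [allocated] + leftover
                return start
        raise MemoryError(f"no free block large enough for process {pid}")

    def free(self, pid):
        """Mark every block owned by `pid` as free (no coalescing, for brevity)."""
        self.blocks = [(s, sz, None if o == pid else o) for s, sz, o in self.blocks]

    def status(self):
        return [(s, sz, o if o is not None else "free") for s, sz, o in self.blocks]


mm = FirstFitAllocator(100)
mm.allocate("P1", 30)
mm.allocate("P2", 50)
mm.free("P1")
print(mm.status())   # shows which locations are allocated and which are free
```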
This document provides an overview of patterns for scalability, availability, and stability in distributed systems. It discusses general recommendations like immutability and referential transparency. It covers scalability trade-offs around performance vs scalability, latency vs throughput, and availability vs consistency. It then describes various patterns for scalability including managing state through partitioning, caching, sharding databases, and using distributed caching. It also covers patterns for managing behavior through event-driven architecture, compute grids, load balancing, and parallel computing. Availability patterns like fail-over, replication, and fault tolerance are discussed. The document provides examples of popular technologies that implement many of these patterns.
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI, by Altinity Ltd
Graham Mainwaring and Robert Hodges summarize management of ClickHouse on Kubernetes using the ClickHouse Kubernetes Operator and introduce a new UI for it. Presented at the 15 Dec '22 SF Bay Area ClickHouse Meetup.
The document discusses various models of parallel and distributed computing including symmetric multiprocessing (SMP), cluster computing, distributed computing, grid computing, and cloud computing. It provides definitions and examples of each model. It also covers parallel processing techniques like vector processing and pipelined processing, and differences between shared memory and distributed memory MIMD (multiple instruction multiple data) architectures.
Data is being generated at rates never before encountered. The explosion of data threatens to consume all of our IT resources: People, budget, power, cooling and data center floor space. Are your systems coping with your data now? Will they continue to deliver as the stress on data centers increases and IT budgets dwindle?
Imagine if you could be ahead of the data explosion by being proactive about your storage instead of reactive. Now you can be, with NetApp's approach to the designs and deployment of storage systems. With it, you can take advantage of NetApp's latest storage enhancements and take control of your storage. This will allow you to focus on gathering more insights from your data and deliver more value to your business.
NetApp's most advanced storage solutions are NetApp Virtualization & scale out. By taking control of your existing storage platform with either solution, you get:
• Immortal Storage system
• Infinite scalability
• Best possible ROI from existing environment
Are you curious about KNIME Software?
Do you know the difference between KNIME Analytics Platform and KNIME Server?
Which data sources can KNIME connect to?
Can you run an R script from within a KNIME workflow? A Python script? Which other integrations are available?
How can KNIME help with ETL, data preparation, and general data manipulation? Which machine learning algorithms can KNIME offer?
This webinar answers all of these questions! There’s also information about connecting to big data clusters and how you can run all or part of your analysis on a big data platform. It also covers what you need to know about Microsoft Azure and Amazon AWS.
From cache to in-memory data grid. Introduction to Hazelcast, by Taras Matyashovsky
This presentation:
* covers the basics of caching and popular cache types (a minimal cache sketch follows after this list)
* explains the evolution from simple cache to distributed cache, and from distributed cache to IMDG
* does not describe the use of NoSQL solutions for caching
* is not intended as a product comparison or as promotion of Hazelcast as the best solution
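As a minimal, hedged sketch of the simplest cache type mentioned in the list above, here is a local in-process LRU cache built on Python's OrderedDict; a distributed cache or an IMDG such as Hazelcast spreads the same get/put idea across many nodes.

```python
# Tiny local LRU cache: the simplest cache type, kept entirely in one process.
# Distributed caches and IMDGs generalize the same get/put interface across nodes.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used entry


cache = LRUCache(capacity=2)
cache.put("user:1", {"name": "Alice"})
cache.put("user:2", {"name": "Bob"})
cache.get("user:1")                           # touch user:1
cache.put("user:3", {"name": "Carol"})        # evicts user:2, the coldest entry
print(cache.get("user:2"))                    # None: it was evicted
```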
Top 5 mistakes when writing Spark applications, by hadooparchbook
This document discusses common mistakes made when writing Spark applications and provides recommendations to address them. It covers issues like executors that are too small or too large, shuffle blocks exceeding size limits, data skew slowing jobs, and excessive stages. The key recommendations are to optimize executor and partition sizes, increase the number of partitions, use techniques like salting to address skew, and favor transformations like reduceByKey over groupByKey to minimize shuffles and memory usage.
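To make two of those recommendations concrete, here is a small PySpark sketch contrasting groupByKey with reduceByKey and showing key salting to spread a hot key; the dataset and key names are invented for illustration.

```python
# PySpark sketch: prefer reduceByKey over groupByKey, and salt hot keys to spread skew.
# Data and key names are illustrative only.
import random
from pyspark import SparkContext

sc = SparkContext(appName="skew-and-shuffle-sketch")
pairs = sc.parallelize([("us", 1), ("us", 1), ("us", 1), ("uk", 1)] * 1000)

# Anti-pattern: groupByKey ships every value across the shuffle before summing.
grouped_counts = pairs.groupByKey().mapValues(lambda vals: sum(vals))

# Better: reduceByKey combines values map-side, so far less data crosses the shuffle.
reduced_counts = pairs.reduceByKey(lambda a, b: a + b)
print(reduced_counts.collect())

# Salting: split the hot key ("us") into N sub-keys, aggregate, then strip the salt
# and aggregate once more, so no single partition carries the whole hot key.
NUM_SALTS = 8
salted = pairs.map(lambda kv: ((kv[0], random.randrange(NUM_SALTS)), kv[1]))
partial = salted.reduceByKey(lambda a, b: a + b)
final = partial.map(lambda kv: (kv[0][0], kv[1])).reduceByKey(lambda a, b: a + b)
print(final.collect())

sc.stop()
```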
With the expansion of big data and analytics, organizations are looking to incorporate data streaming into their business processes to make real-time decisions.
Join this webinar as we guide you through the buzz around data streams:
- Market trends in stream processing
- What is stream processing
- How does stream processing compare to traditional batch processing
- High and low volume streams
- The possibilities of working with data streaming and the benefits it provides to organizations
- The importance of spatial data in streams
How Apache Drives Music Recommendations At Spotify, by Josh Baer
The slides go through the high-level process of generating personalized playlists for all Spotify's users, using Apache big data products extensively.
Presentation given at Apache: Big Data Europe conference on September 29th, 2015 in Budapest.
This document introduces Apache Cassandra, a distributed column-oriented NoSQL database. It discusses Cassandra's architecture, data model, query language (CQL), and how to install and run Cassandra. Key points covered include Cassandra's linear scalability, high availability and fault tolerance. The document also demonstrates how to use the nodetool utility and provides guidance on backing up and restoring Cassandra data.
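As a hedged sketch of the CQL basics mentioned above, here is a short Python example using the DataStax cassandra-driver; the keyspace and table are hypothetical and a local single-node cluster is assumed.

```python
# Sketch of basic CQL against a local Cassandra node using the DataStax Python driver.
# Keyspace/table names are hypothetical; assumes Cassandra listening on 127.0.0.1:9042.
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace('demo')

session.execute("""
    CREATE TABLE IF NOT EXISTS users (
        user_id uuid PRIMARY KEY,
        name text,
        city text
    )
""")

uid = uuid.uuid4()
session.execute(
    "INSERT INTO users (user_id, name, city) VALUES (%s, %s, %s)",
    (uid, 'Alice', 'Austin'),
)

row = session.execute("SELECT name, city FROM users WHERE user_id = %s", (uid,)).one()
print(row.name, row.city)

cluster.shutdown()
```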
This document discusses threads and threading models. It defines a thread as the basic unit of CPU utilization consisting of a program counter, stack, and registers. Threads allow for simultaneous execution of tasks within the same process by switching between threads rapidly. There are three main threading models: many-to-one maps many user threads to one kernel thread; one-to-one maps each user thread to its own kernel thread; many-to-many maps user threads to kernel threads in a variable manner. Popular thread libraries include POSIX pthreads and Win32 threads.
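A small Python sketch of the one-to-one style described above: each thread created below is backed by its own kernel thread, and the scheduler switches between them rapidly. This is illustrative only; the many-to-one and many-to-many models live inside thread libraries and are not visible at this level.

```python
# One-to-one threading sketch: each threading.Thread maps to a kernel thread.
import threading
import time

def worker(task_id):
    # Each thread has its own stack and program counter but shares process memory.
    print(f"task {task_id} running in {threading.current_thread().name}")
    time.sleep(0.1)
    print(f"task {task_id} done")

threads = [threading.Thread(target=worker, args=(i,), name=f"worker-{i}") for i in range(4)]
for t in threads:
    t.start()      # the kernel scheduler interleaves these threads
for t in threads:
    t.join()       # wait for all tasks to finish
```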
This presentation is about NoSQL, which stands for "Not Only SQL". It covers the use of NoSQL for Big Data and the differences from RDBMS.
Information technology has led us into an era where the production, sharing, and use of information are part of everyday life, often without our being aware of it: it is now almost impossible not to leave a digital trail of many of the actions we perform each day, for example through digital content such as photos, videos, and blog posts, and everything that revolves around social networks (Facebook and Twitter in particular). Added to this, with the "Internet of Things" we see an increase in devices such as watches, bracelets, thermostats, and many other items that can connect to the network and therefore generate large data streams. This explosion of data justifies the birth of the term Big Data: it indicates data produced in large quantities, at remarkable speed, and in different formats, which requires processing technologies and resources that go far beyond conventional data management and storage systems.

It is immediately clear that (1) data storage models based on the relational model and (2) processing systems based on stored procedures and grid computations are not applicable in these contexts. Regarding point 1, RDBMSs, widely used for a great variety of applications, run into problems when the amount of data grows beyond certain limits. Scalability and implementation cost are only part of the disadvantages: very often, when facing the management of big data, variability, that is, the lack of a fixed structure, also represents a significant problem. This has given a boost to the development of NoSQL databases. The website NoSQL Databases defines NoSQL databases as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable." These databases are distributed, open source, horizontally scalable, without a predetermined schema (key-value, column-oriented, document-based, and graph-based), easily replicable, without ACID guarantees, and able to handle large amounts of data.

These databases are often integrated with processing tools based on the MapReduce paradigm proposed by Google in 2004. MapReduce, together with the open source Hadoop framework, represents the new model for distributed processing of large amounts of data, supplanting techniques based on stored procedures and computational grids (point 2). The relational model taught in basic database design courses has many limitations compared to the demands posed by new applications based on Big Data, which use NoSQL databases to store data and MapReduce to process large amounts of data.
Course Website http://pbdmng.datatoknowledge.it/
Contact me for further information and to download the slides
OPERATING SYSTEM SERVICES, OPERATING SYSTEM STRUCTURES, by priyasoundar
These slides will help engineering students understand the functionality and structure of an operating system. They will also help with exam preparation.
This document provides an overview of YARN (Yet Another Resource Negotiator), the resource management system for Hadoop. It describes the key components of YARN including the Resource Manager, Node Manager, and Application Master. The Resource Manager tracks cluster resources and schedules applications, while Node Managers monitor nodes and containers. Application Masters communicate with the Resource Manager to manage applications. YARN allows Hadoop to run multiple applications like Spark and HBase, improves on MapReduce scheduling, and transforms Hadoop into a distributed operating system for big data processing.
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop..., by Simplilearn
This presentation about Hadoop for beginners will help you understand what is Hadoop, why Hadoop, what is Hadoop HDFS, Hadoop MapReduce, Hadoop YARN, a use case of Hadoop and finally a demo on HDFS (Hadoop Distributed File System), MapReduce and YARN. Big Data is a massive amount of data which cannot be stored, processed, and analyzed using traditional systems. To overcome this problem, we use Hadoop. Hadoop is a framework which stores and handles Big Data in a distributed and parallel fashion. Hadoop overcomes the challenges of Big Data. Hadoop has three components HDFS, MapReduce, and YARN. HDFS is the storage unit of Hadoop, MapReduce is its processing unit, and YARN is the resource management unit of Hadoop. In this video, we will look into these units individually and also see a demo on each of these units.
Below topics are explained in this Hadoop presentation:
1. What is Hadoop
2. Why Hadoop
3. Big Data generation
4. Hadoop HDFS
5. Hadoop MapReduce
6. Hadoop YARN
7. Use of Hadoop
8. Demo on HDFS, MapReduce and YARN
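As a concrete sketch of the MapReduce model summarized above, here is the classic word-count pair of Hadoop Streaming scripts in Python; the script names and the streaming invocation depend on the cluster, so treat this as illustrative only.

```python
# mapper.py -- Hadoop Streaming mapper: emit (word, 1) for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Hadoop Streaming reducer: input arrives sorted by key, so counts
# for the same word are adjacent and can be summed with a single running total.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```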
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
CDH is a popular distribution of Apache Hadoop and related projects that delivers scalable storage and distributed computing through Apache-licensed open source software. It addresses challenges in storing and analyzing large datasets known as Big Data. Hadoop is a framework for distributed processing of large datasets across computer clusters using simple programming models. Its core components are HDFS for storage, MapReduce for processing, and YARN for resource management. The Hadoop ecosystem also includes tools like Kafka, Sqoop, Hive, Pig, Impala, HBase, Spark, Mahout, Solr, Kudu, and Sentry that provide functionality like messaging, data transfer, querying, machine learning, search, and authorization.
HBase is an open-source, distributed, versioned, key-value database modeled after Google's Bigtable. It is designed to store large volumes of sparse data across commodity hardware. HBase uses Hadoop for storage and provides real-time read and write capabilities. It scales horizontally and is highly fault tolerant through its master-slave architecture and use of Zookeeper for coordination. Data in HBase is stored in tables and indexed by row keys for fast lookup, with columns grouped into families and versions stored by timestamps.
Google uses the Google File System (GFS) to organize and manipulate huge files across its distributed computing system. The GFS breaks files into 64MB chunks that are each stored in 3 copies on different computers. A master server coordinates the system and tracks metadata while chunkservers store and serve the file chunks. The GFS architecture is made up of clients, a master server, and chunkservers and uses chunk handles and replication to improve reliability, availability, and performance at massive scales.
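A purely illustrative Python sketch of the chunking-and-replication idea described above: a file is split into fixed-size chunks, each chunk gets a handle, and a "master" records three replica locations per chunk. Every name here is hypothetical and nothing talks to a real GFS.

```python
# Toy sketch of GFS-style chunking: split a file into 64 MB chunks and assign each
# chunk handle to 3 chunkservers. Purely illustrative; no real GFS APIs exist here.
import itertools
import uuid

CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB chunks, as described above
REPLICATION = 3                         # three copies of every chunk
CHUNKSERVERS = ["cs-01", "cs-02", "cs-03", "cs-04", "cs-05"]

def chunk_handles_for(file_size):
    """Mint one chunk handle per 64 MB of the file (the master's job in GFS)."""
    num_chunks = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
    return [uuid.uuid4().hex for _ in range(num_chunks)]

def place_replicas(handles):
    """Round-robin replica placement: a miniature of the master's metadata table."""
    servers = itertools.cycle(CHUNKSERVERS)
    return {h: [next(servers) for _ in range(REPLICATION)] for h in handles}

# A hypothetical 200 MB file becomes 4 chunks, each stored on 3 chunkservers.
metadata = place_replicas(chunk_handles_for(200 * 1024 * 1024))
for handle, replicas in metadata.items():
    print(handle, "->", replicas)
```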
Architecture of a Kafka Camus infrastructure, by mattlieber
This document summarizes the results of a performance evaluation of Kafka and Camus to ingest streaming data into Hadoop. It finds that Kafka can ingest data at rates from 15,000-50,000 messages per second depending on data format (Avro is fastest). Camus can move the data to HDFS at rates from 54,000-662,000 records per second. Once in HDFS, queries on Avro-formatted data are fastest, with count and max aggregation queries completing in under 100 seconds for 20 million records. The customer's goal of 5000 events per second can be easily achieved with this architecture.
This talk was given by Jun Rao (Staff Software Engineer at LinkedIn) and Sam Shah (Senior Engineering Manager at LinkedIn) at the Analytics@Webscale Technical Conference (June 2013).
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses how Netflix evolved from using S3 and EMR to introducing Kafka and Kafka producers and consumers to handle 400 billion events per day. It covers challenges of scaling Kafka clusters and tuning Kafka clients and brokers. Finally, it outlines Netflix's roadmap which includes contributing to open source projects like Kafka and testing failure resilience.
LinkedIn Segmentation & Targeting Platform: A Big Data Application, by Amy W. Tang
This talk was given by Hien Luu (Senior Software Engineer at LinkedIn) and Siddharth Anand (Senior Staff Software Engineer at LinkedIn) at the Hadoop Summit (June 2013).
Espresso: LinkedIn's Distributed Data Serving Platform (Talk), by Amy W. Tang
This talk was given by Swaroop Jagadish (Staff Software Engineer @ LinkedIn) at the ACM SIGMOD/PODS Conference (June 2013). For the paper written by the LinkedIn Espresso Team, go here:
http://www.slideshare.net/amywtang/espresso-20952131
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn, by Amy W. Tang
This talk was given by Bhaskar Ghosh (Senior Director of Engineering, LinkedIn Data Infrastructure), at the Yale Oct 2012 Symposium on Big Data, in honor of Martin Schultz.
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
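To ground the topic/partition/consumer-group vocabulary above, here is a minimal, hedged Python sketch using the kafka-python client; the broker address and the topic name are assumptions for illustration.

```python
# Minimal Kafka producer/consumer sketch with kafka-python.
# Assumes a broker at localhost:9092 and a topic called "page-views".
from kafka import KafkaProducer, KafkaConsumer

# Producer: messages with the same key land in the same partition, preserving order per key.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(5):
    producer.send("page-views", key=b"user-42", value=f"view-{i}".encode())
producer.flush()

# Consumer: all consumers sharing a group_id split the topic's partitions between them.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,          # stop iterating when no new messages arrive
)
for record in consumer:
    print(record.partition, record.offset, record.key, record.value)
```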
The document discusses LinkedIn's communication architecture and network updates service. It describes how LinkedIn built scalable communication platforms to support its large professional network. The system evolved from handling 0 to 22 million members. It uses Java, databases like Oracle and MySQL, application servers like Tomcat and Jetty, and technologies like ActiveMQ, Lucene and Spring. The communication service handles messages and email delivery while the network updates service distributes short-lived notifications across LinkedIn's various clients and services.
This document describes Databus, a system used at LinkedIn for distributed data replication and change data capture. Some key points:
- Databus provides timeline consistency across distributed data systems by applying a logical clock to data changes and using a pull-based model for replication.
- It addresses the challenges of specialization in distributed data systems through standardization, isolation of consumers from sources, and handling slow consumers without impacting fast ones.
- The architecture includes fetchers that extract changes from databases, a relay for buffering changes, log and snapshot stores, and client libraries that allow applications to consume changes.
- Performance is optimized through partitioning, filtering, and scaling of consumers independently of sources (a toy sketch of the pull-based consumption pattern follows below).
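The following toy Python sketch illustrates only the general pull-based, checkpointed change-consumption pattern described above; it is not Databus code, and every name in it is invented for illustration.

```python
# Toy pull-based change consumer: poll a relay for changes after a checkpointed
# sequence number (a stand-in for a logical clock). Entirely hypothetical.
import time

class FakeRelay:
    """Stands in for a change relay; returns (scn, change) pairs after a given SCN."""
    def __init__(self):
        self.log = [(i, {"table": "member", "op": "UPDATE", "id": i}) for i in range(1, 6)]

    def pull(self, after_scn, batch_size=2):
        return [entry for entry in self.log if entry[0] > after_scn][:batch_size]

def consume(relay, checkpoint=0, max_idle_polls=3):
    idle = 0
    while idle < max_idle_polls:
        batch = relay.pull(after_scn=checkpoint)
        if not batch:
            idle += 1
            time.sleep(0.1)            # a slow consumer simply falls behind; it never
            continue                   # blocks the relay or faster consumers
        for scn, change in batch:
            print(f"apply change at SCN {scn}: {change}")
            checkpoint = scn           # advance the consumer-side checkpoint
        idle = 0
    return checkpoint

consume(FakeRelay())
```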
Building a Data Pipeline from Scratch - Joe Crobak, by Hakka Labs
A data pipeline is a unified system for capturing events for analysis and building products. It involves capturing user events from various sources, storing them in a centralized data warehouse, and performing analysis and building products using tools like Hadoop. Key components of a data pipeline include an event framework, message bus, data serialization, data persistence, workflow management, and batch processing. A Lambda architecture allows for both batch and real-time processing of data captured by the pipeline.
What is a distributed data science pipeline, and how to build one with Apache Spark and friends, by Andy Petrella
What a data product was before the world changed and got so complex.
Why distributed computing/data science is the solution.
What problems does that add?
How to solve most of them using the right technologies, such as Spark Notebook, Spark, Scala, Mesos, and so on, in an accompanying framework.
Intro to Apache Kafka I gave at the Big Data Meetup in Geneva in June 2016. Covers the basics and gets into some more advanced topics. Includes demo and source code to write clients and unit tests in Java (GitHub repo on the last slides).
Apache Kafka is a distributed streaming platform that allows building event-driven architectures. It provides high throughput and low latency for processing streaming data. Key features include event logging, publish-subscribe messaging, and stream processing capabilities. Some advantages are eventual consistency, scalability, fault tolerance and being more agile to maintain compared to traditional databases. It requires Zookeeper and the Java client API has undergone changes. Performance can be very high, with examples of LinkedIn processing 1.1 trillion messages per day and 2 million writes per second on modest hardware.
Realtime streaming architecture in INFINARIO, by Jozo Kovac
About our experience with real-time analysis of a never-ending stream of user events. We discuss the Lambda architecture, Kappa, Apache Kafka, and our own approach.
Non-interactive big data analysis prohibits experimentation and can interrupt the analyst's train of thought, but analyzing and drawing insights in real time is no easy task, with jobs often taking minutes or hours to complete. What if you want to put an interactive interface in front of that data that allows iterative insights? What if you need that interactive experience to be sub-second?
Traditional SQL and most MPP/NoSQL databases cannot run complex calculations over large data in a performant manner. Popular distributed systems such as Hadoop or Spark can execute jobs, but their job overhead prohibits sub-second response times. Learn how an in-memory computing framework enabled us to perform complex analysis jobs on massive data points with sub-second response times, allowing us to plug it into a simple, drag-and-drop web 2.0 interface.
The slides for the first ever SnappyData webinar. Covers SnappyData core concepts, programming models, benchmarks and more.
SnappyData is open sourced here: https://github.com/SnappyDataInc/snappydata
We also have a deep technical paper here: http://www.snappydata.io/snappy-industrial
We can be easily contacted on Slack, Gitter and more: http://www.snappydata.io/about#contactus
The document discusses Apache Kafka, a distributed publish-subscribe messaging system developed at LinkedIn. It describes how LinkedIn uses Kafka to integrate large amounts of user activity and other data across its products. Key aspects of Kafka's design allow it to scale to LinkedIn's high throughput requirements, including using a log structure and data partitioning for parallelism. LinkedIn relies on Kafka to transport over 500 billion messages per day between systems and for real-time analytics.
This document provides an overview of LinkedIn's use of big data. It discusses how data is important for LinkedIn's products and services. It describes LinkedIn's big data ecosystem, including tools used for data ingestion (Camus, Gobblin) and scheduling workflows (Azkaban). It provides details on the types and volumes of data handled, including over 900 Kafka topics ingesting 10TB of data daily, 300+ online database tables in Hadoop totaling 8TB, and a 186TB Teradata data warehouse. Automic tools help schedule external, Hadoop, and Teradata ETL jobs.
[This work was presented at SIGMOD'13.]
The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the "last mile" issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.
This document describes LinkedIn's "Big Data" ecosystem for machine learning and data mining. It discusses how LinkedIn uses Hadoop and related tools to extract insights from massive amounts of data and build predictive analytics applications. It outlines LinkedIn's solutions for easing the process of deploying machine learning models into production by providing seamless ingress of data into Hadoop and egress of results to various systems, abstracting away distributed systems concerns for researchers.
Building a Self-Service Hadoop Platform at LinkedIn with Azkaban, by DataWorks Summit
LinkedIn developed the Azkaban workflow manager to schedule and run Hadoop jobs. They created versions 1.0, 2.0, and 2.5 of Azkaban, adding new features like plug-ins, authentication, and a redesigned UI. Azkaban is now used by over 1,000 LinkedIn users to run 2,500 workflows and 30,000 jobs daily across multiple Hadoop clusters.
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ..., by David Chen
Hadoop comprises the core of LinkedIn’s data analytics infrastructure and runs a vast array of our data products, including People You May Know, Endorsements, and Recommendations. To schedule and run the Hadoop workflows that drive our data products, we rely on Azkaban, an open-source workflow manager developed and used at LinkedIn since 2009. Azkaban is designed to be scalable, reliable, and extensible, and features a beautiful and intuitive UI. Over the years, we have seen tremendous growth, both in the scale of our data and our Hadoop user base, which includes over a thousand developers, data scientists, and analysts. We evolved Azkaban to not only meet the demands of this scale, but also support query platforms including Pig and Hive and continue to be an easy to use, self-service platform. In this talk, we discuss how Azkaban’s monitoring and visualization features allow our users to quickly and easily develop, profile, and tune their Hadoop workflows.
DevOps in the Amazon Cloud – Learn from the pioneers: Netflix Suro, by Gaurav "GP" Pal
DevOps helps accelerate the delivery of software applications through automation and by removing Development & Operations silos. The Netflix Platform Engineering team has developed a robust data pipeline solution called SURO that has been open sourced. Come learn from the experiences of pioneers like Netflix how they are leveraging the data pipeline for new and innovative use cases. This is the presentation by Danny Yuan, Netflix Platform Engineering Team on operational and monitoring aspects of applications on cloud platforms.
The Big Data Analytics Ecosystem at LinkedIn, by rajappaiyer
LinkedIn has several data driven products that improve the experience of its users -- whether they are professionals or enterprises. Supporting this is a large ecosystem of systems and processes that provide data and insights in a timely manner to the products that are driven by it.
This talk provides an overview of the various components of this ecosystem which are:
- Hadoop
- Teradata
- Kafka
- Databus
- Camus
- Lumos
etc.
This document discusses LinkedIn's use of Kafka, Hadoop, Storm, and Couchbase in their big data pipeline. It provides an overview of each technology and how LinkedIn uses them together. Specifically, it describes how LinkedIn uses Kafka to stream data to Hadoop for analytics and report generation. It also discusses how LinkedIn uses Hadoop to pre-build and warm Couchbase buckets for improved performance. The presentation includes a use case of streaming member profile and activity data through Kafka to both Hadoop and Couchbase clusters.
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering..., by Mitul Tiwari
LinkedIn has a large professional network with 360M members. They build data-driven products using members' rich profile data. To do this, they ingest online data into offline systems using Apache Kafka. The data is then processed using Hadoop, Spark, Samza and Cubert to compute features and train models. Results are moved back online using Voldemort and Kafka. For example, People You May Know recommendations are generated by triangle closing in Hadoop and Cubert to count common connections faster. Site speed is monitored in real-time using Samza to join logs from different services.
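A toy Python sketch of the triangle-closing idea mentioned above: recommend non-connected members who share the most common connections. This is a local, in-memory illustration of the pattern only, not LinkedIn's Hadoop/Cubert implementation, and the member names are invented.

```python
# Toy "People You May Know" via triangle closing: recommend non-connections that
# share the most common connections. Illustrative only; real systems do this at
# scale in Hadoop/Cubert rather than in one process.
from collections import Counter

connections = {
    "alice": {"bob", "carol", "dave"},
    "bob":   {"alice", "carol"},
    "carol": {"alice", "bob", "erin"},
    "dave":  {"alice"},
    "erin":  {"carol"},
}

def people_you_may_know(member, top_n=3):
    counts = Counter()
    for friend in connections[member]:
        for friend_of_friend in connections[friend]:
            if friend_of_friend != member and friend_of_friend not in connections[member]:
                counts[friend_of_friend] += 1   # one more closed triangle
    return counts.most_common(top_n)

print(people_you_may_know("dave"))   # e.g. [('bob', 1), ('carol', 1)]
```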
Software Development & Architecture @ LinkedIn, by C4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1DAsghW.
Sid Anand discusses the architectural and development practices adopted by LinkedIn as a continuously growing company. Filmed at qconsf.com.
Siddharth “Sid" Anand is a hands-on software architect with deep experience building and scaling web sites that millions of people visit every day. Sid is currently the Chief Architect at ClipMine, a video mining and search start-up that improves the consumption of long-form video content through automated video analysis and search technology.
This document summarizes Patrick de Vries' presentation on connecting everything at the Hadoop Summit 2016. The presentation discusses KPN's use of Hadoop to manage increasing data and network capacity needs. It outlines KPN's data flow process from source systems to Hadoop for processing and generating reports. The presentation also covers lessons learned in implementing Hadoop including having strong executive support, addressing cultural challenges around data ownership, and leveraging existing investments. Finally, it promotes joining a new TELCO Hadoop community for telecommunications providers to share use cases and lessons.
Resilience: the key requirement of a [big] [data] architecture - StampedeCon..., by StampedeCon
From the StampedeCon 2015 Big Data Conference: There is an adage, "If you fail to plan, you plan to fail". When developing systems, the adage can be taken a step further: "If you fail to plan FOR FAILURE, you plan to fail". At Huffington Post, data moves between a number of systems to provide statistics for our technical, business, and editorial teams. Due to the mission-critical nature of our data, considerable effort is spent building resiliency into processes.
This talk will focus on designing for failure. Some material will focus on understanding the traits of specific distributed systems, such as message queues or NoSQL databases, and the consequences of different types of failures. Other parts of the presentation will focus on how systems and software can be designed to make re-processing batch data simple, or how to determine which failure-mode semantics are important for a real-time event processing system.
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka, by confluent
LinkedIn uses Apache Kafka extensively to power various data pipelines and platforms. Some key uses of Kafka include:
1) Moving data between systems for monitoring, metrics, search indexing, and more.
2) Powering the Pinot real-time analytics query engine which handles billions of documents and queries per day.
3) Enabling replication and partitioning for the Espresso NoSQL data store using a Kafka-based approach.
4) Streaming data processing using Samza to handle workflows like user profile evaluation. Samza is used for both stateless and stateful stream processing at LinkedIn.
The document provides statistics on the amount of data generated and shared on various digital platforms each day: over 1 terabyte of data from NYSE, 144.8 billion emails sent, 340 million tweets, 684,000 pieces of content shared on Facebook, 72 hours of new video uploaded to YouTube per minute, and more. It outlines the massive scale of data creation and sharing occurring across social media, financial, and other digital platforms.
Software Developer and Architecture @ LinkedIn (QCon SF 2014), by Sid Anand
The document provides details about Sid Anand's career and then discusses LinkedIn's software development process and architecture when he was there. It notes that when Sid started at LinkedIn in 2011, compiling the code took a long time due to the large codebase and many dependencies. It then describes how LinkedIn scaled to support hundreds of millions of members and thousands of employees by splitting the monolithic codebase into individual Git repos, using intermediate JARs to reduce dependencies, and connecting development machines to test environments instead of deploying everything locally. It also discusses LinkedIn's use of Kafka, search federation, and not making web service calls between data centers to scale across multiple data centers.
Developing Real-Time Data Pipelines with Apache Kafka, by Joe Stein
Developing Real-Time Data Pipelines with Apache Kafka http://kafka.apache.org/ is an introduction for developers about why and how to use Apache Kafka. Apache Kafka is a publish-subscribe messaging system rethought of as a distributed commit log. Kafka is designed to allow a single cluster to serve as the central data backbone. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers. Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages. For the Spring user, Spring Integration Kafka and Spring XD provide integration with Apache Kafka.
Real-time monitoring of Hadoop and Spark workflows, by Shankar Manian
This document summarizes LinkedIn's efforts to implement real-time monitoring of their big data workflows. Key points include:
- They collect metrics from Hadoop and Spark components using plugins and ship them to Kafka for storage and processing.
- A metrics pipeline calculates higher-level business metrics and stores them in Pinot for fast retrieval and Hive/Presto for long-term storage.
- Visualization, alerting, and analysis tools like Raptor and Third-Eye are used to analyze the metrics.
- Lessons learned include the importance of data quality, instrumentation, and integration with existing infrastructure.
Espresso: LinkedIn's Distributed Data Serving Platform (Paper), by Amy W. Tang
This paper, written by the LinkedIn Espresso Team, appeared at the ACM SIGMOD/PODS Conference (June 2013). To see the talk given by Swaroop Jagadish (Staff Software Engineer @ LinkedIn), go here:
http://www.slideshare.net/amywtang/li-espresso-sigmodtalk
This document provides an overview of LinkedIn's data infrastructure. It discusses LinkedIn's large user base and data needs for products like profiles, communications, and recommendations. It describes LinkedIn's data ecosystem with three paradigms for online, nearline and offline data. It then summarizes key parts of LinkedIn's data infrastructure, including Databus for change data capture, Voldemort for distributed key-value storage, Kafka for messaging, and Espresso for distributed data storage. Overall, the document outlines how LinkedIn builds scalable data solutions to power its products and services for its large user base.
DevOps in the Modern Era - Thoughtfully Critical Podcast, by Chris Wahl
https://youtu.be/735hP_01WV0
My journey through the world of DevOps! From the early days of breaking down silos between developers and operations to the current complexities of cloud-native environments. I'll talk about my personal experiences, the challenges we faced, and how the role of a DevOps engineer has evolved.
In this talk, Elliott explores how developers can embrace AI not as a threat, but as a collaborative partner.
We’ll examine the shift from routine coding to creative leadership, highlighting the new developer superpowers of vision, integration, and innovation.
We'll touch on security, legacy code, and the future of democratized development.
Whether you're AI-curious or already a prompt engineer, this session will help you find your rhythm in the new dance of modern development.
Domino IQ – What to Expect, First Steps and Use Cases, by panagenda
Webinar Recording: https://www.panagenda.com/webinars/domino-iq-what-to-expect-first-steps-and-use-cases/
HCL Domino iQ Server – From Ideas Portal to implemented Feature. Discover what it is, what it isn’t, and explore the opportunities and challenges it presents.
Key Takeaways
- What are Large Language Models (LLMs) and how do they relate to Domino iQ
- Essential prerequisites for deploying Domino iQ Server
- Step-by-step instructions on setting up your Domino iQ Server
- Share and discuss thoughts and ideas to maximize the potential of Domino iQ
Your startup on AWS - How to architect and maintain a Lean and Mean account J..., by angelo60207
Prevent infrastructure costs from becoming a significant line item on your startup’s budget! Serial entrepreneur and software architect Angelo Mandato will share his experience with AWS Activate (startup credits from AWS) and knowledge on how to architect a lean and mean AWS account ideal for budget minded and bootstrapped startups. In this session you will learn how to manage a production ready AWS account capable of scaling as your startup grows for less than $100/month before credits. We will discuss AWS Budgets, Cost Explorer, architect priorities, and the importance of having flexible, optimized Infrastructure as Code. We will wrap everything up discussing opportunities where to save with AWS services such as S3, EC2, Load Balancers, Lambda Functions, RDS, and many others.
ELNL2025 - Unlocking the Power of Sensitivity Labels - A Comprehensive Guide...., by Jasper Oosterveld
Sensitivity labels, powered by Microsoft Purview Information Protection, serve as the foundation for classifying and protecting your sensitive data within Microsoft 365. Their importance extends beyond classification and play a crucial role in enforcing governance policies across your Microsoft 365 environment. Join me, a Data Security Consultant and Microsoft MVP, as I share practical tips and tricks to get the full potential of sensitivity labels. I discuss sensitive information types, automatic labeling, and seamless integration with Data Loss Prevention, Teams Premium, and Microsoft 365 Copilot.
Discover 7 best practices for Salesforce Data Cloud to clean, integrate, secure, and scale data for smarter decisions and improved customer experiences.
Create Your First AI Agent with UiPath Agent Builder, by DianaGray10
Join us for an exciting virtual event where you'll learn how to create your first AI Agent using UiPath Agent Builder. This session will cover everything you need to know about what an agent is and how easy it is to create one using the powerful AI-driven UiPath platform. You'll also discover the steps to successfully publish your AI agent. This is a wonderful opportunity for beginners and enthusiasts to gain hands-on insights and kickstart their journey in AI-powered automation.
Establish Visibility and Manage Risk in the Supply Chain with Anchore SBOM, by Anchore
Over 70% of any given software application consumes open source software (most likely not even from the original source) and only 15% of organizations feel confident in their risk management practices.
With the newly announced Anchore SBOM feature, teams can start safely consuming OSS while mitigating security and compliance risks. Learn how to import SBOMs in industry-standard formats (SPDX, CycloneDX, Syft), validate their integrity, and proactively address vulnerabilities within your software ecosystem.
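Anchore's own import workflow isn't reproduced here; purely to show what such an SBOM contains, the sketch below reads a CycloneDX JSON file and lists its components. The file name is a placeholder, and this is not the Anchore API.

```python
# Minimal sketch: list components from a CycloneDX JSON SBOM (not Anchore's API).
import json

with open("sbom.cdx.json") as f:          # path is illustrative
    sbom = json.load(f)

for component in sbom.get("components", []):
    name = component.get("name", "<unknown>")
    version = component.get("version", "<unversioned>")
    purl = component.get("purl", "")
    print(f"{name} {version} {purl}")
```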
Mark Zuckerberg teams up with frenemy Palmer Luckey to shape the future of XR...Scott M. Graffius
Drawing on his background in AI, Agile, hardware, software, gaming, and defense, Scott M. Graffius explores the collaboration in “Meta and Anduril’s EagleEye and the Future of XR: How Gaming, AI, and Agile are Transforming Defense.” It’s a powerful case of cross-industry innovation—where gaming meets battlefield tech.
📖 Read the article: https://www.scottgraffius.com/blog/files/meta-and-anduril-eagleeye-and-the-future-of-xr-how-gaming-ai-and-agile-are-transforming-defense.html
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2025/06/state-space-models-vs-transformers-for-ultra-low-power-edge-ai-a-presentation-from-brainchip/
Tony Lewis, Chief Technology Officer at BrainChip, presents the “State-space Models vs. Transformers for Ultra-low-power Edge AI” tutorial at the May 2025 Embedded Vision Summit.
At the embedded edge, choices of language model architectures have profound implications on the ability to meet demanding performance, latency and energy efficiency requirements. In this presentation, Lewis contrasts state-space models (SSMs) with transformers for use in this constrained regime. While transformers rely on a read-write key-value cache, SSMs can be constructed as read-only architectures, enabling the use of novel memory types and reducing power consumption. Furthermore, SSMs require significantly fewer multiply-accumulate units—drastically reducing compute energy and chip area.
New techniques enable distillation-based migration from transformer models such as Llama to SSMs without major performance loss. In latency-sensitive applications, techniques such as precomputing input sequences allow SSMs to achieve sub-100 ms time-to-first-token, enabling real-time interactivity. Lewis presents a detailed side-by-side comparison of these architectures, outlining their trade-offs and opportunities at the extreme edge.
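To make the memory argument concrete, a rough back-of-the-envelope comparison is sketched below; all dimensions are illustrative defaults, not BrainChip's figures. A transformer's key-value cache grows linearly with sequence length, while an SSM carries a fixed-size recurrent state.

```python
# Back-of-the-envelope memory comparison (illustrative dimensions, not vendor figures).
def transformer_kv_cache_bytes(seq_len, layers=24, heads=16, head_dim=64, bytes_per_val=2):
    # Keys and values are cached per token, per layer (read-write memory).
    return seq_len * layers * heads * head_dim * 2 * bytes_per_val

def ssm_state_bytes(layers=24, d_model=1024, state_dim=16, bytes_per_val=2):
    # Fixed-size recurrent state, independent of sequence length.
    return layers * d_model * state_dim * bytes_per_val

for n in (256, 2048, 16384):
    print(f"seq_len={n:6d}  kv_cache={transformer_kv_cache_bytes(n)/1e6:8.1f} MB  "
          f"ssm_state={ssm_state_bytes()/1e6:8.1f} MB")
```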
Top 25 AI Coding Agents for Vibe Coders to Use in 2025.pdfSOFTTECHHUB
I've tested over 50 AI coding tools in the past year, and I'm about to share the 25 that actually work. Not the ones with flashy marketing or VC backing – the ones that will make you code faster, smarter, and with way less frustration.
Neural representations have shown the potential to accelerate ray casting in a conventional ray-tracing-based rendering pipeline. We introduce a novel approach called Locally-Subdivided Neural Intersection Function (LSNIF) that replaces bottom-level BVHs used as traditional geometric representations with a neural network. Our method introduces a sparse hash grid encoding scheme incorporating geometry voxelization, a scene-agnostic training data collection, and a tailored loss function. It enables the network to output not only visibility but also hit-point information and material indices. LSNIF can be trained offline for a single object, allowing us to use LSNIF as a replacement for its corresponding BVH. With these designs, the network can handle hit-point queries from any arbitrary viewpoint, supporting all types of rays in the rendering pipeline. We demonstrate that LSNIF can render a variety of scenes, including real-world scenes designed for other path tracers, while achieving a memory footprint reduction of up to 106.2x compared to a compressed BVH.
https://arxiv.org/abs/2504.21627
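The paper's actual architecture is more involved (sparse hash grid encoding, geometry voxelization, a tailored loss); the toy sketch below only illustrates the general shape of a learned intersection function whose outputs are visibility, hit point, and a material index. The layer sizes and the embedding table standing in for the hash grid are arbitrary choices for illustration, not the authors' design.

```python
# Illustrative shape of a learned intersection function (not the paper's exact LSNIF architecture).
import torch
import torch.nn as nn

class ToyNeuralIntersection(nn.Module):
    def __init__(self, feat_dim=32, hidden=64, num_materials=8):
        super().__init__()
        self.grid = nn.Embedding(2**14, feat_dim)            # crude stand-in for a sparse hash grid
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.visibility = nn.Linear(hidden, 1)               # hit / no hit
        self.hit_point = nn.Linear(hidden, 3)                # xyz of the intersection
        self.material = nn.Linear(hidden, num_materials)     # material index logits

    def forward(self, ray_origin, ray_dir, voxel_index):
        feat = self.grid(voxel_index)
        h = self.mlp(torch.cat([feat, ray_origin, ray_dir], dim=-1))
        return self.visibility(h), self.hit_point(h), self.material(h)

rays_o, rays_d = torch.rand(4, 3), torch.rand(4, 3)
vox = torch.randint(0, 2**14, (4,))
vis, hit, mat = ToyNeuralIntersection()(rays_o, rays_d, vox)
```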
What is Oracle EPM? A Guide to Oracle EPM Cloud: Everything You Need to KnowSMACT Works
In today's fast-paced business landscape, financial planning and performance management demand powerful tools that deliver accurate insights. Oracle EPM (Enterprise Performance Management) stands as a leading solution for organizations seeking to transform their financial processes. This comprehensive guide explores what Oracle EPM is, its key benefits, and how partnering with the right Oracle EPM consulting team can maximize your investment.
Boosting MySQL with Vector Search - THE VECTOR SEARCH CONFERENCE 2025.pdfAlkin Tezuysal
As the demand for vector databases and Generative AI continues to rise, integrating vector storage and search capabilities into traditional databases has become increasingly important. This session introduces the *MyVector Plugin*, a project that brings native vector storage and similarity search to MySQL. Unlike PostgreSQL, which offers interfaces for adding new data types and index methods, MySQL lacks such extensibility. However, by utilizing MySQL's server component plugin and UDF, the *MyVector Plugin* successfully adds a fully functional vector search feature within the existing MySQL + InnoDB infrastructure, eliminating the need for a separate vector database. The session explains the technical aspects of integrating vector support into MySQL, the challenges posed by its architecture, and real-world use cases that showcase the advantages of combining vector search with MySQL's robust features. Attendees will leave with practical insights on how to add vector search capabilities to their MySQL systems.
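The plugin's SQL surface isn't shown in the abstract; as a rough illustration of the core operation a vector search UDF performs, here is a brute-force cosine-similarity sketch in plain Python. The dimensions and data are made up.

```python
# Brute-force cosine similarity: the core operation behind a vector search UDF (illustrative only).
import numpy as np

def top_k(query, vectors, k=3):
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                               # cosine similarity against every stored vector
    order = np.argsort(-scores)[:k]
    return list(zip(order.tolist(), scores[order].tolist()))

embeddings = np.random.rand(1000, 384).astype(np.float32)   # e.g. rows stored alongside a table
print(top_k(np.random.rand(384).astype(np.float32), embeddings))
```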
6th Power Grid Model Meetup
Join the Power Grid Model community for an exciting day of sharing experiences, learning from each other, planning, and collaborating.
This hybrid in-person/online event will include a full day agenda, with the opportunity to socialize afterwards for in-person attendees.
If you have a hackathon proposal, tell us when you register!
About Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
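The project's own API is richer than this; purely to illustrate the kind of calculation such an engine performs, below is a minimal DC power-flow sketch on a made-up three-bus network. The susceptances and injections are arbitrary, and this is not the Power Grid Model interface.

```python
# Toy DC power flow on a 3-bus network (illustrative, not the Power Grid Model API).
import numpy as np

# Line susceptances (per unit) between buses: (from, to, b)
lines = [(0, 1, 10.0), (1, 2, 8.0), (0, 2, 5.0)]
n = 3
B = np.zeros((n, n))
for i, j, b in lines:
    B[i, i] += b; B[j, j] += b
    B[i, j] -= b; B[j, i] -= b

p = np.array([0.0, 0.6, -0.6])                  # net injections; bus 0 is the slack
theta = np.zeros(n)
theta[1:] = np.linalg.solve(B[1:, 1:], p[1:])   # solve the reduced system, slack angle = 0

flows = {(i, j): b * (theta[i] - theta[j]) for i, j, b in lines}
print("bus angles (rad):", theta)
print("line flows (p.u.):", flows)
```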
Scaling GenAI Inference From Prototype to Production: Real-World Lessons in S...Anish Kumar
Presented by: Anish Kumar
LinkedIn: https://www.linkedin.com/in/anishkumar/
This lightning talk dives into real-world GenAI projects that scaled from prototype to production using Databricks’ fully managed tools. Facing cost and time constraints, we leveraged four key Databricks features—Workflows, Model Serving, Serverless Compute, and Notebooks—to build an AI inference pipeline processing millions of documents (text and audiobooks).
This approach enables rapid experimentation, easy tuning of GenAI prompts and compute settings, seamless data iteration and efficient quality testing—allowing Data Scientists and Engineers to collaborate effectively. Learn how to design modular, parameterized notebooks that run concurrently, manage dependencies and accelerate AI-driven insights.
Whether you're optimizing AI inference, automating complex data workflows or architecting next-gen serverless AI systems, this session delivers actionable strategies to maximize performance while keeping costs low.
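Databricks Workflows and Model Serving are not reproduced here; the sketch below only shows the generic pattern of parameterized tasks fanned out concurrently over document batches, with a placeholder standing in for the real model-serving call.

```python
# Generic pattern: parameterized tasks run concurrently over document batches
# (the scoring call is a placeholder, not the Databricks Model Serving API).
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class TaskParams:
    prompt_template: str
    max_tokens: int

def score_batch(docs, params: TaskParams):
    # Placeholder for a real model-serving call.
    return [params.prompt_template.format(doc=d)[: params.max_tokens] for d in docs]

def run(batches, params):
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(score_batch, b, params) for b in batches]
        return [f.result() for f in futures]

batches = [[f"doc-{i}-{j}" for j in range(3)] for i in range(4)]
print(run(batches, TaskParams("Summarize: {doc}", max_tokens=256)))
```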
24. HADOOP SUMMIT 2013
Hadoop data load (Camus)
- Open sourced: https://github.com/linkedin/camus
- One job loads all events
- ~10 minute ETA on average from producer to HDFS
- Hive registration done automatically
- Schema evolution handled transparently