Using Jaeger tracing, I’ve been able to answer an important question that nearly every Apache Kafka® project that I’ve worked on posed: how is data flowing through my distributed system? Distributed tracing with Apache Kafka and Jaeger. Example of a Kafka project with Jaeger tracing. What does this all mean?
There are many ways that Apache Kafka has been deployed in the field. In our Kafka Summit 2021 presentation, we gave a brief overview of the many different configurations that have been observed to date. In this blog series, we will discuss each of these deployments and the deployment choices made, along with how they impact reliability.
Put another way, courtesy of Spencer Ruport: LISTENERS are the interfaces Kafka binds to. Apache Kafka® is a distributed system. You need to tell Kafka how the brokers can reach each other, but you also need to make sure that external clients (producers/consumers) can reach the broker they need to reach (on AWS, etc.). Is anyone listening?
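A minimal broker-configuration sketch of that distinction, assuming two hypothetical listener names (INTERNAL and EXTERNAL) and hypothetical hostnames; adjust addresses and security protocols for your environment:

```properties
# Interfaces and ports the broker binds to (what LISTENERS controls)
listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9093

# Addresses handed back to clients in metadata; they must be resolvable by
# whoever connects (other brokers internally, producers/consumers externally)
advertised.listeners=INTERNAL://kafka0.internal:9092,EXTERNAL://broker.example.com:9093

# Map each listener name to a security protocol
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT

# Listener used for broker-to-broker communication
inter.broker.listener.name=INTERNAL
```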
I’ve written an event sourcing bank simulation in Clojure (a Lisp built for the Java Virtual Machine, or JVM) called open-bank-mark, which you are welcome to read about in my previous blog post explaining the story behind this open source example. It also contains several functions wrapping the Java clients for Kafka.
In the first blog, we will share a short summary of the GokuS and GokuL architecture, the data format for Goku Long Term, and how we improved the bootstrap time for our storage and serving components. This is because the synthetic data points would be present in the retry Kafka topic, waiting to be pushed into the recovering host by the retry ingestor.
In part 1, we discussed an event streaming architecture that we implemented for a customer using Apache Kafka®, KSQL from Confluent, and Kafka Streams. In part 3, we’ll explore using Gradle to build and deploy KSQL user-defined functions (UDFs) and Kafka Streams microservices. Sample repository. gradlew composeUp.
With the release of Apache Kafka® 2.1.0, Kafka Streams introduced the processor topology optimization framework at the Kafka Streams DSL layer. In what follows, we provide some context around how a processor topology was generated inside Kafka Streams before 2.1. Kafka Streams topology generation 101.
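As a rough illustration of how that framework is enabled, here is a minimal sketch (topic names, serdes, and the application id are placeholders I've assumed, not from the original post): the optimization is opted into via configuration and takes effect when the properties are passed to StreamsBuilder#build.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class OptimizedTopologyExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topology-optimization-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Opt in to the DSL optimization framework introduced in Kafka 2.1
        props.put(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events");
        events.groupByKey()
              .count()
              .toStream()
              .to("event-counts", Produced.with(Serdes.String(), Serdes.Long()));

        // Passing the properties to build() lets the optimizer rewrite the topology
        Topology topology = builder.build(props);
        System.out.println(topology.describe());
    }
}
```

Printing Topology#describe() with and without the flag is a handy way to see what the optimizer actually changed.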
As discussed in part 2, I created a GitHub repository with Docker Compose functionality for starting a Kafka and Confluent Platform environment, as well as the code samples mentioned below. jar Zip file size: 5849 bytes, number of entries: 5. jar Zip file size: 11405084 bytes, number of entries: 7422. Kafka Streams.
Franz Kafka, 1897. Load balancing and scheduling are at the heart of every distributed system, and Apache Kafka® is no different. Kafka clients—specifically the Kafka consumer, Kafka Connect, and Kafka Streams, which are the focus in this post—have used a sophisticated, paradigmatic way of balancing resources since the very beginning.
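One concrete knob in that balancing machinery is the consumer's partition assignment strategy. The sketch below is my own minimal example (topic, group id, and broker address are assumed placeholders), showing a consumer that opts into the cooperative sticky assignor instead of the default eager assignors:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RebalanceAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // The assignor decides how partitions are balanced across group members;
        // the cooperative sticky assignor avoids stop-the-world rebalances.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                  CooperativeStickyAssignor.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r ->
                System.out.printf("%s-%d %s%n", r.topic(), r.partition(), r.value()));
        }
    }
}
```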
In this post, I will demonstrate how Kafka Connect is integrated into the Cloudera Data Platform (CDP), allowing users to manage and monitor their connectors in Streams Messaging Manager, while also touching on security features such as role-based access control and sensitive information handling. Kafka Connect. Streams Messaging Manager.
Streaming data from Apache Kafka into Delta Lake is an integral part of Scribd’s data platform, but has been challenging to manage and scale. We use Spark Structured Streaming jobs to read data from Kafka topics and write that data into Delta Lake tables. To serve this need, we created kafka-delta-ingest.
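For context, a stripped-down sketch of that Spark Structured Streaming pattern (not Scribd's actual job, and separate from kafka-delta-ingest itself): the topic name, broker address, and Delta/checkpoint paths are placeholders, and the Delta Lake connector must be on the classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaToDeltaJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-to-delta")
                .getOrCreate();

        // Read the Kafka topic as a streaming Dataset of key/value columns
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "events")
                .load()
                .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp");

        // Continuously append the records into a Delta Lake table
        StreamingQuery query = events.writeStream()
                .format("delta")
                .option("checkpointLocation", "/delta/_checkpoints/events")
                .start("/delta/events");

        query.awaitTermination();
    }
}
```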
In this blog post, we will focus on the latter feature set. The challenge, then, is to ingest and process these events in a scalable manner, i.e., scaling with the number of devices, which will be the focus of this post. In particular, the Kafka integration is the most relevant here.
In this blog post, we describe the journey DoorDash took using a service mesh to realize data transfer cost savings without sacrificing service quality. Storage traffic: Includes traffic from microservices to stateful systems such as Aurora PostgreSQL, CockroachDB, Redis, and Kafka.
Jeff Xiang | Senior Software Engineer, Logging Platform; Vahid Hashemian | Staff Software Engineer, Logging Platform. When it comes to PubSub solutions, few have achieved higher degrees of ubiquity, community support, and adoption than Apache Kafka, which has become the industry standard for data transportation at large scale.
When there is a full GC, it brings the data processing pipeline to a full halt and causes both back-pressure on upstream Kafka clusters and cascading failures in the downstream TSDB. P_young = S_eden / R_alloc, where P_young is the period between young GCs, S_eden is the size of Eden, and R_alloc is the rate of memory allocation (bytes per second).
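To make the formula concrete with illustrative numbers (not from the original post): with a 2 GB Eden space and an allocation rate of 200 MB/s, a young GC fires roughly every 2,048 MB / 200 MB/s ≈ 10 seconds, so doubling Eden to 4 GB would stretch the young GC period to about 20 seconds.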
GitHub has written an excellent blog post capturing the current state of its LLM integration architecture. The blog is an excellent read for understanding late-arriving data, backfilling, and incremental-processing complications. [link] Sophie Blee-Goldman: Kafka Streams and Rebalancing through the Ages. Consumers come and go.
For a more detailed introduction to BPF portability and CO-RE, see Andrii Nakryiko’s blog post on the subject. We also have an unmarshalling function to convert the raw bytes from the kernel into our structure. The post BPFAgent: eBPF for Monitoring at DoorDash appeared first on DoorDash Engineering Blog.
Your search for Apache Kafka interview questions ends right here! Let us now dive directly into the Apache Kafka interview questions and answers and help you get started with your Big Data interview preparation! How should you study for a Kafka interview? What is Kafka used for? What are the main APIs of Kafka?
RocksDB is a storage engine written as a C++ library, with a key/value interface where keys and values are arbitrary byte streams. Kafka: Mark KRaft as Production Ready – one of the most interesting changes to Kafka in recent years is that it now works without ZooKeeper. Of course, the main topic is data streaming.
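For a feel of that key/value interface, here is a minimal sketch using the RocksJava binding (the database path and the key/value contents are placeholders I've assumed):

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDbKeyValueExample {
    static {
        // Load the native RocksDB library bundled with the RocksJava binding
        RocksDB.loadLibrary();
    }

    public static void main(String[] args) throws RocksDBException {
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-demo")) {
            // Keys and values are arbitrary byte arrays
            db.put("user:42".getBytes(), "alice".getBytes());
            byte[] value = db.get("user:42".getBytes());
            System.out.println(new String(value));
        }
    }
}
```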
This three-part blog post series covers the efficiency improvements (see parts 1 and 2), and this final part will cover the reduction of the overall cost of Goku at Pinterest. Goku's ingestor component consumes from this Kafka topic and then produces into another Kafka topic (each partition corresponds to a GokuS shard).
Our esteemed roundtable included leading practitioners, thought leaders, and educators in the space, including Ben Rogojan, aka Seattle Data Guy, a data engineering and data science consultant (now based in the Rocky Mountain city of Denver) with a popular YouTube channel, Medium blog, and newsletter. Doing the pre-work is important.
Logs-As-A-Stream: Many messaging platforms, such as Kafka, Pulsar, and RabbitMQ, have what they advertise as Streams. FileInputStream: In our example later, we are going to process blog posts to parse tag metadata. class RealFakeInputStream[T](…) extends InputStream { val data: Array[Byte] = "0123456789"… run(sink).
This is just a hypothetical case that we are talking about, and if you prepare well, you will be able to answer any HBase interview question during your next Hadoop job interview, having read ProjectPro's Hadoop Interview Questions blogs. To iterate through these values in reverse order, the bytes of the actual value should be written twice.
Recommended Reading: 100 Kafka Interview Questions and Answers; 20 Linear Regression Interview Questions and Answers; Top 50 NLP Interview Questions and Answers; Top 20 Logistic Regression Interview Questions and Answers. How can you create a deep copy of a complete Java object along with its state?
New input formats: currently, the platform supports byte-based input. Having separate endpoints for them will keep the blast radius limited and isolated. The post Meeting DoorDash Growth with a Self-Service Logistics Configuration Platform appeared first on the DoorDash Engineering Blog.
Whether you are just starting your career as a Data Engineer or looking to take the next step, this blog will walk you through the most valuable data engineering certifications and help you make an informed decision about which one to pursue. Why Are Data Engineering Skills In Demand?
For input streams receiving data over the network (from Kafka, Flume, and others), the default persistence level is set to replicate the data to two nodes for fault tolerance. MEMORY_ONLY_SER: the RDD is stored as serialized Java objects, one byte array per partition. But the problem is, where do you start?
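As a small illustration of choosing a persistence level explicitly, here is a sketch in Java (the app name and sample data are placeholders of mine); serialized storage trades extra CPU on access for a smaller memory footprint than MEMORY_ONLY:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PersistenceLevelExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("persistence-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.parallelize(Arrays.asList("a", "b", "c"));

            // Store the RDD as serialized Java objects, one byte array per partition
            lines.persist(StorageLevel.MEMORY_ONLY_SER());

            System.out.println(lines.count());
        }
    }
}
```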
There are two big gaps in the Apache Kafka project when we think about operating a cluster. There are no solutions for these inside the Kafka project, but there are many good third-party tools for both problems. Cruise Control is integrated with Kafka through metrics reporting. About Cruise Control. Architecture. Metrics Reporting.
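As a sketch of what that metrics-reporting integration typically looks like on the broker side (class and config names are recalled from the Cruise Control project, not from this article; verify them against the version you deploy):

```properties
# server.properties on every broker: register the Cruise Control metrics reporter
# (the reporter jar must be on the broker's classpath; class name may differ by version)
metric.reporters=com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter

# Topic the reporter publishes broker/partition load metrics to;
# Cruise Control consumes it to build its cluster workload model
cruise.control.metrics.topic=__CruiseControlMetrics
```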
I walk through an end-to-end integration: requesting data from the car, streaming it into a Kafka topic, and using Rockset to expose the data via its API to create real-time visualisations in D3. Getting started with Kafka: when starting with any new tool, I find it best to look around and see the art of the possible.
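The "streaming it into a Kafka topic" step could look roughly like the following minimal producer sketch (the topic name, key, and JSON payload shape are all assumptions of mine, not the article's actual schema):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CarTelemetryProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A single hypothetical telemetry reading, keyed by vehicle id
            String payload = "{\"vin\":\"DEMO123\",\"battery_level\":81,\"speed_kmh\":0}";
            producer.send(new ProducerRecord<>("car-telemetry", "DEMO123", payload));
            producer.flush();
        }
    }
}
```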
The ML for large-scale production systems post highlights the improvement over the existing heuristic in the YouTube cache replacement algorithm: a new hybrid algorithm that combines a simple heuristic with a learned model improves the byte miss ratio at peak by ~9%. The blog talks about four types of architecture.
It's based on a talk I gave at one of our internal engineering meetups, adapted for a blog format. If we select Usage Type, we can see what exactly EC2-Other refers to: okay, so the majority of the EC2-Other cost comes from a usage type called EUC1-DataTransfer-Regional-Bytes. Still not clear? Wait, what? Yes, really!