Blog and Bytes - Data Engineering Digest

Data Engineering Weekly #221

Data Engineering Weekly

MAY 25, 2025

The blog is an excellent compilation of the types of query engines on top of the lakehouse, their internal architecture, and benchmarking across various categories. I think the market is wide open for more innovations, as Onehouse announces a compute runtime named Quanton. [link] Gunnar Morling: What If We Could Rebuild Kafka From Scratch?

Data Engineer

Data Engineer Data Engineering Engineering PostgreSQL

Handling Network Throttling with AWS EC2 at Pinterest

Pinterest Engineering

APRIL 7, 2025

In this blog post, we'll discuss our experiences in identifying the challenges associated with EC2 network throttling. In the remainder of this blog post, we'll share how we root-cause and mitigate the above issues. This prompted us to engage with AWS and dive deep into the network performance of our clusters. 4xl with up to 12.5

AWS

AWS Bytes Data Ingestion Database

Google BigQuery: A Game-Changing Data Warehousing Solution

ProjectPro

JUNE 6, 2025

This blog is your comprehensive guide to Google BigQuery, its architecture, and a beginner-friendly tutorial on how to use Google BigQuery for your data warehousing activities. This blog presents a detailed overview of Google BigQuery and its architecture. Due to this, combining and contrasting the STRING and BYTES types is impossible.

Bytes

Bytes Google Cloud Data Warehouse Cloud Storage

Data Scientist Vs Data Analyst: Key Differences, Career Paths, and How to Choose the Right Role

WeCloudData

FEBRUARY 13, 2025

quintillion bytes of data are generated every day, and that's a great sign for anyone interested in a data-driven career. This blog focuses […] The post Data Scientist Vs Data Analyst: Key Differences, Career Paths, and How to Choose the Right Role appeared first on WeCloudData.

Bytes

Bytes BI Data Engineering

Fault Tolerance in Distributed Systems: Tracing with Apache Kafka and Jaeger

Confluent

JULY 24, 2019

Instead, in this post I will point you to an earlier blog post where I already answered that question and then I will focus on what should be your next question: now that I’m relying on Jaeger to trace how data is flowing through my distributed system, what if Jaeger goes down? Distributed tracing with Apache Kafka and Jaeger.

Kafka

Kafka Systems Bytes Project

Improving Efficiency Of Goku Time Series Database at Pinterest (Part 1)

Pinterest Engineering

NOVEMBER 22, 2023

In the first blog, we will share a short summary on the GokuS and GokuL architecture, data format for Goku Long Term, and how we improved the bootstrap time for our storage and serving components. More information about the architecture can be found in the GokuL blog and the cost reduction blog.

Database

Database Bytes Kafka Architecture

How Optimizing Memory Management with LMDB Boosted Performance on Our API Service

Pinterest Engineering

JANUARY 13, 2025

We used OO design to support various deserialization methods to mimic Python lists, sets, and dictionaries, using LMDB's byte-based key-value records. In the API processes, we maintain persistent read-only connections, allowing LMDB to paginate data present in virtual shared memory efficiently.

Management

Management Bytes Python Software Engineer

LLM finetuning memory requirements by Alex Birch

Scott Logic

NOVEMBER 23, 2023

Cost increases when gradient accumulation is enabled, or becomes ~free if used in concert with DDP. DDP usually costs ~4 bytes/param, but becomes cheaper if used in concert with AMP. DDP can be made 2.5 Transformer Math does not mention a "4 bytes/param master gradients" cost.

Bytes

Bytes Education IT Utilities

Open-Sourcing AvroTensorDataset: A Performant TensorFlow Dataset For Processing Avro Data

LinkedIn Engineering

JUNE 15, 2023

In this blog post, we will discuss the AvroTensorDataset API, techniques we used to improve data processing speeds by up to 162x over existing solutions (thereby decreasing overall training time by up to 66%), and performance results from benchmarks and production. an array within a map, within a union, etc…). Default is 128 * 1024 (128KB).

Datasets

Datasets Bytes Process Data Ingestion

Introducing Netflix’s Key-Value Data Abstraction Layer

Netflix Tech

SEPTEMBER 18, 2024

The first level is a hashed string ID (the primary key), and the second level is a sorted map of a key-value pair of bytes. A future blog post will describe the chunking architecture in more detail, including its intricacies and optimization strategies. The idempotency token ties all these writes together into one atomic operation.
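The two-level layout described above can be pictured with a small in-memory model. This is only a sketch, not Netflix's implementation: the class name `TwoLevelMap` and its methods are hypothetical, and a plain Python dictionary stands in for the hashed first level.

```python
import bisect


class TwoLevelMap:
    """Hypothetical model: record ID -> sorted map of bytes keys to bytes values."""

    def __init__(self):
        # First level: string ID (hashed by the dict) -> (sorted key list, key -> value).
        self._records = {}

    def put(self, record_id: str, key: bytes, value: bytes) -> None:
        keys, values = self._records.setdefault(record_id, ([], {}))
        if key not in values:
            bisect.insort(keys, key)  # keep second-level keys in sorted order
        values[key] = value  # overwrite if the key already exists

    def scan(self, record_id: str):
        """Return (key, value) pairs for one record, ordered by key."""
        keys, values = self._records.get(record_id, ([], {}))
        return [(k, values[k]) for k in keys]


store = TwoLevelMap()
store.put("user-1", b"b", b"2")
store.put("user-1", b"a", b"1")
store.put("user-1", b"b", b"3")
```

Keeping the second level sorted is what makes ordered scans within a single record cheap in this model.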

Bytes

Bytes Metadata Database Data

Memory Optimizations for Analytic Queries in Cloudera Data Warehouse

Cloudera

MARCH 2, 2022

You can read previous blog posts on Impala’s performance and querying techniques here – “New Multithreading Model for Apache Impala”, “Keeping Small Queries Fast – Short query optimizations in Apache Impala” and “Faster Performance for Selective Queries”. Total size of the Bucket is 16 bytes. Folding data into pointers.

Data Warehouse

Data Warehouse Bytes Data Business Intelligence

Pinterest is now on HTTP/3

Pinterest Engineering

FEBRUARY 23, 2023

These advancements fit well with Pinterest use cases — enabling faster connection establishment (time to first byte of first request), improved congestion control (large media as we have), multiplexing without TCP head-of-line blocking (multiple downloads at the same time), and continued in-flight requests when pinners’ device network/ip changes.

Bytes

Bytes Media Software Engineer Software Engineering

Life Cycle of Data Science Project

WeCloudData

MARCH 1, 2025

The world is becoming increasingly dependent on data; about 2.5 quintillion bytes of data are generated every day. Data is shaping our decisions, from personalized shopping experiences to checking weather forecasts before leaving home. All of these data science applications have a life cycle to follow.

Data Science

Data Science Project Bytes Data

Compare Redshift vs BigQuery vs Snowflake for Big Data Projects

ProjectPro

JUNE 6, 2025

This blog presents a detailed list of differences between three popular cloud data warehouses, Redshift, BigQuery, and Snowflake, to guide you to the most suitable tool for your big data and data warehousing projects. It can be challenging to pick the best data warehousing platform because so many options are available to enterprises.

Big Data

Big Data Project Bytes Data Storage

Launching the Engineering Blog

Zalando Engineering

JUNE 30, 2020

Our Engineering Blog was launched in June 2020, after the previous tech blog had been on a long break. What customizations we applied to design the blog and the publishing process. Static Site Generator: Our previous tech blog used a CMS which only a limited number of people had access to. So which static site generator to choose?

Engineering

Engineering Bytes AWS Python

Netflix Cloud Packaging in the Terabyte Era

Netflix Tech

SEPTEMBER 24, 2021

Our previous tech blog, “Packaging award-winning shows with award-winning technology”, detailed our packaging technology deployed on the streaming side. Writable MezzFS: As described in a previous blog post, MezzFS is a tool developed by Netflix that allows cloud storage objects to be mounted as local files via FUSE.

Cloud

Cloud Bytes Cloud Storage Media

Patching the PostgreSQL JDBC Driver

Zalando Engineering

NOVEMBER 8, 2023

Introduction: This blog post describes a recent contribution from Zalando to the Postgres JDBC driver to address a long-standing issue with the driver’s integration with Postgres’ logical replication that resulted in runaway Write-Ahead Log (WAL) growth. However, as you may imagine, this blog post concerns a path that is anything but happy.

PostgreSQL

PostgreSQL Java Database Bytes

Monitoring Cloudera DataFlow Deployments With Prometheus and Grafana

Cloudera

JANUARY 17, 2024

In this blog we will dive into how CDF-PC’s support for NiFi reporting tasks can be used to monitor key metrics in Prometheus and Grafana. By using component_name and “Hello World Prometheus,” we’re monitoring the bytes received aggregated by the entire process group and therefore the flow. Select the nifi_amount_bytes_received metric.

Bytes

Bytes Architecture Designing Building

How to Build a Multimodal RAG Pipeline in Python?

ProjectPro

JUNE 6, 2025

In this blog, we’ll break it all down—from its architecture, a hands-on tutorial to real-world applications—so you can see why it’s the next big leap in AI. But, how does it actually work? What makes it different from traditional RAG systems? Table of Contents What is Multimodal RAG? b64encode(buffered.getvalue()).decode("utf-8")
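The trailing fragment in the excerpt looks like the usual pattern for turning an in-memory buffer into a base64 string, which is how image data is often passed to a multimodal model. A minimal sketch of that pattern follows, using only the standard library; the function name `encode_buffer_to_base64` and the placeholder bytes are assumptions for illustration, and in a real pipeline the buffer would be filled by an image library rather than by hand.

```python
import base64
from io import BytesIO


def encode_buffer_to_base64(buffered: BytesIO) -> str:
    """Return the buffer's full contents as a UTF-8 base64 string."""
    return base64.b64encode(buffered.getvalue()).decode("utf-8")


# Placeholder bytes standing in for real image data (assumption).
raw = b"\x89PNG placeholder bytes"
buffered = BytesIO()
buffered.write(raw)

encoded = encode_buffer_to_base64(buffered)
```

`getvalue()` returns the whole buffer regardless of the current read position, so the helper can be called right after writing without a `seek(0)`.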

Building

Building Python Bytes Pharmaceutical

How to Stream JSON Data Using Server-Sent Events and FastAPI in Python over HTTP?

Workfall

SEPTEMBER 26, 2023

In this blog, we will cover: What are Server-Sent Events? We’re taking in 16 bytes of data at a time from the stream. This function will provide basic units of data in the form of raw bytes. These bytes can then be converted into a readable JSON format. appeared first on The Workfall Blog.
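Reading a stream 16 bytes at a time and turning the raw bytes into JSON can be sketched as below. This is an illustration rather than the article's code: the function name `iter_sse_json` is hypothetical, an in-memory `BytesIO` stands in for the HTTP response, and the parser is simplified (it assumes `\n` line endings and one JSON document per `data:` line).

```python
import io
import json


def iter_sse_json(stream, chunk_size=16):
    """Yield one parsed JSON object per `data:` line of a Server-Sent Events stream."""
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)  # raw bytes, at most chunk_size at a time
        if not chunk:
            break
        buffer += chunk
        # A blank line marks the end of an event; keep any partial event buffered.
        while b"\n\n" in buffer:
            raw_event, buffer = buffer.split(b"\n\n", 1)
            for line in raw_event.decode("utf-8").splitlines():
                if line.startswith("data:"):
                    yield json.loads(line[len("data:"):].strip())


# Simulated response body containing two events (assumption).
body = b'data: {"id": 1, "msg": "hello"}\n\ndata: {"id": 2, "msg": "world"}\n\n'
events = list(iter_sse_json(io.BytesIO(body)))
```

Buffering until the blank-line separator matters because a 16-byte chunk will usually end in the middle of an event, so decoding each chunk on its own would fail.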

Python

Python Bytes Coding Data

Postgres Aurora DB major version upgrade with minimal downtime

Lyft Engineering

MARCH 11, 2024

This blog would be of immense help to understand what happens under the hood with AWS blue/green deployment! The diff_bytes is 0 now! We now need to reset sequences in, which we accomplished with the following script: [link] This ensures that the sequence starts from the last entry of the individual tables.

Bytes

Bytes PostgreSQL AWS Database

Deploying Kafka Streams and KSQL with Gradle – Part 2: Managing KSQL Implementations

Confluent

MAY 29, 2019

If you want to follow along and execute all the commands included in this blog post (and the next), you can check out this GitHub repository, which also includes the necessary Docker Compose functionality for running a compatible KSQL and Confluent Platform environment using the recently released Confluent 5.2.1. Sample repository.

Kafka

Kafka Management Bytes SQL

Staying in the Zone: How DoorDash used a service mesh to manage data transfer, reducing hops and cloud spend

DoorDash Engineering

JANUARY 16, 2024

In this blog post, we describe the journey DoorDash took using a service mesh to realize data transfer cost savings without sacrificing service quality. This led us to use a number of observability tools, including VPC flow logs, eBPF agent metrics, and Envoy networking bytes metrics to rectify the situation.

Bytes

Bytes Cloud Management PostgreSQL

Snowflake Architecture and It's Fundamental Concepts

ProjectPro

JUNE 6, 2025

This blog walks you through what Snowflake does, the various features it offers, the Snowflake architecture, and so much more. BigQuery charges users depending on how many bytes are read or scanned. Launched in 2014, Snowflake is one of the most popular cloud data solutions on the market.

Architecture

Architecture IT Data Warehouse Amazon Web Services

BPFAgent: eBPF for Monitoring at DoorDash

DoorDash Engineering

AUGUST 15, 2023

For a more detailed introduction to BPF portability and CO-RE, see Andrii Nakryiko’s blog post on the subject. We also have an unmarshalling function to convert the raw bytes from the kernel into our structure. The post BPFAgent: eBPF for Monitoring at DoorDash appeared first on DoorDash Engineering Blog.

Bytes

Bytes PostgreSQL Coding Database

Introducing Netflix TimeSeries Data Abstraction Layer

Netflix Tech

OCTOBER 8, 2024

In previous blog posts, we introduced the Key-Value Data Abstraction Layer and the Data Gateway Platform, both of which are integral to Netflix’s data architecture. We may go into detail on this subject in one of our future blog posts. The next section describes how this is achieved.

Bytes

Bytes Datasets Metadata Data

Optimizing Hive on Tez Performance

Cloudera

MAY 9, 2022

Refer to the YARN – The Capacity Scheduler blog to understand these configuration settings. This can be tuned using the user limit factor of the YARN queue (refer to the details in the Capacity Scheduler blog). Tez determines the reducers automatically based on the data (number of bytes) to be processed.

Bytes

Bytes SQL Professional Services Utilities

The Rise of Unstructured Data

Cloudera

NOVEMBER 15, 2021

This blog discusses quantifications, types, and implications of data. The International Data Corporation (IDC) estimates that by 2025 the sum of all data in the world will be in the order of 175 Zettabytes (one Zettabyte is 10^21 bytes). The post The Rise of Unstructured Data appeared first on Cloudera Blog.

Unstructured Data

Unstructured Data Pipeline-centric Database-centric Entertainment

100+ Kafka Interview Questions and Answers for 2025

ProjectPro

JUNE 6, 2025

This blog brings you the most popular Kafka interview questions and answers divided into various categories such as Apache Kafka interview questions for beginners, Advanced Kafka interview questions/Apache Kafka interview questions for experienced, Apache Kafka Zookeeper interview questions, etc. What do you understand about quotas in Kafka?

Kafka

Kafka Bytes Big Data Java

How to Navigate the Costs of Legacy SIEMS with Snowflake

Snowflake

APRIL 18, 2024

This blog post explores how Snowflake can help with this challenge. In the cloud, computing can be measured in various ways, like bytes scanned or CPU cycles. But what if security teams didn’t have to make tradeoffs? Detection and investigation processing: Security teams depend on detection rules to find important events automatically.

Data Lake

Data Lake Data Ingestion Bytes Cloud Computing

HBase Interview Questions and Answers for 2025

ProjectPro

JUNE 6, 2025

This is just a hypothetical case that we are talking about and if you prepare well, you will be able to answer any HBase Interview Question, during your next Hadoop job interview, having read ProjectPro Hadoop Interview Questions blogs. To iterate through these values in reverse order, the bytes of the actual value should be written twice.

Hadoop

Hadoop Bytes Metadata MongoDB

Packaging award-winning shows with award-winning technology

Netflix Tech

FEBRUARY 25, 2021

By Cyril Concolato. Introduction: In previous blog posts, our colleagues at Netflix have explained how 4K video streams are optimized, how even legacy video streams are improved and more recently how new audio codecs can provide better aural experiences to our members. Figure 1: Simplified

Technology

Technology Bytes Media Entertainment

Getting Started with Rust and Apache Kafka

Confluent

OCTOBER 24, 2019

I’ve written an event sourcing bank simulation in Clojure (a Lisp built for Java virtual machines or JVMs) called open-bank-mark, which you are welcome to read about in my previous blog post explaining the story behind this open source example. Make sure it is indeed an ID and that the Value matches the expected type Fixed, with 16 bytes.

Kafka

Kafka Java Banking Bytes

Understanding LLM Parameters: Inside the Engine of LLMs

ProjectPro

JUNE 6, 2025

This blog covers the LLM parameters in detail and how to tweak them to control the model configuration and optimize performance. This blog will explore the key LLM parameters, detailing how they influence model performance and providing practical optimization tips.

Engineering

Engineering Bytes Architecture Datasets

A Glimpse into the Redesigned Goku-Ingestor vNext at Pinterest

Pinterest Engineering

NOVEMBER 28, 2023

P_young = S_eden / R_alloc, where P_young is the period between young GCs, S_eden is the size of Eden, and R_alloc is the rate of memory allocations (bytes per second). To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site.
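As a quick worked example of the relation above (period between young GCs = Eden size / allocation rate), the sketch below plugs in illustrative numbers. The Eden size and allocation rate are made-up values, not figures from the Pinterest post, and the function name is hypothetical.

```python
def young_gc_period_seconds(eden_bytes: int, alloc_bytes_per_sec: float) -> float:
    """Estimate the time between young GCs as Eden size divided by allocation rate."""
    if alloc_bytes_per_sec <= 0:
        raise ValueError("allocation rate must be positive")
    return eden_bytes / alloc_bytes_per_sec


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical inputs: a 2 GiB Eden filling at 512 MiB per second.
period = young_gc_period_seconds(2 * GIB, 512 * MIB)
print(f"young GC roughly every {period:.1f} s")  # prints "young GC roughly every 4.0 s"
```

Under this relation, halving the allocation rate or doubling Eden doubles the interval between young collections.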

Kafka

Kafka Bytes Architecture Software Engineer

Building a Simple CRUD web application and image store using Cloudera Operational Database and Flask

Cloudera

OCTOBER 6, 2020

In this blog, I will demonstrate how COD can easily be used as a backend system to store data and images for a simple web application. The post Building a Simple CRUD web application and image store using Cloudera Operational Database and Flask appeared first on Cloudera Blog.

Database

Database Building Bytes NoSQL

Solving Espresso’s scalability and performance challenges to support our member base

LinkedIn Engineering

SEPTEMBER 7, 2023

Espresso System Overview Figure 1 is a high-level overview of the Espresso ecosystem, which includes the online operation section of Espresso (the main focus of this blog post). Improvements to Encode/Decode performance This section focuses on the performance improvements we made when converting bytes to Http objects and vice versa.

Bytes

Bytes Transportation Utilities Java

Netflix Drive

Netflix Tech

MAY 5, 2021

We will cover the different namespaces of Netflix Drive in more detail in a subsequent blog post. Data Store Characteristics Netflix Drive relies on a data store that allows streaming bytes into files/objects persisted on the storage media. The transfer mechanism for transport of bytes is a function of the data store.

Metadata

Metadata Bytes Media Cloud Storage

Data Engineering Weekly #151

Data Engineering Weekly

DECEMBER 3, 2023

GitHub writes an excellent blog to capture the current state of the LLM integration architecture. The blog is an excellent read to understand late-arriving data, backfilling, and incremental processing complications. I experienced similar drawbacks to what Lyft is talking about in Druid. Rebalancing, the awkward middle child.

Data Engineer

Data Engineer Data Engineering Engineering Bytes

Practical API Design at Netflix, Part 1: Using Protobuf FieldMask

Netflix Tech

SEPTEMBER 3, 2021

If a consumer is only interested in production titles and format, they can set a FieldMask with paths “title” and “format”: [link] Masking fields Please note, even though code samples in this blog post are written in Java, demonstrated concepts apply to any other language supported by protocol buffers. Field names are not included.

Designing

Designing Java Bytes Utilities

Apache Kafka Deployments and Systems Reliability – Part 1

Cloudera

SEPTEMBER 20, 2021

In this blog series, we will discuss each of these deployments and the deployment choices made along with how they impact reliability. The post Apache Kafka Deployments and Systems Reliability – Part 1 appeared first on Cloudera Blog. There are many ways that Apache Kafka has been deployed in the field.

Kafka

Kafka Systems Utilities Bytes

Practical Guide to Implementing Apache NiFi in Big Data Projects

ProjectPro

JUNE 6, 2025

Content Repository: The Content Repository stores the actual content bytes of a given FlowFile. As we conclude the exploration of Apache NiFi architecture in this blog, we emphasize the significance of hands-on learning in the journey towards mastering the usage of NiFi in big data projects.

Big Data

Big Data Project Healthcare Medical

50 PySpark Interview Questions and Answers For 2025

ProjectPro

JUNE 6, 2025

MEMORY_ONLY_SER: The RDD is stored as serialized Java objects (one byte array per partition). You can refer to GitHub for some of the examples used in this blog. DISK_ONLY: RDD partitions are only saved on disk. But the problem is, where do you start?

Hadoop

Hadoop Metadata Java Datasets
