If you had a continuous deployment system up and running around 2010, you were ahead of the pack; today it would be considered strange if your team did not have one for things like web applications. We dabbled in network engineering, database management, system administration, and hand-rolled C code.
Using Jaeger tracing, I’ve been able to answer an important question that nearly every Apache Kafka® project that I’ve worked on posed: how is data flowing through my distributed system? Before I discuss how Kafka can make a Jaeger tracing solution in a distributed system more robust, I’d like to start by providing some context.
In Part 1, the discussion covers serial and parallel systems reliability as a concept, Kafka clusters with and without co-located Apache ZooKeeper, and Kafka clusters deployed on VMs.
By recording changes as they occur, CDC enables real-time data replication and transfer, minimizing the impact on source systems and ensuring timely consistency across downstream data stores and processing systems that depend on this data.
By Ko-Jen Hsiao, Yesu Feng, and Sudarshan Lamkhede. Motivation: Netflix’s personalized recommender system is a complex system, boasting a variety of specialized machine-learned models, each catering to distinct needs, including Continue Watching and Today’s Top Picks for You. (Refer to our recent overview for more details.)
Borg, Google's large-scale cluster management system, distributes computing resources for the Dremel tasks. Dremel tasks read data from Google's Colossus file systems through the Jupiter network, conduct various SQL operations, and provide results to the client.
In recent years, while managing Pinterest’s EC2 infrastructure, particularly for our essential online storage systems, we identified a significant challenge: the lack of clear insight into EC2’s network performance and its direct impact on our applications’ reliability and performance.
All four players involved in the device were on the call: there was the large European pay TV company (the operator) launching the device, the contractor integrating the set-top-box firmware (the integrator), the system-on-a-chip provider (the chip vendor), and myself (Netflix). Audio data is moved at about 45 bytes/ms.
The first level is a hashed string ID (the primary key), and the second level is a sorted map of a key-value pair of bytes. This flexibility allows our Data Platform to route different use cases to the most suitable storage system based on performance, durability, and consistency needs.
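As a rough illustration of that two-level layout, here is a minimal Python sketch: a primary key maps to a sorted map of byte keys to byte values. The names and helper functions are hypothetical, not the platform’s actual API.

    from collections import OrderedDict

    # Hypothetical sketch of the two-level layout: primary key -> sorted map of
    # byte keys to byte values. Illustrative only, not the platform's actual API.
    store = {}

    def put(primary_key, item_key, value):
        inner = store.setdefault(primary_key, OrderedDict())
        inner[item_key] = value
        # Re-sort the second level by key to mirror the "sorted map" semantics.
        store[primary_key] = OrderedDict(sorted(inner.items()))

    def scan(primary_key):
        # Range reads within one primary key come back in key order.
        return list(store.get(primary_key, OrderedDict()).items())

    put("user:42", b"2024-01-01", b"\x01\x02")
    put("user:42", b"2023-12-31", b"\x03")
    print(scan("user:42"))  # [(b'2023-12-31', b'\x03'), (b'2024-01-01', b'\x01\x02')]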
This new convergence helps Meta and the larger community build data management systems that are unified, more efficient, and composable. Meta’s Data Infrastructure teams have been rethinking how data management systems are designed. An introduction to Velox Velox is the first project in our composable data management system program.
Migrating systems to different cryptosystems always carries some risks such as interoperability issues and security vulnerabilities. In this way, we ensure that our systems remain protected against existing attacks while also providing protection against future threats. However, the key size becomes an issue during TLS resumption.
NGAPI, the API platform for serving all first-party client API requests, requires optimized system performance to ensure a high success rate of requests and maximum efficiency, providing Pinners worldwide with engaging content.
But also, all build systems for larger C/C++ codebases typically compile and link in separate steps already, so this is closer to what we’ll encounter when we want to apply our learnings to a larger build system towards the end of this article. gold is a popular choice in many build systems, including Bazel.
Like a dragon guarding its treasure, each byte stored and each query executed demands its share of gold coins. Join us as we journey through the depths of cost optimization, where every byte is a precious coin. It is also possible to set a maximum for the bytes billed for your query.
Mounting object storage in Netflix’s media processing platform. By Barak Alon (on behalf of Netflix’s Media Cloud Engineering team). MezzFS (short for “Mezzanine File System”) is a tool we’ve developed at Netflix that mounts cloud objects as local files via FUSE. MezzFS can be configured to cache objects on the local disk.
Moreover, they become much harder at Meta because of: Technical debt: Systems have been built over years and have various levels of dependencies and deep integrations with other systems. Some systems serving a smaller scale began showing signs of being insufficient for the increased demands that were placed on them.
Initial Architecture for Goku Short Term Ingestion (Figure 1: old push-based ingestion pipeline into GokuS). At Pinterest, we have a sidecar metrics agent running on every host that logs the application system metrics time series data points (metric name, tag-value pairs, timestamp, and value) into dedicated Kafka topics.
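For illustration, here is a small Python sketch of one such data point serialized to a payload a producer could write to a dedicated Kafka topic; the field names and values are hypothetical, not Pinterest’s actual agent format.

    import json
    import time

    # Hypothetical shape of one time series data point emitted by a metrics agent.
    point = {
        "metric": "host.cpu.util",                        # metric name
        "tags": {"az": "us-east-1a", "host": "web-001"},  # tag -> value pairs
        "timestamp": int(time.time()),                    # seconds since epoch
        "value": 0.73,
    }

    # Serialized payload that a producer could send to a dedicated Kafka topic.
    payload = json.dumps(point).encode("utf-8")
    print(payload)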
For example, Amazon Redshift can load static data to Spark and process it before sending it to downstream systems. In other words, developers and system administrators can focus their efforts on developing more innovative applications instead of learning, implementing, and maintaining different frameworks.
For instance, in both the structs above the largest member is a pointer of size 8 bytes. The total size of Bucket is 16 bytes. Similarly, the total size of DuplicateNode is 24 bytes. However, 12 bytes is not a valid size for Bucket, as the size needs to be a multiple of 8 bytes (the size of the largest member of the struct).
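To make the alignment arithmetic concrete, here is a small Python sketch using ctypes, which follows the platform’s C layout rules. The field names are hypothetical stand-ins for the structs discussed above; on a typical 64-bit platform the pointer member forces 8-byte alignment, so 8 + 4 = 12 bytes of members are padded up to 16.

    import ctypes

    # Hypothetical stand-in for the Bucket struct: one pointer plus a 4-byte field.
    class Bucket(ctypes.Structure):
        _fields_ = [
            ("next", ctypes.c_void_p),   # 8 bytes on a 64-bit platform
            ("hash", ctypes.c_uint32),   # 4 bytes, followed by 4 bytes of padding
        ]

    # Hypothetical stand-in for DuplicateNode: two pointers plus a 4-byte field.
    class DuplicateNode(ctypes.Structure):
        _fields_ = [
            ("next", ctypes.c_void_p),   # 8 bytes
            ("data", ctypes.c_void_p),   # 8 bytes
            ("hash", ctypes.c_uint32),   # 4 bytes, padded to 8
        ]

    print(ctypes.sizeof(Bucket))         # 16, not 12: rounded up to the 8-byte alignment
    print(ctypes.sizeof(DuplicateNode))  # 24
    print(ctypes.alignment(Bucket))      # 8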
The goal is to have the compressed image look as close to the original as possible while reducing the number of bytes required. Shown below is one original source image from the Kodak dataset and the corresponding result with JPEG 444 @ 20,429 bytes and with AVIF 444 @ 19,788 bytes.
Snowflake provides data warehousing, processing, and analytical solutions that are significantly quicker, simpler to use, and more adaptable than traditional systems. Snowflake is not based on existing database systems or big data software platforms like Hadoop. Snowflake is a data warehousing platform that runs on the cloud.
It contains customizations on top of AOSP to provide the VR experience on Quest hardware, including firmware, kernel modifications, device drivers, system services, SELinux policies, and applications. As an Android variant, VROS has many of the same security features as other modern Android systems.
What makes it different from traditional RAG systems? The system intelligently manages various data types within the context window, ensuring coherent relationships between them. But how does it actually work?
The UDP header is fixed at 8 bytes and contains a source port, a destination port, the checksum used to verify packet integrity by the receiving device, and the length of the packet, which equates to the sum of the payload and header.
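For illustration, a minimal Python sketch that unpacks those four 16-bit fields from the first 8 bytes of a UDP datagram; this is a hypothetical example, not the article’s own code.

    import struct

    def parse_udp_header(datagram: bytes) -> dict:
        # The 8-byte UDP header holds four unsigned 16-bit big-endian fields.
        src_port, dst_port, length, checksum = struct.unpack("!HHHH", datagram[:8])
        return {
            "src_port": src_port,
            "dst_port": dst_port,
            "length": length,      # header (8 bytes) + payload length
            "checksum": checksum,
        }

    # Example: a header for a 13-byte datagram (8-byte header + 5-byte payload).
    datagram = struct.pack("!HHHH", 5000, 9999, 13, 0) + b"hello"
    print(parse_udp_header(datagram))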
Lastly, the packager kicks in, adding a system layer to the asset, making it ready to be consumed by the clients. The index file keeps track of the physical location (URL) of each chunk and also keeps track of the physical location (URL + byte offset + size) of each video frame to facilitate downstream processing.
Google offers "on-demand pricing," where users are charged for each byte of requested and processed data; the first 1 TB of data per month is free. Alternatively, Redshift might give you the flexibility you need if you have system experts who can modify the architecture to your demands. The hourly rate starts at $0.25
Last time I wrote about how Python’s type system and syntax are now flexible enough to represent and utilise algebraic data types ergonomically (for example, reading and writing to a byte stream). Ivory towers are lonely places, after all.
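The excerpt’s decode classmethod fragment might look something like the following self-contained sketch, assuming a dataclass-based type and a simple fixed byte layout; it is illustrative, not the article’s actual code.

    import struct
    from dataclasses import dataclass
    from typing import Self  # Python 3.11+

    @dataclass(frozen=True)
    class Point:
        x: int
        y: int

        @classmethod
        def decode(cls, data: bytes) -> Self:
            # Hypothetical layout: two big-endian signed 32-bit integers.
            x, y = struct.unpack("!ii", data[:8])
            return cls(x, y)

        def encode(self) -> bytes:
            return struct.pack("!ii", self.x, self.y)

    p = Point.decode(Point(3, -7).encode())
    print(p)  # Point(x=3, y=-7)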
Apache Kafka and Flume are distributed data systems, but there is a certain difference between Kafka and Flume in terms of features, scalability, etc. For a system to support multi-tenancy, the level of logical isolation must be complete, but the level of physical integration may vary.
The Unix File System is a framework for organizing and storing large amounts of data in a manageable manner. It includes components like files: a group of connected data that can be conceptualized as a stream of bytes (or characters). In the Unix File System, a file is also the smallest storage unit.
This decision impacts disk performance, resource allocation, and overall system efficiency. String & binary Snowflake data types (VARCHAR, STRING, TEXT): a variable-length character string of at most 16,777,216 bytes that holds Unicode (UTF-8) characters.
Hadoop Datasets: These are created from external data sources like the Hadoop Distributed File System (HDFS), HBase, or any storage system supported by Hadoop. The following methods should be defined or inherited for a custom profiler: profile (identical to the system profile) and dump (saves all of the profiles to a path).
Want to process petabyte-scale data with real-time streaming ingestion rates, build 10x faster data pipelines with 99.999% reliability, and see a 20x improvement in query performance compared to traditional data lakes? Enter the world of Databricks Delta Lake. This results in a fast and scalable metadata handling system.
Let’s first focus on the handler of the open syscall on /proc/<pid>/smaps_rollup. Following through the single_open function, we find that it uses the show_smaps_rollup function for the show operation, which translates to the read system call on the file. Next, we look at the show_smaps_rollup implementation.
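From user space, the chain described above is exercised simply by opening and reading the file: the open() goes through single_open, and each read is served by show_smaps_rollup. A minimal Python sketch, assuming a Linux host and a process we are allowed to inspect:

    import os

    # Reading /proc/<pid>/smaps_rollup: open() goes through single_open in the
    # kernel, and each read() is served by show_smaps_rollup.
    pid = os.getpid()  # use our own PID so the example is always readable
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            # Lines look like "Rss:   123456 kB"; print a few memory counters.
            if line.startswith(("Rss:", "Pss:", "Swap:")):
                print(line.rstrip())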
Designed for processing large data sets, Spark has been a popular solution, yet it is one that can be challenging to manage, especially for users who are new to big data processing or distributed systems. Batch Processing Pipelines : Large volumes of data can be processed on schedule using the tool.
You can now programmatically create NiFi reporting tasks to make relevant metrics available to various third-party monitoring systems. By using component_name and “Hello World Prometheus,” we’re monitoring the bytes received aggregated by the entire process group and therefore the flow. Select the nifi_amount_bytes_received metric.
Storage traffic: Includes traffic from microservices to stateful systems such as Aurora PostgreSQL, CockroachDB, Redis, and Kafka. In our production system, we observe that 10% of traffic is sent across AZs with this topologySpreadConstraints policy:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
This powerful platform addresses the challenges of data ingestion, distribution, and transformation across diverse systems. NiFi supports connectivity with many systems, including databases, cloud services, and IoT devices, while emphasizing data lineage, security, and extensibility. What is Apache NiFi Used For?
2.5 quintillion bytes (or 2.5 exabytes) of information is being generated every day. Syncing across data sources: once you import data into big data platforms, you may also realize that data copies migrated from a wide range of sources at different rates and schedules can rapidly get out of synchronization with the originating system.
Bytes, Decimals, Numerics, and oh my. This is useful to get a dump of the data, but it is very batchy and not always appropriate for actually integrating source database systems into the streaming world of Kafka. So our DECIMAL becomes a seemingly gibberish bytes value.
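As a hedged illustration of turning that bytes value back into a number: Connect’s Decimal logical type typically carries the unscaled value as big-endian two’s-complement bytes (base64-encoded in JSON), with the scale recorded in the schema. A small Python sketch under those assumptions:

    import base64
    from decimal import Decimal

    def decode_connect_decimal(b64_value: str, scale: int) -> Decimal:
        # Assumes the Connect Decimal encoding: base64 of the big-endian
        # two's-complement unscaled integer, with the scale taken from the schema.
        raw = base64.b64decode(b64_value)
        unscaled = int.from_bytes(raw, byteorder="big", signed=True)
        return Decimal(unscaled).scaleb(-scale)

    # Example: unscaled 12345 with scale 2 represents 123.45.
    encoded = base64.b64encode((12345).to_bytes(2, "big", signed=True)).decode()
    print(decode_connect_decimal(encoded, scale=2))  # 123.45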
As per statistics, we produce 2.5 quintillion bytes of data per day. A data warehouse is a data management system that facilitates and supports business intelligence. It is time-variant, meaning it has a timestamp to track the data coming into the system. The problem lies in the real-world data.
Note: Contrary to what the name may suggest, this system is not built as a general-purpose time series database. Those use cases are well served by the Netflix Atlas telemetry system. Effectively managing this data at scale to extract valuable insights is crucial for ensuring optimal user experiences and system reliability.
Metrics, logs, and traces provide vital information about our service ecosystem. But these signals almost entirely rely on application-level instrumentation, which can leave gaps or conflicting semantics across different systems. We also have an unmarshalling function to convert the raw bytes from the kernel into our structure.
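For flavor, here is what such an unmarshalling function might look like in Python, assuming a hypothetical fixed-size event record; the real structure and field layout depend on the kernel-side code.

    import struct
    from dataclasses import dataclass

    # Hypothetical event layout: pid (u32), padding (u32), timestamp_ns (u64), bytes_sent (u64).
    EVENT_FORMAT = "<IIQQ"
    EVENT_SIZE = struct.calcsize(EVENT_FORMAT)  # 24 bytes

    @dataclass
    class FlowEvent:
        pid: int
        timestamp_ns: int
        bytes_sent: int

    def unmarshal(raw: bytes) -> FlowEvent:
        # Convert the raw bytes emitted by the kernel into our structure.
        pid, _pad, ts, sent = struct.unpack(EVENT_FORMAT, raw[:EVENT_SIZE])
        return FlowEvent(pid=pid, timestamp_ns=ts, bytes_sent=sent)

    sample = struct.pack(EVENT_FORMAT, 4242, 0, 1_700_000_000_000_000_000, 1500)
    print(unmarshal(sample))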
It makes geospatial data searchable and efficiently retrievable so that the system can provide the best experience to its users. To make it work properly, a good understanding of both the algorithm and the system requirements is required (e.g., GB to 55 MB and 7M to 260k). However, it has a cost that should be well considered in advance.
The output of an encoder is a sequence of bytes, called an elementary stream, which can only be parsed with some understanding of the elementary stream syntax. The Media Systems team at Netflix actively contributes to the development, the maintenance, and the adoption of ISOBMFF.