Bytes and Systems - Data Engineering Digest

The Roots of Today's Modern Backend Engineering Practices

The Pragmatic Engineer

NOVEMBER 21, 2023

If you had a continuous deployment system up and running around 2010, you were ahead of the pack, but today it’s considered strange if your team does not have one for things like web applications. We dabbled in network engineering, database management, and system administration. ... and hand-rolled C-code.

Engineering

Engineering Bytes Cloud Computing AWS

Fault Tolerance in Distributed Systems: Tracing with Apache Kafka and Jaeger

Confluent

JULY 24, 2019

Using Jaeger tracing, I’ve been able to answer an important question that nearly every Apache Kafka ® project that I’ve worked on posed: how is data flowing through my distributed system? Before I discuss how Kafka can make a Jaeger tracing solution in a distributed system more robust, I’d like to start by providing some context.

Kafka

Kafka Systems Bytes Project

Apache Kafka Deployments and Systems Reliability – Part 1

Cloudera

SEPTEMBER 20, 2021

Part 1 covers serial and parallel systems reliability as a concept, Kafka clusters with and without co-located Apache ZooKeeper, and Kafka clusters deployed on VMs.

Kafka

Kafka Systems Utilities Bytes

Life of a Netflix Partner Engineer - The case of extra 40 ms

Netflix Tech

DECEMBER 14, 2020

All four players involved in the device were on the call: there was the large European pay TV company (the operator) launching the device, the contractor integrating the set-top-box firmware (the integrator), the system-on-a-chip provider (the chip vendor), and myself (Netflix). Audio data is moved at about 45 bytes/ms.

Bytes

Bytes Engineering Coding Manufacturing

Handling Network Throttling with AWS EC2 at Pinterest

Pinterest Engineering

APRIL 7, 2025

In recent years, while managing Pinterest’s EC2 infrastructure, particularly for our essential online storage systems, we identified a significant challenge: the lack of clear insights into EC2’s network performance and its direct impact on our applications’ reliability and performance.

AWS

AWS Bytes Database Data Ingestion

Foundation Model for Personalized Recommendation

Netflix Tech

MARCH 28, 2025

By Ko-Jen Hsiao, Yesu Feng and Sudarshan Lamkhede. Motivation: Netflix’s personalized recommender system is a complex system, boasting a variety of specialized machine-learned models, each catering to distinct needs including Continue Watching and Today’s Top Picks for You. (Refer to our recent overview for more details.)

Metadata

Metadata Bytes Data Mining Entertainment

Post-quantum readiness for TLS at Meta

Engineering at Meta

MAY 22, 2024

Migrating systems to different cryptosystems always carries some risks such as interoperability issues and security vulnerabilities. In this way, we ensure that our systems remain protected against existing attacks while also providing protection against future threats. However, the key size becomes an issue during TLS resumption.

Bytes

Bytes Algorithm Coding Utilities

Introducing Netflix’s Key-Value Data Abstraction Layer

Netflix Tech

SEPTEMBER 18, 2024

The first level is a hashed string ID (the primary key), and the second level is a sorted map of a key-value pair of bytes. This flexibility allows our Data Platform to route different use cases to the most suitable storage system based on performance, durability, and consistency needs.

Bytes

Bytes Metadata Database Data
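The two-level shape described in the Netflix excerpt above (a string ID mapping to a sorted map of byte keys and byte values) can be sketched in a few lines. This is only an in-memory toy under stated assumptions: the class name `TwoLevelKV` and the methods `put` and `scan` are hypothetical and are not Netflix's API, and a real service would route to a storage engine rather than a Python dict.

```python
import bisect
from collections import defaultdict


class TwoLevelKV:
    """Toy model: record ID -> sorted map of byte keys to byte values."""

    def __init__(self):
        # First level: a hash map keyed by the string ID.
        # Second level: a list of (key, value) pairs kept sorted by key.
        self._records = defaultdict(list)

    def put(self, record_id: str, key: bytes, value: bytes) -> None:
        items = self._records[record_id]
        # (key,) sorts before any (key, value) pair with the same key,
        # so bisect_left lands on the first item whose key is >= key.
        pos = bisect.bisect_left(items, (key,))
        if pos < len(items) and items[pos][0] == key:
            items[pos] = (key, value)  # overwrite an existing key
        else:
            items.insert(pos, (key, value))

    def scan(self, record_id: str, start: bytes = b"", limit=None):
        """Return items for one ID in key order, beginning at `start`."""
        items = self._records.get(record_id, [])
        pos = bisect.bisect_left(items, (start,))
        end = len(items) if limit is None else pos + limit
        return items[pos:end]


kv = TwoLevelKV()
kv.put("user-1", b"b", b"2")
kv.put("user-1", b"a", b"1")
print(kv.scan("user-1"))
```

Keeping the second level sorted is what makes range scans within a single ID cheap, at the cost of a linear-time insert in this toy version.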

A Definitive Guide to Using BigQuery Efficiently

Towards Data Science

MARCH 5, 2024

Like a dragon guarding its treasure, each byte stored and each query executed demands its share of gold coins. Join as we journey through the depths of cost optimization, where every byte is a precious coin. It is also possible to set a maximum for the bytes billed for your query.

Bytes

Bytes Google Cloud Cloud Storage Utilities

Separating debug symbols from executables

Tweag

NOVEMBER 22, 2023

But also, all build systems for larger C/C++ codebases typically compile and link in separate steps already, so this is closer to what we’ll encounter when we want to apply our learnings to a larger build system towards the end of this article. gold is a popular choice in many build systems, including Bazel. o hello.with-g.2

Bytes

Bytes Coding Building Project

Aligning Velox and Apache Arrow: Towards composable data management

Engineering at Meta

FEBRUARY 20, 2024

This new convergence helps Meta and the larger community build data management systems that are unified, more efficient, and composable. Meta’s Data Infrastructure teams have been rethinking how data management systems are designed. An introduction to Velox Velox is the first project in our composable data management system program.

Data Management

Data Management Bytes Management Datasets

MezzFS - Mounting object storage in Netflix’s media processing platform

Netflix Tech

MARCH 6, 2019

By Barak Alon (on behalf of Netflix’s Media Cloud Engineering team). MezzFS (short for “Mezzanine File System”) is a tool we’ve developed at Netflix that mounts cloud objects as local files via FUSE. MezzFS can be configured to cache objects on the local disk. Regional caching - Netflix

Media

Media Bytes Process Accessibility

Memory Optimizations for Analytic Queries in Cloudera Data Warehouse

Cloudera

MARCH 2, 2022

For instance, in both the structs above the largest member is a pointer of size 8 bytes. Total size of the Bucket is 16 bytes. Similarly, the total size of DuplicateNode is 24 bytes. However, 12 bytes is not a valid size of Bucket, as it needs to be a multiple of 8 bytes (the size of the largest member of the struct).

Data Warehouse

Data Warehouse Bytes Data Business Intelligence
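The padding rule in the Cloudera excerpt above (a 12-byte struct is rounded up to 16 because its size must be a multiple of 8) can be checked with Python's `ctypes`. This is a rough sketch: `SlimBucket` and its two fields are hypothetical stand-ins, not the actual Bucket definition from the article, and the 12-to-16 result assumes a 64-bit platform where pointers are 8 bytes.

```python
import ctypes


def round_up(size: int, alignment: int) -> int:
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment


class SlimBucket(ctypes.Structure):
    # Hypothetical layout: a 4-byte hash followed by one pointer.
    _fields_ = [
        ("hash", ctypes.c_uint32),
        ("data", ctypes.c_void_p),
    ]


# Sum of the member sizes, ignoring any padding the compiler would add.
raw = ctypes.sizeof(ctypes.c_uint32) + ctypes.sizeof(ctypes.c_void_p)

print("sum of member sizes:", raw)
print("struct alignment:   ", ctypes.alignment(SlimBucket))
print("actual sizeof:      ", ctypes.sizeof(SlimBucket))
print("round_up(12, 8):    ", round_up(12, 8))
```

On a typical 64-bit build the member sizes add up to 12 while `sizeof` reports 16, which is the gap the excerpt describes.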

Tulip: Modernizing Meta’s data platform

Engineering at Meta

JANUARY 26, 2023

Moreover, they become much harder at Meta because of: Technical debt: Systems have been built over years and have various levels of dependencies and deep integrations with other systems. Some systems serving a smaller scale began showing signs of being insufficient for the increased demands that were placed on them.

Bytes

Bytes Data Engineering Coding

Improving Efficiency Of Goku Time Series Database at Pinterest (Part 1)

Pinterest Engineering

NOVEMBER 22, 2023

Initial Architecture For Goku Short Term Ingestion. Figure 1: Old push-based ingestion pipeline into GokuS. At Pinterest, we have a sidecar metrics agent running on every host that logs the application system metrics time series data points (metric name, tag value pairs, timestamp and value) into dedicated Kafka topics.

Database

Database Bytes Kafka Architecture

AVIF for Next-Generation Image Coding

Netflix Tech

FEBRUARY 13, 2020

The goal is to have the compressed image look as close to the original as possible while reducing the number of bytes required. Shown below is one original source image from the Kodak dataset and the corresponding result with JPEG 444 @ 20,429 bytes and with AVIF 444 @ 19,788 bytes.

Coding

Coding Bytes Datasets Media

A guide to UDP in Scala with FS2

Rock the JVM

DECEMBER 17, 2023

The UDP header is fixed at 8 bytes and contains a source port, destination port, the checksum used to verify packet integrity by the receiving device, and the length of the packet, which equates to the sum of the payload and header. flip() println(s"[server] I've received ${content.limit()} bytes " + s"from ${clientAddress.toString()}!

Scala

Scala Bytes Java Coding
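The 8-byte UDP header described in the excerpt above can be packed and parsed with Python's standard `struct` module. This is a minimal sketch rather than anything from the FS2 article: the function names are hypothetical, the wire order (source port, destination port, length, checksum) follows RFC 768, and the checksum is left at 0, which over IPv4 means "not computed".

```python
import struct

# Four unsigned 16-bit fields in network (big-endian) byte order = 8 bytes.
UDP_HEADER = struct.Struct("!HHHH")


def build_udp_datagram(src_port: int, dst_port: int, payload: bytes,
                       checksum: int = 0) -> bytes:
    """Prefix payload with a UDP header; length covers header plus payload."""
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload


def parse_udp_header(datagram: bytes) -> dict:
    """Decode the first 8 bytes of a datagram into named header fields."""
    src, dst, length, checksum = UDP_HEADER.unpack_from(datagram)
    return {"src_port": src, "dst_port": dst,
            "length": length, "checksum": checksum}


datagram = build_udp_datagram(5555, 53, b"hello")
print(UDP_HEADER.size, parse_udp_header(datagram))
```

In practice the operating system builds this header when you send through a UDP socket; packing it by hand is only useful for seeing where each field sits.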

Meta Quest 2: Defense through offense

Engineering at Meta

SEPTEMBER 12, 2023

It contains customizations on top of AOSP to provide the VR experience on Quest hardware, including firmware, kernel modifications, device drivers, system services, SELinux policies, and applications. As an Android variant, VROS has many of the same security features as other modern Android systems.

Bytes

Bytes Coding Programming Process

Functional Python, Part II: Dial M for Monoid

Tweag

JANUARY 18, 2023

Last time I wrote about how Python’s type system and syntax is now flexible enough to represent and utilise algebraic data types ergonomically. reading and writing to a byte stream). @classmethod def decode(cls, data: bytes) -> Self: # Implementation goes here. Ivory towers are lonely places, after all.

Python

Python Bytes Software Engineering Software Engineer

Netflix Cloud Packaging in the Terabyte Era

Netflix Tech

SEPTEMBER 24, 2021

Lastly, the packager kicks in, adding a system layer to the asset, making it ready to be consumed by the clients. The index file keeps track of the physical location (URL) of each chunk and also keeps track of the physical location (URL + byte offset + size) of each video frame to facilitate downstream processing.

Cloud

Cloud Bytes Cloud Storage Media

A Brief Overview of the Unix File System

U-Next

NOVEMBER 24, 2022

The Unix File System is a framework for organizing and storing large amounts of data in a manageable manner. It includes components like files, each a group of connected data that can be conceptualized as a stream of bytes (or characters). In the Unix File System, a file is also the smallest storage unit.

Systems

Systems Bytes Media Utilities

Modern Data Engineering: Free Spark to Snowpark Migration Accelerator for Faster, Cheaper Pipelines in Snowflake

Snowflake

JUNE 20, 2024

Designed for processing large data sets, Spark has been a popular solution, yet it is one that can be challenging to manage, especially for users who are new to big data processing or distributed systems. Batch Processing Pipelines : Large volumes of data can be processed on schedule using the tool.

Data Engineering

Data Engineering Data Engineer Scala Engineering

Staying in the Zone: How DoorDash used a service mesh to manage data transfer, reducing hops and cloud spend

DoorDash Engineering

JANUARY 16, 2024

Storage traffic: Includes traffic from microservices to stateful systems such as Aurora PostgreSQL, CockroachDB, Redis, and Kafka. In our production system, we observe 10% of traffic that is sent across AZs with this topologySpreadConstraints policy. topologySpreadConstraints: - maxSkew: 1 topologyKey: topology.kubernetes.io/zone

Bytes

Bytes Cloud Management PostgreSQL

5 Big Data Challenges in 2024

Knowledge Hut

MARCH 7, 2024

... quintillion bytes (or 2.5 exabytes) of information is being generated every day. Syncing Across Data Sources: Once you import data into Big Data platforms you may also realize that data copies migrated from a wide range of sources at different rates and schedules can rapidly get out of synchronization with the originating system.

Big Data

Big Data Bytes Data Governance Raw Data

Monitoring Cloudera DataFlow Deployments With Prometheus and Grafana

Cloudera

JANUARY 17, 2024

you can now programmatically create NiFi reporting tasks to make relevant metrics available to various third party monitoring systems. By using component_name and “Hello World Prometheus,” we’re monitoring the bytes received aggregated by the entire process group and therefore the flow. Select the nifi_amount_bytes_received metric.

Bytes

Bytes Architecture Building Designing

Kafka Connect Deep Dive – JDBC Source Connector

Confluent

FEBRUARY 12, 2019

Introduction. This is useful to get a dump of the data, but very batchy and not always so appropriate for actually integrating source database systems into the streaming world of Kafka. Bytes, Decimals, Numerics and oh my. So our DECIMAL becomes a seemingly gibberish bytes value.

Kafka

Kafka MySQL Bytes Java

BPFAgent: eBPF for Monitoring at DoorDash

DoorDash Engineering

AUGUST 15, 2023

But these signals almost entirely rely on application-level instrumentation, which can leave gaps or conflicting semantics across different systems. We also have an unmarshalling function to convert the raw bytes from the kernel into our structure. Metrics, logs, and traces provide vital information about our service ecosystem.

Bytes

Bytes PostgreSQL Coding Database

Geospatial Index 102

Towards Data Science

APRIL 11, 2023

It allows geospatial data to be searched and retrieved efficiently so that the system can provide the best experience to its users. To make it work properly, a good understanding of both the algorithm and the system requirements is required. GB to 55 MB and 7M to 260k). However, it has a cost that should be well considered in advance.

Bytes

Bytes Google Cloud Datasets Programming Language

Introducing Netflix TimeSeries Data Abstraction Layer

Netflix Tech

OCTOBER 8, 2024

Note: Contrary to what the name may suggest, this system is not built as a general-purpose time series database. Those use cases are well served by the Netflix Atlas telemetry system. Effectively managing this data at scale to extract valuable insights is crucial for ensuring optimal user experiences and system reliability.

Bytes

Bytes Datasets Metadata Data

KSQL: What’s New in 5.2

Confluent

APRIL 3, 2019

END AS DEPARTMENT, PRODUCT FROM PRODUCTS; ksql> DESCRIBE PRODUCTS_ENRICHED; Name : PRODUCTS_ENRICHED Field | Type - ROWTIME | BIGINT (system) ROWKEY | VARCHAR(STRING) (system) SKU | VARCHAR(STRING) DEPARTMENT | VARCHAR(STRING) PRODUCT | VARCHAR(STRING). WHEN SKU LIKE 'F%' THEN 'Food'. ELSE 'Unknown'. 5476133448908187392.KsqlTopic.source.deserializer","time":1552564841423,"message":{"type":0,"deserializationError":{"errorMessage":"Converting

Food

Food Kafka Bytes Data Cleanse

Streaming Data from the Universe with Apache Kafka

Confluent

JUNE 13, 2019

Observational astronomers study many different types of objects, from asteroids in our own solar system to galaxies that are billions of lightyears away. The technology underlying the ZTF system should be a prototype that reliably scales to LSST needs. Alert data pipeline and system design. Astronomy in real time.

Kafka

Kafka Bytes Python Data Pipeline

Packaging award-winning shows with award-winning technology

Netflix Tech

FEBRUARY 25, 2021

The output of an encoder is a sequence of bytes, called an elementary stream, which can only be parsed with some understanding of the elementary stream syntax. The Media Systems team at Netflix actively contributes to the development, the maintenance, and the adoption of ISOBMFF.

Technology

Technology Bytes Media Entertainment

Deploying Kafka Streams and KSQL with Gradle – Part 2: Managing KSQL Implementations

Confluent

MAY 29, 2019

zip Zip file size: 3593 bytes, number of entries: 9 drwxr-xr-x 2.0 unx 2312 b- defN 19-Feb-13 13:05 ksql-script.sql 9 files, 5502 bytes uncompressed, 2397 bytes compressed: 56.4%. . ==> zipinfo ksql/build/distributions/ksql-pipeline-1.0.0.zip zip Zip file size: 3593 bytes, number of entries: 9 drwxr-xr-x 2.0

Kafka

Kafka Management Bytes SQL

Customer Data Platform – An Expert Guide

U-Next

MARCH 7, 2023

The customer experience and marketing teams primarily use this to accelerate the acquisition of every byte of customer data from appropriate channels, devices, and platforms and its transformation into a unified customer profile. Companies frequently use CDP Software as the sole source of consumer information.

Bytes

Bytes Media Data Data Collection

Apache Ozone Fault Injection Framework

Cloudera

AUGUST 14, 2020

One of the key challenges of building an enterprise-class robust scalable storage system is to validate the system under duress and failing system components. This includes, but is not limited to: failed networks, failed or failing disks, arbitrary delays in the network or IO path, network partitions, and unresponsive systems.

Hadoop

Hadoop Bytes Metadata Programming Language

How to Navigate the Costs of Legacy SIEMS with Snowflake

Snowflake

APRIL 18, 2024

Data retention: Many legacy SIEMS delete activity logs, transaction records, and other details from their systems after a few days, weeks or months. In the cloud, computing can be measured in various ways, like bytes scanned or CPU cycles. With Snowflake, security teams don’t have to work around these data retention windows.

Data Lake

Data Lake Data Ingestion Bytes Cloud Computing

Getting Started with Rust and Apache Kafka

Confluent

OCTOBER 24, 2019

Alternatively, you can get money into the system by simply depositing money with the push of a button. The events are handled by the command handler, which is the part of the system that has been ported to Rust. Make sure it is indeed an ID and that the Value matches the expected type Fixed , with 16 bytes. The bank application.

Kafka

Kafka Java Banking Bytes

A Glimpse into the Redesigned Goku-Ingestor vNext at Pinterest

Pinterest Engineering

NOVEMBER 28, 2023

Pinterest metrics system Goku-Ingestor has been running and evolving for close to a decade. Reliability Issues In the initial months of 2023, certain problems arose as a result of Goku-Ingestor’s performance, leading to some instances where data loss occurred within the metrics system for a brief duration of time.

Kafka

Kafka Bytes Architecture Software Engineer

Solving Espresso’s scalability and performance challenges to support our member base

LinkedIn Engineering

SEPTEMBER 7, 2023

In this post, we will explain how we solved these challenges and improved system performance. Espresso System Overview Figure 1 is a high-level overview of the Espresso ecosystem, which includes the online operation section of Espresso (the main focus of this blog post). This delay can significantly affect the system's response time.

Bytes

Bytes Transportation Utilities Java

Why are database columns 191 characters?

Grouparoo

MAY 13, 2021

MySQL wanted to ensure that its index files could fit within a single page block on older file systems. However, the most popular text encoding (Latin1 or utf8) on the most popular MySQL database engine (innodb) assumed that 3 bytes was enough to store every character, and once utf8mb4 came along with characters like ...

Database

Database Bytes MySQL Database-centric

Two-Factor Authentication in Scala with Http4s

Rock the JVM

JULY 26, 2023

When a user tries to perform a transaction or action on a system, he or she will present some credentials like an email or a phone number. The system will send a temporary secure PIN-code or token to the user by email or phone number valid for only that session. generate ( 512 ))) } private val secret : Array [ Byte ] = user.

Scala

Scala Java Bytes Algorithm

Netflix Drive

Netflix Tech

MAY 5, 2021

Netflix Drive relies on a data store that will be the persistent storage layer for assets, and a metadata store which will provide a relevant mapping from the file system hierarchy to the data store entities. The major pieces, as shown in Fig. 2, are the file system interface, the API interface, and the metadata and data stores.

Metadata

Metadata Bytes Media Cloud Storage

Streaming Big Data Files from Cloud Storage

Towards Data Science

JANUARY 26, 2023

AWS, for example, offers services such as Amazon FSx and Amazon EFS for mirroring your data in a high-performance file system in the cloud. Here we show how to download specific byte-ranges of the file using the Boto3 get_object data streaming API. For these use cases, downloading the entire file can be extremely wasteful.

Cloud Storage

Cloud Storage Big Data Cloud AWS
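The byte-range download mentioned in the excerpt above can be sketched with Boto3's `get_object`, which accepts an HTTP `Range` value. This is a hedged illustration, not the article's code: the helper names `byte_range_header` and `read_range` are hypothetical, the bucket and key in the usage comment are placeholders, and running `read_range` assumes Boto3 is installed and AWS credentials are configured.

```python
def byte_range_header(offset: int, length: int) -> str:
    """HTTP Range value for `length` bytes starting at `offset`.

    The end position in a Range header is inclusive, hence the -1.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


def read_range(bucket: str, key: str, offset: int, length: int) -> bytes:
    """Fetch only the requested slice of an S3 object."""
    # Imported here so the helper above can be used without Boto3 installed.
    import boto3

    s3 = boto3.client("s3")
    response = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=byte_range_header(offset, length),
    )
    return response["Body"].read()


print(byte_range_header(0, 100))
# Hypothetical usage (placeholder names, needs credentials):
# first_kb = read_range("my-bucket", "path/to/large-file.bin", 0, 1024)
```

Requesting only the slice you need avoids transferring the whole object, which is the waste the excerpt warns about.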

Scaling Salt for Remote Execution to support LinkedIn Infra growth

LinkedIn Engineering

APRIL 18, 2023

Minion (an agent on host) sees jobs and results by subscribing to events published on the event bus by the master service. It uses ZMQ (ZeroMQ) to achieve high-speed, asynchronous communication between connected systems. Targeted minions execute the job on the host and return to master. login is modified to rely on mTLS at Nginx level.

MySQL

MySQL Python Bytes Kafka
