Privacy Aware Infrastructure (PAI) is a critical and powerful tool for the scalable discovery of relevant data and data flows, which supports privacy controls across Meta's systems. In this blog, we delve into an early stage of PAI implementation: data lineage. Data lineage enables us to efficiently navigate these assets and protect user data.
Juraj also included system monitoring that tracks the capacity of the server he runs the app on (see the monitoring page of the Rides app). And it doesn't end here: Juraj created a systems-design explainer on how he built this project and the technologies used (see the systems-design diagram for the Rides application). The app uses: Node.js …
The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms.
By Rajiv Shringi, Oleksii Tkachuk, and Kartik Sathyanarayanan. In our previous blog post, we introduced Netflix's TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we're excited to present the Distributed Counter Abstraction.
Data engineering is a vital field within data science that focuses on the practical aspects of collecting, storing, and processing large amounts of data. From Confessions of a Data Guy.
Unlocking Data Team Success: Are You Process-Centric or Data-Centric? We’ve identified two distinct types of data teams: process-centric and data-centric. Process-centric data teams focus their energies predominantly on orchestrating and automating workflows. They work in and on these pipelines.
When you hear the term System Hacking, it might bring to mind shadowy figures behind computer screens and high-stakes cyber heists. In this blog, we’ll explore the definition, purpose, process, and methods of prevention related to system hacking, offering a detailed overview to help demystify the concept.
Authors: Bingfeng Xia and Xinyu Liu. At LinkedIn, Apache Beam plays a pivotal role in stream processing infrastructures that process over 4 trillion events daily through more than 3,000 pipelines across multiple production data centers.
Balancing correctness, latency, and cost in unbounded data processing. Google Dataflow is a fully managed data processing service that provides serverless, unified stream and batch data processing. It is the first choice Google recommends for stream processing workloads.
This blog post is the second in a three-part series on migrations. A consolidated data system to accommodate a big(ger) WHOOP When a company experiences exponential growth over a short period, it’s easy for its data foundation to feel a bit like it was built on the fly.
What is real-time stream processing? To access real-time data, organizations are turning to stream processing. There are two main data processing paradigms: batch processing and stream processing.
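To make the contrast concrete, here is a minimal Python sketch (illustrative only, not from the article): batch processing computes over a bounded dataset in one pass, while stream processing updates state incrementally as each event arrives.

```python
# Minimal sketch contrasting the two paradigms (illustrative only).
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

events = [("page_view", 1), ("click", 1), ("page_view", 1)]

def batch_counts(dataset: Iterable[Tuple[str, int]]) -> dict:
    """Batch: process the complete, bounded dataset in one pass."""
    counts: dict = defaultdict(int)
    for event_type, n in dataset:
        counts[event_type] += n
    return dict(counts)

def stream_counts(event_iter: Iterator[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Stream: update state incrementally and emit a result per event."""
    counts: dict = defaultdict(int)
    for event_type, n in event_iter:
        counts[event_type] += n
        yield event_type, counts[event_type]

print(batch_counts(events))            # {'page_view': 2, 'click': 1}
for update in stream_counts(iter(events)):
    print(update)                      # running counts, one per event
```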
Over the past four weeks, I took a break from blogging and LinkedIn to focus on building nao. AI companies are aiming for the moon—AGI—promising it will arrive once OpenAI develops a system capable of generating at least $100 billion in profits. Meanwhile, the AI landscape remains unpredictable.
When integrated effectively, AI and machine learning (ML) models can process data streams at near-zero latency, empowering teams to make split-second decisions. Systems must be capable of handling high-velocity data without bottlenecks; that's where real-time artificial intelligence (AI) can help.
Additionally, multiple copies of the same data locked in proprietary systems contribute to version control issues, redundancies, staleness, and management headaches. This guarantees data quality and automates the laborious, manual processes required to maintain data reliability.
By Cheng Xie, Bryan Shultz, and Christine Xu. In a previous blog post, we described how Netflix uses eBPF to capture TCP flow logs at scale for enhanced network insights. Delays and failures are inevitable in distributed systems, which may delay IP address change events from reaching FlowCollector. With 30 c7i.2xlarge …
In particular, our machine-learning-powered ads ranking systems try to understand users' engagement and conversion intent and to promote the right ads to the right user at the right time. Specifically, such discrepancies unfold into the following scenarios. Bug-free scenario: our ads ranking system is working bug-free.
Astasia Myers: The three components of the unstructured data stack. LLMs and vector databases have significantly improved the ability to process and understand unstructured data. The blog is an excellent summary of the existing unstructured data landscape. A blog from Meta discusses how it designed privacy-preserving storage.
Semih is a researcher and entrepreneur with a background in distributed systems and databases. He pursued his doctoral studies at Stanford University, delving into the complexities of database systems. Don't forget to subscribe to my YouTube channel to get the latest on Unapologetically Technical!
This blog explores the most significant advancements, challenges, and opportunities shaping data engineering in 2025, highlighting the trends and problems companies should be aware of to stay up to date.
Meta’s vast and diverse systems make it particularly challenging to comprehend its structure, meaning, and context at scale. Specifically, we have adopted a “shift-left” approach, integrating data schematization and annotations early in the product development process.
It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile's exposure. In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily.
This blog dives into the remarkable journey of a data team that achieved unparalleled efficiency using DataOps principles and software that transformed their analytics and data teams into a hyper-efficient powerhouse. Starting simply and iterating quickly gave the team time to build foundational processes before adding complexity and scaling.
The blog drops the last edition's recommendation on AI and summarizes the current state of AI adoption in enterprises. As the author puts it elegantly, one of the core challenges of data engineering is that "each step in the process requires specialized domain knowledge."
By Liang Mou (Staff Software Engineer, Logging Platform) and Elizabeth (Vi) Nguyen (Software Engineer I, Logging Platform). In today's data-driven world, businesses need to process and analyze data in real time to make informed decisions. Real-time data processing: change data capture (CDC) enables real-time data processing by capturing changes as they happen.
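As a concrete illustration, here is a minimal sketch of applying a Debezium-style CDC event to an in-memory replica; the event schema and field names are illustrative assumptions, not the platform's actual format.

```python
# Minimal sketch of applying a Debezium-style change event to an in-memory
# replica; field names ("op", "before", "after") are illustrative assumptions.
import json

raw_event = json.dumps({
    "op": "u",                                   # c=create, u=update, d=delete
    "before": {"id": 42, "status": "pending"},
    "after":  {"id": 42, "status": "shipped"},
    "source": {"table": "orders"},
})

def apply_change(event: dict, table: dict) -> None:
    """Apply one change event to a keyed replica of the source table."""
    if event["op"] in ("c", "u"):
        row = event["after"]
        table[row["id"]] = row                   # upsert the new row image
    elif event["op"] == "d":
        table.pop(event["before"]["id"], None)   # remove the deleted row

replica: dict = {}
apply_change(json.loads(raw_event), replica)
print(replica)  # {42: {'id': 42, 'status': 'shipped'}}
```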
The period from April to mid-May was challenging: I found myself facing hiring freezes and canceled processes. Simplify helped me a lot by automatically filling in applications in the most popular applicant tracking systems (ATS). How did you find the interview processes?
Here's how Snowflake Cortex AI and Snowflake ML are accelerating the delivery of trusted AI solutions for the most critical generative AI applications. Natural language processing (NLP) for data pipelines: large language models (LLMs) have transformative potential, but integrating batch inference into pipelines is often cumbersome.
I found the blog to be a fresh take on which skills are in demand, as seen through layoff datasets. DeepSeek continues to impact the data and AI landscape with its recent open-source tools, such as the Fire-Flyer File System (3FS) and smallpond. The blog provides an excellent analysis of smallpond compared to Spark and Daft.
Behind the scenes, a myriad of systems and services are involved in orchestrating the product experience. These backend systems are consistently being evolved and optimized to meet and exceed customer and product expectations. This blog series will examine the tools, techniques, and strategies we have utilized to achieve this goal.
The availability of deep learning frameworks like PyTorch or JAX has revolutionized array processing, regardless of whether one is working on machine learning tasks or other numerical algorithms. Nix is needed to simplify the installation and patching of system dependencies (not all our patches have been merged upstream yet).
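As a small illustration of that point, here is a sketch that uses PyTorch for a plain numerical task (a finite-difference derivative) with no machine learning involved; the specifics are my own example, not from the post.

```python
# Sketch: a deep-learning framework used for plain numerics, no ML involved.
import math
import torch

x = torch.linspace(0.0, math.pi, steps=1000)
y = torch.sin(x)

# Run on GPU when available; the same code works unchanged on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x, y = x.to(device), y.to(device)

# Central finite difference approximating dy/dx = cos(x).
dx = x[1] - x[0]
dy = (y[2:] - y[:-2]) / (2 * dx)
print(dy[:3].cpu())  # values close to cos(0) = 1.0
```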
A little over a year ago, we shared a blog post about our journey to enhance customers meal planning experience with personalized recipe recommendations. We explained how a system that learns from your tastes and habits could solve this issue, ultimately making the daily task of choosing meals both effortless and inspiring.
In this blog, we'll explore building an ETL pipeline with Snowpark by simulating a scenario where commerce data flows through distinct data layers: RAW, SILVER, and GOLDEN. These tables form the foundation for insightful analytics and robust business intelligence. SILVER layer: cleansed and enriched data prepared for analytical processing.
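For flavor, here is a minimal Snowpark sketch of the RAW-to-SILVER step; the connection parameters, table names, and transformations are illustrative assumptions, not the blog's exact pipeline.

```python
# Minimal Snowpark sketch of the RAW -> SILVER step; connection parameters,
# table names, and transformations are illustrative assumptions.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, trim, upper

session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "COMMERCE", "schema": "RAW",
}).create()

# Cleanse and enrich raw orders, then persist them to the SILVER layer.
silver_orders = (
    session.table("RAW.ORDERS")
    .filter(col("ORDER_AMOUNT") > 0)                      # drop malformed rows
    .with_column("COUNTRY", upper(trim(col("COUNTRY"))))  # normalize values
)
silver_orders.write.mode("overwrite").save_as_table("SILVER.ORDERS")
```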
In recent years, while managing Pinterest's EC2 infrastructure, particularly for our essential online storage systems, we identified a significant challenge: the lack of clear insight into EC2's network performance and its direct impact on our applications' reliability and performance. 4xl with up to 12.5 …
I found the blog to be a comprehensive roadmap for data engineering in 2025. It covers nine categories: storage systems, data lake platforms, processing, integration, orchestration, infrastructure, ML/AI, metadata management, and analytics. Let me know in the comments.
However, due to the absence of a control group in these countries, we adopt a synthetic control framework (blog post) to estimate the counterfactual scenario. Each format has a different production process and different patterns of cash spend, which together make up our Content Forecast. As plans change, the cash forecast changes with them.
The Medallion architecture is a design pattern that helps data teams organize data processing and storage into three distinct layers, often called Bronze, Silver, and Gold. This foundational layer is a repository for various data types, from transaction logs and sensor data to social media feeds and system logs.
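A toy pandas sketch of the same flow (column names and rules are invented for illustration): Bronze keeps raw rows as ingested, Silver cleanses and conforms them, and Gold aggregates them for analytics.

```python
# Toy pandas sketch of Bronze -> Silver -> Gold; columns and rules invented.
import pandas as pd

# Bronze: raw records exactly as ingested, including bad rows.
bronze = pd.DataFrame({
    "user_id": [1, 1, 2, None],
    "amount":  [10.0, -5.0, 7.5, 3.0],
})

# Silver: cleansed and conformed (drop malformed rows, enforce types).
silver = (
    bronze.dropna(subset=["user_id"])
          .query("amount > 0")
          .astype({"user_id": int})
)

# Gold: business-level aggregate ready for dashboards and BI.
gold = silver.groupby("user_id", as_index=False)["amount"].sum()
print(gold)
```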
QuantumBlack: Solving data quality for gen AI applications. Unstructured data processing is a top priority for enterprises that want to harness the power of GenAI. It brings challenges in data processing and quality, and what data quality means for unstructured data is a top question for every organization.
The article summarizes the recent macro trends in AI and data engineering, focusing on Vibe coding, human-in-the-loop system design, and rapid simplification of developer tooling. Unlike coding, we never (or rarely) apply a code review process for documentation. The Grab blog delights me since I have tried to do this many times.
In modern data pipelines, especially on cloud data platforms like Snowflake, ingesting data from external systems such as AWS S3 is common. In this blog, we introduce a Snowpark-powered data validation framework that dynamically reads data files (CSV) from an S3 stage.
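A minimal sketch of what such a framework's read-and-validate step could look like in Snowpark; the stage name, file layout, schema, and validation rule are illustrative assumptions.

```python
# Minimal sketch of the read-and-validate step; the stage name, file layout,
# schema, and rule are illustrative assumptions.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col
from snowflake.snowpark.types import StructType, StructField, StringType, DoubleType

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<db>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

schema = StructType([
    StructField("ORDER_ID", StringType()),
    StructField("AMOUNT", DoubleType()),
])

# Read every CSV under the external stage with an explicit schema, then
# flag rows violating a simple rule.
staged = session.read.schema(schema).option("skip_header", 1).csv("@MY_S3_STAGE/orders/")
bad_rows = staged.filter(col("AMOUNT") <= 0)
print(f"rows failing validation: {bad_rows.count()}")
```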
This is where large-scale system migrations come into play. Our previous blog post presented replay traffic testing — a crucial instrument in our toolkit that allows us to implement these transformations with precision and reliability. Sticky Canary is an improvement to the traditional canary process that addresses this limitation.
By building on top of tools that truly understand SQL, it is possible to create systems that are much more capable, resilient, and flexible than we've seen to date. At Level 2, the system produces a complete Logical Plan. That's hard to demonstrate in a blog post, but fortunately there's another, easier option: look at some failing queries.
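As a rough illustration of "tools that truly understand SQL," here is a sketch using sqlglot to parse a query into a typed tree, inspect a clause, and re-emit it for another engine; the library choice is my assumption, since the post doesn't name its tooling.

```python
# Sketch with sqlglot (library choice is an assumption, not from the post).
import sqlglot
from sqlglot import exp

ast = sqlglot.parse_one(
    "SELECT user_id, SUM(amount) AS total FROM orders GROUP BY user_id"
)
print(ast.find(exp.Group))       # the GROUP BY node of the parsed tree
print(ast.sql(dialect="spark"))  # the same query rendered for Spark SQL
```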
Foundation Capital: A System of Agents brings Service-as-Software to life. Software is no longer simply a tool for organizing work; software becomes the worker itself, capable of understanding, executing, and improving upon traditionally human-delivered services. It's good to know about Dapr and restate.dev.
In this blog post, we'll discuss the methods we used to ensure a successful launch, including how we tested the system, the Netflix technologies involved, and the best practices we developed. On realistic test traffic: Netflix traffic ebbs and flows throughout the day in a sinusoidal pattern. Basic with Ads was launched worldwide on November 3rd.
This blog post will highlight the experience I had as a relatively junior developer on a Scott Logic project. The aim was to build a system that would allow them to compare their forecasted Air Quality Index (AQI) values with actual measured AQI data. We use powerful Python libraries such as Pandas and Xarray for this task.
Robinhood and Bitstamp customers can expect the same level of service, security, and reliability, and as we move forward, we are committed to maintaining transparency throughout this process. Robinhood expects the final deal consideration to be approximately $200 million in cash, subject to customary purchase price adjustments.