Blog and Systems - Data Engineering Digest

Establishing a Large Scale Learned Retrieval System at Pinterest

Pinterest Engineering

JANUARY 31, 2025

Modern large-scale recommendation systems usually include multiple stages: retrieval selects candidates from a pool of billions of items, and ranking predicts which items a user is likely to engage with from the trimmed candidate set retrieved in earlier stages [2]. Figure: General multi-stage recommendation system design at Pinterest.

Systems

Systems Metadata Machine Learning Architecture

An educational side project

The Pragmatic Engineer

JUNE 1, 2023

Juraj included system monitoring parts, which monitor the capacity of the server he runs the app on. (Figure: The monitoring page on the Rides app.) And it doesn’t end here. Juraj created a systems design explainer on how he built this project, and the technologies used. (Figure: The systems design diagram for the Rides application.) The app uses: Node.js

Education

Education Project PostgreSQL Software Engineer

Why Open Table Format Architecture is Essential for Modern Data Systems

phData: Data Engineering

NOVEMBER 8, 2024

The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. In this blog, we will discuss: What is the Open Table format (OTF)? These systems are built on open standards and offer immense analytical and transactional processing flexibility.

Architecture

Architecture Systems Data Lake Google Cloud

How Meta discovers data flows via lineage at scale

Engineering at Meta

JANUARY 22, 2025

It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta’s systems. In this blog, we will delve into an early stage in PAI implementation: data lineage. Data lineage enables us to efficiently navigate these assets and protect user data.

Data Warehouse

Data Warehouse SQL Programming Language Data

Build Compound AI Systems Faster with Databricks Mosaic AI

databricks

OCTOBER 1, 2024

Many of our customers are shifting from monolithic prompts with general-purpose models to specialized compound AI systems to achieve the quality needed for.

Systems

Systems Building Data Science Engineering

Building a Question-Answering System Using RAG

WeCloudData

APRIL 9, 2025

The ability to extract information from vast amounts of text has made question-answering (QA) systems essential in the modern era of AI-driven apps. RAG-based question-answering systems use large language models to generate human-like responses to user queries.

Systems

Systems Building IT Data Science

Netflix’s Distributed Counter Abstraction

Netflix Tech

NOVEMBER 12, 2024

By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. Introduction: In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.

Datasets

Datasets Computer Science Systems Kafka

What is System Hacking? Types and Prevention

Edureka

APRIL 10, 2025

When you hear the term System Hacking, it might bring to mind shadowy figures behind computer screens and high-stakes cyber heists. In this blog, we’ll explore the definition, purpose, process, and methods of prevention related to system hacking, offering a detailed overview to help demystify the concept.

Systems

Systems Education Banking Accessibility

Data News — Week 25.02

Christophe Blefari

JANUARY 11, 2025

Over the past four weeks, I took a break from blogging and LinkedIn to focus on building nao. AI companies are aiming for the moon—AGI—promising it will arrive once OpenAI develops a system capable of generating at least $100 billion in profits. Meanwhile, the AI landscape remains unpredictable.

Data

Data Data Warehouse Coding Programming Language

Real-Time AI for Crisis Management: Responding Faster with Smarter Systems

Striim

JANUARY 30, 2025

Systems must be capable of handling high-velocity data without bottlenecks. However, leveraging AI agents like Striim’s Sherlock and Sentinel, which enable encryption and masking for PII, can help ensure that data is safe even in the event a breach occurs. As you can see, there’s a lot to consider in adopting real-time AI.

Systems

Systems Management Hospitality Healthcare

PagerDuty alternatives

The Pragmatic Engineer

MAY 12, 2023

For a realtime alerting system! I have since talked with engineers on the OpsGenie team who said that it felt that Atlassian rushed the OpsGenie integration - after buying the company - onto their unified internal stack, ignoring warnings that an outage in their identity system would take OpsGenie down. Yes: for 2 weeks!

Systems

Systems Management Engineering IT

Cloud Data Warehouse Migrations: Success Stories from WHOOP and Nexon

Snowflake

NOVEMBER 26, 2024

This blog post is the second in a three-part series on migrations. A consolidated data system to accommodate a big(ger) WHOOP When a company experiences exponential growth over a short period, it’s easy for its data foundation to feel a bit like it was built on the fly.

Data Warehouse

Data Warehouse Cloud PostgreSQL Hadoop

AI Agent Systems: Modular Engineering for Reliable Enterprise AI Applications

databricks

NOVEMBER 12, 2024

Monolithic to Modular The proof of concept (POC) of any new technology often starts with large, monolithic units that are difficult to characterize.

Systems

Systems Engineering Technology Data

Databricks Named a Leader in 2023 Gartner® Magic Quadrant™ for Cloud Database Management Systems

databricks

DECEMBER 21, 2023

We are excited to announce that Gartner has recognized Databricks as a Leader for a third consecutive year in the 2023 Gartner® Magic.

Database

Database Systems Cloud Management

Data Engineering Weekly #219

Data Engineering Weekly

MAY 4, 2025

The blog highlights how emerging AI tools automate otherwise cognitively intensive manual tasks to bring reliability in software engineering. [link] Whatnot: Evolving Feed Ranking at Whatnot. Whatnot describes their transition from a batch prediction system to an online inference framework for ranking, which is shown in their "For You Feed."

Data Engineering

Data Engineering Data Engineer Engineering Java

Data Engineering Weekly #195

Data Engineering Weekly

OCTOBER 27, 2024

The blog is an excellent summary of the existing unstructured data landscape. It is exciting to read probably the first blog on building a vector search infrastructure at scale. The blog from Meta discusses how it designed a privacy-preserving storage.

Data Engineering

Data Engineering Data Engineer Engineering Unstructured Data

Octopai Acquisition Enhances Metadata Management to Trust Data Across Entire Data Estate

Cloudera

NOVEMBER 13, 2024

Additionally, multiple copies of the same data locked in proprietary systems contribute to version control issues, redundancies, staleness, and management headaches. The post Octopai Acquisition Enhances Metadata Management to Trust Data Across Entire Data Estate appeared first on Cloudera Blog.

Metadata

Metadata Management Data Governance Government

Unapologetically Technical Episode 20 – Shane Murray

Jesse Anderson

MAY 5, 2025

Finally, Shane outlines how observability is crucial for emerging AI/ML workflows like RAG pipelines, discussing the monitoring of vector databases (like Pinecone), unstructured data, and the entire AI system lifecycle, concluding with a look at Monte Carlo’s exciting roadmap, including AI-powered troubleshooting agents.

Unstructured Data

Unstructured Data Finance Metadata Architecture

Data Engineering Weekly #198

Data Engineering Weekly

NOVEMBER 24, 2024

The blog highlights how moving from 6-character base-64 to 20-digit base-2 file distribution brings more distribution in S3 and reduces request failures. The blog is a good summary of how to use Snowflake QUERY_TAG to measure and monitor query performance. The blog post made me curious to understand DataFusion's internals.

Data Engineering

Data Engineering Data Engineer Engineering Insurance

How Meta understands data at scale

Engineering at Meta

APRIL 28, 2025

Meta’s vast and diverse systems make it particularly challenging to comprehend its structure, meaning, and context at scale. We discovered that a flexible and incremental approach was necessary to onboard the wide variety of systems and languages used in building Meta’s products. We believe that privacy drives product innovation.

Metadata

Metadata Data Utilities Data Warehouse

Data Engineering Weekly #217

Data Engineering Weekly

APRIL 20, 2025

The blog took out the last edition’s recommendation on AI and summarized the current state of AI adoption in enterprises. The simplistic model expressed in the blog made it easy for me to reason about the transactional system design. Kafka is probably the most reliable data infrastructure in the modern data era.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Migrating Critical Traffic At Scale with No Downtime - Part 1

Netflix Tech

MAY 4, 2023

Behind the scenes, a myriad of systems and services are involved in orchestrating the product experience. These backend systems are consistently being evolved and optimized to meet and exceed customer and product expectations. This blog series will examine the tools, techniques, and strategies we have utilized to achieve this goal.

Utilities

Utilities Systems Architecture Coding

Did Automattic commit open source theft?

The Pragmatic Engineer

OCTOBER 18, 2024

Corporate conflict recap Automattic is the creator of open source WordPress content management system (CMS), and WordPress powers an incredible 43% of webpages and 65% of CMSes. This event is shameful and unprecedented in the history of open source on the web. Automattic raised $980M in venture funding and was valued at $7.5B

Engineering

Engineering Government Project AWS

Cloudera AI Inference Service Enables Easy Integration and Deployment of GenAI Into Your Production Environments

Cloudera

DECEMBER 4, 2024

System metrics, such as inference latency and throughput, are available as Prometheus metrics. Users can manage all of their models and applications on the Cloudera AI Inference service with common CI/CD systems using Cloudera service accounts, also known as machine users.

Architecture

Architecture Machine Learning BI Deep Learning

Securing the Future: How AI Gateways Protect AI Agent Systems in the Era of Generative AI

databricks

NOVEMBER 13, 2024

As organizations integrate AI agent systems into. Generative AI has become a powerful reality, transforming industries by enhancing customer experiences and automating decisions.

Systems

Top 10 Data Engineering Trends in 2025

Edureka

APRIL 22, 2025

This blog will explore the significant advancements, challenges, and opportunities impacting data engineering in 2025, highlighting the increasing importance for companies to stay updated. In 2025, this blog will discuss the most important data engineering trends, problems, and opportunities that companies should be aware of.

Data Engineering

Data Engineering Data Engineer Engineering Consulting

Data Engineering Weekly #210

Data Engineering Weekly

MARCH 2, 2025

I found the blog to be a fresh take on the skill in demand by layoff datasets. DeepSeek continues to impact the Data and AI landscape with its recent open-source tools, such as Fire-Flyer File System (3FS) and smallpond. The blog provides an excellent analysis of smallpond compared to Spark and Daft.

Data Engineering

Data Engineering Data Engineer Engineering Datasets

Unapologetically Technical Episode 17 – Semih Salihoglu

Jesse Anderson

FEBRUARY 11, 2025

Semih is a researcher and entrepreneur with a background in distributed systems and databases. He then pursued his doctoral studies at Stanford University, delving into the complexities of database systems.

Computer Science

Computer Science Database Design Software Engineer Software Engineering

Announcing Public Preview of Delta Sharing with Cloudflare R2 Integration

databricks

FEBRUARY 29, 2024

Special thanks to Phillip Jones, Senior Product Manager, and Harshal Brahmbhatt, Systems Engineer from Cloudflare for their contributions to this blog. Organizations across.

Engineering

Engineering Systems Management

Databricks Named a Leader in 2024 Gartner® Magic Quadrant™ for Cloud Database Management Systems

databricks

DECEMBER 23, 2024

We are excited to announce that Gartner has recognized Databricks as a Leader for a fourth consecutive year in the 2024 Gartner Magic.

Database

Database Systems Cloud Management

Drug Launch Case Study: Amazing Efficiency Using DataOps

DataKitchen

DECEMBER 9, 2024

This blog dives into the remarkable journey of a data team that achieved unparalleled efficiency using DataOps principles and software that transformed their analytics and data teams into a hyper-efficient powerhouse. A software system where processes can be developed and shared is required. Here is another example.

Pharmaceutical

Pharmaceutical Data Lake Cloud Storage Project

Improving Pinterest Search Relevance Using Large Language Models

Pinterest Engineering

APRIL 4, 2025

In this blog, we will go through the technical design and share some offline and online results for our LLM-based search relevance pipeline. Pin Text Representations Pins on Pinterest are rich multimedia entities that feature images, videos, and other contents, often linked to external webpages or blogs.

Machine Learning

Machine Learning Metadata Architecture Datasets

Change Data Capture at Pinterest

Pinterest Engineering

NOVEMBER 18, 2024

In this blog post, we’ll explore what CDC is, why it’s important, and our journey of implementing Generic CDC solutions for all online databases at Pinterest. This is crucial for applications that require up-to-date information, such as fraud detection systems or recommendation engines. What is Change Data Capture?

Kafka

Kafka MySQL Database Software Engineer

Introducing Impressions at Netflix

Netflix Tech

FEBRUARY 14, 2025

It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile’s exposure. In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily.

Kafka

Kafka Datasets Metadata Utilities

Solving the weekly menu puzzle pt.2: recommendations at Picnic

Picnic Engineering

APRIL 7, 2025

A little over a year ago, we shared a blog post about our journey to enhance customers’ meal planning experience with personalized recipe recommendations. We explained how a system that learns from your tastes and habits could solve this issue, ultimately making the daily task of choosing meals both effortless and inspiring.

Datasets

Datasets Systems Architecture Machine Learning

Data Engineering Weekly #207

Data Engineering Weekly

FEBRUARY 9, 2025

I found the product blog from QuantumBlack gives a view of data quality in unstructured data. [link] Pinterest: Advancements in Embedding-Based Retrieval at Pinterest Homefeed. Pinterest writes about its embedding-based retrieval system enhancements for Homefeed personalization and engagement.

Data Engineering

Data Engineering Data Engineer Engineering Unstructured Data

Unapologetically Technical Episode 14 – Cliff Crosland

Jesse Anderson

OCTOBER 29, 2024

He sees logs as a treasure trove of insights and believes effective log analysis is critical in today’s complex systems. We discussed his early experiences with distributed systems, including his work on creating graphs and entity resolution. Lastly, we go in-depth into Scanner.dev, covering what it is and how it works.

Data Engineering

Data Engineering Data Engineer Systems Engineering

Data Engineering Weekly #209

Data Engineering Weekly

FEBRUARY 23, 2025

It covers nine categories: storage systems, data lake platforms, processing, integration, orchestration, infrastructure, ML/AI, metadata management, and analytics. I found the blog to be a comprehensive roadmap for data engineering in 2025. I wonder if these systems will keep expanding their capabilities until they eventually collapse under their own weight.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Data Engineering Weekly #220

Data Engineering Weekly

MAY 11, 2025

[link] Alibaba: A Comprehensive Analysis and Practical Implementation of the New Features in the MCP Specification When I delved further into learning about the MCP specification, Alibaba's blog was a handy guide to understanding the protocol spec's evolution over the last four months.

Data Engineering

Data Engineering Data Engineer Engineering Data

Data Engineering Weekly #196

Data Engineering Weekly

NOVEMBER 3, 2024

Foundation Capital: A System of Agents brings Service-as-Software to life. Software is no longer simply a tool for organizing work; software becomes the worker itself, capable of understanding, executing, and improving upon traditionally human-delivered services. It's good to know about Dapr and restate.dev.

Data Engineering

Data Engineering Data Engineer Engineering Pipeline-centric

Introducing Configurable Metaflow

Netflix Tech

DECEMBER 19, 2024

This followed a previous blog on the same topic. In this context, having a single configuration system to manage an ML project holistically gives users increased project coherence and decreased project risk. Configuration in Metaboost: Ease of configuration and templatizing are core values of Metaboost.

Machine Learning

Machine Learning Project Data Warehouse Coding

Handling Network Throttling with AWS EC2 at Pinterest

Pinterest Engineering

APRIL 7, 2025

In recent years, while managing Pinterest’s EC2 infrastructure, particularly for our essential online storage systems, we identified a significant challenge: the lack of clear insights into EC2’s network performance and its direct impact on our applications’ reliability and performance. 4xl with up to 12.5

AWS

AWS Bytes Database Data Ingestion

Ensuring the Successful Launch of Ads on Netflix

Netflix Tech

JUNE 1, 2023

In this blog post, we’ll discuss the methods we used to ensure a successful launch, including: how we tested the system; Netflix technologies involved; best practices we developed. Realistic Test Traffic: Netflix traffic ebbs and flows throughout the day in a sinusoidal pattern. Basic with ads was launched worldwide on November 3rd.

Algorithm

Algorithm Kafka Metadata Systems

Part 2: A Survey of Analytics Engineering Work at Netflix

Netflix Tech

JANUARY 2, 2025

However, due to the absence of a control group in these countries, we adopt a synthetic control framework (blog post) to estimate the counterfactual scenario. We further break down performance by various dimensions, e.g. content type, genre, etc., to help us pinpoint specific areas where the ASR system may encounter difficulties.

Engineering

Engineering Entertainment Designing Technology
