Blog and Systems - Data Engineering Digest

Establishing a Large Scale Learned Retrieval System at Pinterest

Pinterest Engineering

JANUARY 31, 2025

Modern large-scale recommendation systems usually include multiple stages: retrieval selects candidates from pools of billions of items, and ranking predicts which items a user is likely to engage with from the trimmed candidate set retrieved in the earlier stages [2]. (Figure: general multi-stage recommendation system design at Pinterest.)

Systems

Systems Metadata Machine Learning Architecture

Building a Question-Answering System Using RAG

WeCloudData

APRIL 9, 2025

The ability to extract information from vast amounts of text has made question-answering (QA) systems essential in the modern era of AI-driven apps. RAG-based question-answering systems use large language models to generate human-like responses to user queries.

Systems

Systems Building IT Data Science

An educational side project

The Pragmatic Engineer

JUNE 1, 2023

Juraj included system monitoring that tracks the capacity of the server he runs the app on. (Figure: the monitoring page on the Rides app.) And it doesn’t end here. Juraj created a systems design explainer on how he built this project and the technologies used. (Figure: the systems design diagram for the Rides application.) The app uses: Node.js

Education

Education Project PostgreSQL Software Engineering

Why Open Table Format Architecture is Essential for Modern Data Systems

phData: Data Engineering

NOVEMBER 8, 2024

The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. In this blog, we will discuss: What is the Open Table format (OTF)? These systems are built on open standards and offer immense analytical and transactional processing flexibility.

Architecture

Architecture Systems Data Lake Google Cloud

I asked ChatGPT to write a blog post about Data Engineering. Here it is.

Confessions of a Data Guy

DECEMBER 29, 2022

It involves designing and building the infrastructure to store and process data, as well as developing the tools and systems to extract valuable insights and knowledge from that […]

Data Engineering

Data Engineering Data Engineer Engineering IT

What is System Hacking? Types and Prevention

Edureka

APRIL 10, 2025

When you hear the term System Hacking, it might bring to mind shadowy figures behind computer screens and high-stakes cyber heists. In this blog, we’ll explore the definition, purpose, process, and methods of prevention related to system hacking, offering a detailed overview to help demystify the concept.

Systems

Systems Education Banking Accessible

Build Compound AI Systems Faster with Databricks Mosaic AI

databricks

OCTOBER 1, 2024

Many of our customers are shifting from monolithic prompts with general-purpose models to specialized compound AI systems to achieve the quality needed for.

Systems

Systems Building Data Science Engineering

How Meta discovers data flows via lineage at scale

Engineering at Meta

JANUARY 22, 2025

It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta’s systems. In this blog, we will delve into an early stage in PAI implementation: data lineage. Data lineage enables us to efficiently navigate these assets and protect user data.

Data Warehouse

Data Warehouse SQL Programming Language Data

Netflix’s Distributed Counter Abstraction

Netflix Tech

NOVEMBER 12, 2024

By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. Introduction: In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.

Datasets

Datasets Computer Science Systems Kafka

Data News — Week 25.02

Christophe Blefari

JANUARY 11, 2025

Over the past four weeks, I took a break from blogging and LinkedIn to focus on building nao. AI companies are aiming for the moon—AGI—promising it will arrive once OpenAI develops a system capable of generating at least $100 billion in profits. Meanwhile, the AI landscape remains unpredictable.

Data

Data Data Warehouse Coding Programming Language

Real-Time AI for Crisis Management: Responding Faster with Smarter Systems

Striim

JANUARY 30, 2025

Systems must be capable of handling high-velocity data without bottlenecks. However, leveraging AI agents like Striim’s Sherlock and Sentinel, which enable encryption and masking for PII, can help ensure that data is safe even in the event a breach occurs. As you can see, there’s a lot to consider in adopting real-time AI.

Systems

Systems Management Hospitality Healthcare

PagerDuty alternatives

The Pragmatic Engineer

MAY 12, 2023

For a realtime alerting system! I have since talked with engineers on the OpsGenie team who said that it felt that Atlassian rushed the OpsGenie integration - after buying the company - onto their unified internal stack, ignoring warnings that an outage in their identity system would take OpsGenie down. Yes: for 2 weeks!

Systems

Systems Management Engineering IT

Cloud Data Warehouse Migrations: Success Stories from WHOOP and Nexon

Snowflake

NOVEMBER 26, 2024

This blog post is the second in a three-part series on migrations. A consolidated data system to accommodate a big(ger) WHOOP: When a company experiences exponential growth over a short period, it’s easy for its data foundation to feel a bit like it was built on the fly.

Data Warehouse

Data Warehouse Cloud PostgreSQL Hadoop

AI Agent Systems: Modular Engineering for Reliable Enterprise AI Applications

databricks

NOVEMBER 12, 2024

Monolithic to Modular: The proof of concept (POC) of any new technology often starts with large, monolithic units that are difficult to characterize.

Systems

Systems Engineering Technology Data

Data Engineering Weekly #195

Data Engineering Weekly

OCTOBER 27, 2024

The blog is an excellent summary of the existing unstructured data landscape. It is exciting to read probably the first blog on building a vector search infrastructure at scale. The blog from Meta discusses how it designed a privacy-preserving storage.

Data Engineering

Data Engineering Data Engineer Engineering Unstructured Data

Databricks Named a Leader in 2023 Gartner® Magic Quadrant™ for Cloud Database Management Systems

databricks

DECEMBER 21, 2023

We are excited to announce that Gartner has recognized Databricks as a Leader for a third consecutive year in the 2023 Gartner® Magic.

Database

Database Systems Cloud Management

Octopai Acquisition Enhances Metadata Management to Trust Data Across Entire Data Estate

Cloudera

NOVEMBER 13, 2024

Additionally, multiple copies of the same data locked in proprietary systems contribute to version control issues, redundancies, staleness, and management headaches.

Metadata

Metadata Management Data Governance Government

Data Engineering Weekly #198

Data Engineering Weekly

NOVEMBER 24, 2024

The blog highlights how moving from 6-character base-64 to 20-digit base-2 file distribution brings more distribution in S3 and reduces request failures. The blog is a good summary of how to use Snowflake QUERY_TAG to measure and monitor query performance. The blog post made me curious to understand DataFusion's internals.

Data Engineering

Data Engineering Data Engineer Engineering Insurance

Foundation Model for Personalized Recommendation

Netflix Tech

MARCH 28, 2025

By Ko-Jen Hsiao, Yesu Feng and Sudarshan Lamkhede. Motivation: Netflix’s personalized recommender system is a complex system, boasting a variety of specialized machine learned models each catering to distinct needs including Continue Watching and Today’s Top Picks for You. (Refer to our recent overview for more details).

Metadata

Metadata Bytes Data Mining Entertainment

Improving Pinterest Search Relevance Using Large Language Models

Pinterest Engineering

APRIL 4, 2025

In this blog, we will go through the technical design and share some offline and online results for our LLM-based search relevance pipeline. Pin Text Representations: Pins on Pinterest are rich multimedia entities that feature images, videos, and other contents, often linked to external webpages or blogs.

Machine Learning

Machine Learning Metadata Architecture Datasets

Migrating Critical Traffic At Scale with No Downtime - Part 1

Netflix Tech

MAY 4, 2023

Behind the scenes, a myriad of systems and services are involved in orchestrating the product experience. These backend systems are consistently being evolved and optimized to meet and exceed customer and product expectations. This blog series will examine the tools, techniques, and strategies we have utilized to achieve this goal.

Utilities

Utilities Systems Architecture Coding

Solving the weekly menu puzzle pt.2: recommendations at Picnic

Picnic Engineering

APRIL 7, 2025

A little over a year ago, we shared a blog post about our journey to enhance customers’ meal planning experience with personalized recipe recommendations. We explained how a system that learns from your tastes and habits could solve this issue, ultimately making the daily task of choosing meals both effortless and inspiring.

Datasets

Datasets Systems Architecture Machine Learning

Did Automattic commit open source theft?

The Pragmatic Engineer

OCTOBER 18, 2024

Corporate conflict recap: Automattic is the creator of the open source WordPress content management system (CMS), and WordPress powers an incredible 43% of webpages and 65% of CMSes. This event is shameful and unprecedented in the history of open source on the web. Automattic raised $980M in venture funding and was valued at $7.5B

Engineering

Engineering Government Project AWS

Cloudera AI Inference Service Enables Easy Integration and Deployment of GenAI Into Your Production Environments

Cloudera

DECEMBER 4, 2024

System metrics, such as inference latency and throughput, are available as Prometheus metrics. Users can manage all of their models and applications on the Cloudera AI Inference service with common CI/CD systems using Cloudera service accounts, also known as machine users.

Architecture

Architecture Machine Learning BI Deep Learning

Securing the Future: How AI Gateways Protect AI Agent Systems in the Era of Generative AI

databricks

NOVEMBER 13, 2024

As organizations integrate AI agent systems into. Generative AI has become a powerful reality, transforming industries by enhancing customer experiences and automating decisions.

Systems

How Netflix Accurately Attributes eBPF Flow Logs

Netflix Tech

APRIL 8, 2025

By Cheng Xie, Bryan Shultz, and Christine Xu. In a previous blog post, we described how Netflix uses eBPF to capture TCP flow logs at scale for enhanced network insights. Delays and failures are inevitable in distributed systems, which may delay IP address change events from reaching FlowCollector.

AWS

AWS Kafka Cloud Programming

Data Engineering Weekly #210

Data Engineering Weekly

MARCH 2, 2025

I found the blog to be a fresh take on the skill in demand by layoff datasets. DeepSeek continues to impact the Data and AI landscape with its recent open-source tools, such as Fire-Flyer File System (3FS) and smallpond. The blog provides an excellent analysis of smallpond compared to Spark and Daft.

Data Engineering

Data Engineering Data Engineer Engineering Datasets

Handling Network Throttling with AWS EC2 at Pinterest

Pinterest Engineering

APRIL 7, 2025

In recent years, while managing Pinterest’s EC2 infrastructure, particularly for our essential online storage systems, we identified a significant challenge: the lack of clear insights into EC2’s network performance and its direct impact on our applications’ reliability and performance. 4xl with up to 12.5

AWS

AWS Bytes Database Data Ingestion

Unapologetically Technical Episode 17 – Semih Salihoglu

Jesse Anderson

FEBRUARY 11, 2025

Semih is a researcher and entrepreneur with a background in distributed systems and databases. He then pursued his doctoral studies at Stanford University, delving into the complexities of database systems.

Computer Science

Computer Science Database Design Software Engineer Software Engineering

Announcing Public Preview of Delta Sharing with Cloudflare R2 Integration

databricks

FEBRUARY 29, 2024

Special thanks to Phillip Jones, Senior Product Manager, and Harshal Brahmbhatt, Systems Engineer from Cloudflare for their contributions to this blog. Organizations across.

Engineering

Engineering Systems Management

Databricks Named a Leader in 2024 Gartner® Magic Quadrant™ for Cloud Database Management Systems

databricks

DECEMBER 23, 2024

We are excited to announce that Gartner has recognized Databricks as a Leader for a fourth consecutive year in the 2024 Gartner Magic.

Database

Database Systems Cloud Management

Drug Launch Case Study: Amazing Efficiency Using DataOps

DataKitchen

DECEMBER 9, 2024

This blog dives into the remarkable journey of a data team that achieved unparalleled efficiency using DataOps principles and software that transformed their analytics and data teams into a hyper-efficient powerhouse. A software system where processes can be developed and shared is required. Here is another example.

Pharmaceutical

Pharmaceutical Data Lake Cloud Storage Project

Change Data Capture at Pinterest

Pinterest Engineering

NOVEMBER 18, 2024

In this blog post, we’ll explore what CDC is, why it’s important, and our journey of implementing Generic CDC solutions for all online databases at Pinterest. This is crucial for applications that require up-to-date information, such as fraud detection systems or recommendation engines. What is Change Data Capture?

Kafka

Kafka MySQL Database Software Engineering

Introducing Impressions at Netflix

Netflix Tech

FEBRUARY 14, 2025

It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile’s exposure. In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily.

Kafka

Kafka Datasets Metadata Utilities

Data Engineering Weekly #207

Data Engineering Weekly

FEBRUARY 9, 2025

I found the product blog from QuantumBlack gives a view of data quality in unstructured data. Pinterest: Advancements in Embedding-Based Retrieval at Pinterest Homefeed. Pinterest writes about its embedding-based retrieval system enhancements for Homefeed personalization and engagement.

Data Engineering

Data Engineering Data Engineer Engineering Unstructured Data

Unapologetically Technical Episode 14 – Cliff Crosland

Jesse Anderson

OCTOBER 29, 2024

He sees logs as a treasure trove of insights and believes effective log analysis is critical in today’s complex systems. We discussed his early experiences with distributed systems, including his work on creating graphs and entity resolution. Lastly, we go in-depth into Scanner.dev, covering what it is and how it works.

Data Engineering

Data Engineering Data Engineer Systems Engineering

Making Email Better With AI At Shortwave

Data Engineering Podcast

APRIL 21, 2024

How do you manage the personalization of the AI functionality in your system for each user/team? Contact Info: LinkedIn, Blog. Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today? When is Shortwave the wrong choice? What do you have planned for the future of Shortwave?

Data Lake

Data Lake High Quality Data Machine Learning Data Pipeline

Data Engineering Weekly #209

Data Engineering Weekly

FEBRUARY 23, 2025

It covers nine categories: storage systems, data lake platforms, processing, integration, orchestration, infrastructure, ML/AI, metadata management, and analytics. I found the blog to be a comprehensive roadmap for data engineering in 2025. I wonder if these systems will keep adding capabilities until they eventually collapse under their own weight.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Adding Write Functionality to Pages with Self-Service APIs

Picnic Engineering

APRIL 14, 2025

(Written by Kirill Voloshin & Abdullah Abusamrah) In our previous blog posts, we have covered our server-driven UI framework called Picnic Page Platform. This blog post explores how we’ve further evolved our framework to support more complex flows that interact with our back-end systems, persist data and more.

Java

Java Retail SQL Database

Data Engineering Weekly #196

Data Engineering Weekly

NOVEMBER 3, 2024

Foundation Capital: A System of Agents brings Service-as-Software to life. Software is no longer simply a tool for organizing work; software becomes the worker itself, capable of understanding, executing, and improving upon traditionally human-delivered services. It's good to know about Dapr and restate.dev.

Data Engineering

Data Engineering Data Engineer Engineering Pipeline-centric

Building Holiday Finds: How Pinterest Engineers Reimagined Gift Discovery

Pinterest Engineering

MARCH 26, 2025

Unified Logging System: We implemented comprehensive engagement tracking that helps us understand how users interact with gift content differently from standard Pins.

Building

Building Engineering Algorithm Systems

Introducing Configurable Metaflow

Netflix Tech

DECEMBER 19, 2024

This followed a previous blog on the same topic. In this context, having a single configuration system to manage an ML project holistically gives users increased project coherence and decreased project risk. Configuration in Metaboost: Ease of configuration and templatizing are core values of Metaboost.

Machine Learning

Machine Learning Project Data Warehouse Coding

Ensuring the Successful Launch of Ads on Netflix

Netflix Tech

JUNE 1, 2023

In this blog post, we’ll discuss the methods we used to ensure a successful launch, including how we tested the system, the Netflix technologies involved, and the best practices we developed. Realistic Test Traffic: Netflix traffic ebbs and flows throughout the day in a sinusoidal pattern. Basic with ads was launched worldwide on November 3rd.

Algorithm

Algorithm Kafka Metadata Systems

5 Advantages of Real-Time ETL for Snowflake

Striim

MARCH 21, 2025

This blog post describes the advantages of real-time ETL and how it increases the value gained from Snowflake implementations. If you have Snowflake or are considering it, now is the time to think about your ETL for Snowflake.

Data Warehouse

Data Warehouse MongoDB MySQL Hadoop
