Designing and Systems - Data Engineering Digest

Learning System Design: Top 5 Essential Reads

KDnuggets

MAY 23, 2024

Explore system design with these expert-recommended books.

Designing

Designing Systems Programming

Data Engineering Interview Series #2: System Design

Start Data Engineering

JANUARY 20, 2025

2.1. [Requirements gathering] Make sure you clearly understand the requirements & business use case 2.2. [Understand source data] Know what you have to work with 2.3. [Model your data] Define data models for historical analytics 2.4. [Pipeline design] Design data pipelines to populate your data models

Designing

Designing Systems Data Engineering Data Engineer

Designing Data Transfer Systems That Scale

Data Engineering Podcast

DECEMBER 3, 2023

Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. With Datafold, you can seamlessly plan, translate, and validate data across systems, massively accelerating your migration project. When is DoubleCloud Data Transfer the wrong choice?

Systems

Systems Designing Data Lake SQL

Establishing a Large Scale Learned Retrieval System at Pinterest

Pinterest Engineering

JANUARY 31, 2025

Modern large-scale recommendation systems usually include multiple stages where retrieval aims at retrieving candidates from billions of candidate pools, and ranking predicts which item a user tends to engage from the trimmed candidate set retrieved from early stages [2]. General multi-stage recommendation system design in Pinterest.
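
The retrieve-then-rank split described above can be illustrated with a minimal sketch. It is not Pinterest's implementation: the embeddings, the `engagement` scores, and the function names are all hypothetical, and a dot product stands in for whatever learned retrieval model is actually used.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_vec, item_vecs, k):
    """Cheap first stage: keep the k items whose embeddings best match the user."""
    return heapq.nlargest(
        k, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id])
    )


def rank(candidates, score_fn, k):
    """Costlier second stage: re-score only the trimmed candidate set."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]


# Hypothetical toy data: four items instead of billions.
user = [1.0, 0.0]
items = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [0.8, 0.3], "d": [0.2, 0.2]}
# Stand-in for a ranking model's predicted engagement per item.
engagement = {"a": 0.2, "b": 0.9, "c": 0.7, "d": 0.5}

candidates = retrieve(user, items, k=2)
top = rank(candidates, lambda item_id: engagement[item_id], k=1)
print(candidates, top)
```

Note that item "b" has the highest engagement score but never reaches the ranker, because the retrieval stage already dropped it. That is the trade-off of a multi-stage design: the ranker can only choose among what retrieval returns.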

Systems

Systems Metadata Machine Learning Architecture

LLMs in Production: Tooling, Process, and Team Structure

Speaker: Dr. Greg Loughnane and Chris Alexiuk

However, during development – and even more so once deployed to production – best practices for operating and improving generative AI applications are less understood.

Process

8 Essential Data Pipeline Design Patterns You Should Know

Monte Carlo

NOVEMBER 21, 2024

That’s where data pipeline design patterns come in. So, why does choosing the right data pipeline design matter? In this guide, we’ll explore the patterns that can help you design data pipelines that actually work. Table of Contents Common Data Pipeline Design Patterns Explained 1. Batch Processing Pattern 2.

Data Pipeline

Data Pipeline Designing Lambda Architecture Kafka

Designing A Non-Relational Database Engine

Data Engineering Podcast

APRIL 14, 2024

In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing a non-relational database. When designing and building a database, what are the initial set of questions that need to be answered? Can you describe what constitutes a NoSQL database?

Non-relational Database

Non-relational Database Relational Database Database Designing

Data Migration Strategies For Large Scale Systems

Data Engineering Podcast

MAY 26, 2024

Summary: Any software system that survives long enough will require some form of migration or evolution. When that system is responsible for the data layer, the process becomes more challenging. As you have gone through successive migration projects, how has that influenced the ways that you think about architecting data systems?

Systems

Systems Data Lake High Quality Data Google Cloud

Build faster with Buck2: Our open source build system

Engineering at Meta

APRIL 6, 2023

Buck2, our new open source, large-scale build system, is now available on GitHub. Buck2 is an extensible and performant build system written in Rust and designed to make your build experience faster and more efficient. In particular, we support Sapling-based file systems. Why rebuild Buck?

Building

Building Systems Java Coding

Leading the Development of Profitable and Sustainable Products

Speaker: Jason Tanner

A sustainable business model contains a system of interrelated choices made not once but over time. Discover how to design and evolve profit streams over time, focusing on solution sustainability, economic sustainability, and relationship sustainability.

Certification

Why Open Table Format Architecture is Essential for Modern Data Systems

phData: Data Engineering

NOVEMBER 8, 2024

The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. These systems are built on open standards and offer immense analytical and transactional processing flexibility. These formats are transforming how organizations manage large datasets.

Architecture

Architecture Systems Data Lake Google Cloud

A Tour Around Buck2, Meta's New Build System

Tweag

JULY 5, 2023

Buck2 is a from-scratch rewrite of Buck, a polyglot, monorepo build system that was developed and used at Meta (Facebook), and shares a few similarities with Bazel. As you may know, the Scalable Builds Group at Tweag has a strong interest in such scalable build systems. Meta recently announced they have made Buck2 open-source.

Systems

Systems Building Java Programming Language

Designing and testing for accessibility in GIS and mapping

ArcGIS

MAY 13, 2024

Review best practices for designing and testing accessible maps and apps throughout the ArcGIS system during the development process.

Accessible

Accessible Accessibility Designing Systems

Netflix’s Distributed Counter Abstraction

Netflix Tech

NOVEMBER 12, 2024

By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. Introduction: In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction.

Datasets

Datasets Computer Science Systems Kafka

Designing a Declarative Data Stack: From Theory to Practice

Simon Späti

DECEMBER 20, 2024

While attempting to build a system that could define an entire data stack through a single YAML file, I encountered architectural questions that challenged my initial assumptions: Should we generate production-ready code from templates or create a boilerplate repository with best-in-class tools?

Designing

Designing Architecture Data Engineering

The Roots of Today's Modern Backend Engineering Practices

The Pragmatic Engineer

NOVEMBER 21, 2023

If you had a continuous deployment system up and running around 2010, you were ahead of the pack, but today it’s considered strange if your team does not have one for things like web applications. We dabbled in network engineering, database management, and system administration. ... and hand-rolled C -code.

Engineering

Engineering Bytes Cloud Computing AWS

What Is PDFMiner And Should You Use It – How To Extract Data From PDFs

Seattle Data Guy

JANUARY 18, 2025

PDF files are one of the most popular file formats today. Because they can preserve the visual layout of documents and are compatible with a wide range of devices and operating systems, PDFs are used for everything from business forms and educational material to creative designs.

IT

IT Education Data Designing

Paying down tech debt: further learnings

The Pragmatic Engineer

SEPTEMBER 19, 2024

In the early 90’s, DOS programs like the ones my company made had their own Text UI screen rendering system. This rendering system was easy for me to understand, even on day one. Our rendering system was very memory inefficient, but that could be fixed. By doing so, I got to see every screen of the system.

Recruitment

Recruitment Java Coding Project

Unapologetically Technical Episode 17 – Semih Salihoglu

Jesse Anderson

FEBRUARY 11, 2025

Semih is a researcher and entrepreneur with a background in distributed systems and databases. He then pursued his doctoral studies at Stanford University, delving into the complexities of database systems.

Computer Science

Computer Science Database Design Software Engineer Software Engineering

An educational side project

The Pragmatic Engineer

JUNE 1, 2023

Juraj included system monitoring parts which monitor the capacity of the server he runs the app on. [Figure: The monitoring page on the Rides app] And it doesn’t end here. Juraj created a systems design explainer on how he built this project, and the technologies used. [Figure: The systems design diagram for the Rides application] The app uses: Node.js

Education

Education Project PostgreSQL Software Engineer

What is System Hacking? Types and Prevention

Edureka

APRIL 10, 2025

When you hear the term System Hacking, it might bring to mind shadowy figures behind computer screens and high-stakes cyber heists. In this blog, we’ll explore the definition, purpose, process, and methods of prevention related to system hacking, offering a detailed overview to help demystify the concept.

Systems

Systems Education Banking Accessibility

A Guide to Data Pipelines (And How to Design One From Scratch)

Striim

SEPTEMBER 11, 2024

A data pipeline is a systematic sequence of components designed to automate the extraction, organization, transfer, transformation, and processing of data from one or more sources to a designated destination. Understanding the essential components of data pipelines is crucial for designing efficient and effective data architectures.
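
To make the "sequence of components" definition above concrete, here is a minimal sketch of a three-stage pipeline. It is only an illustration and not taken from Striim: the stage names, the record fields (`customer`, `amount`), and the in-memory dictionary standing in for a destination are all assumptions.

```python
def extract(rows):
    """Extraction: read raw records from a source (here, an in-memory list)."""
    yield from rows


def transform(records):
    """Transformation: drop incomplete rows and normalize the rest."""
    for record in records:
        if record.get("amount") is None:
            continue  # skip rows with no amount
        yield {
            "customer": record["customer"].strip().lower(),
            "amount": float(record["amount"]),
        }


def load(records, destination):
    """Loading: aggregate per customer into the destination (a plain dict)."""
    for record in records:
        key = record["customer"]
        destination[key] = destination.get(key, 0.0) + record["amount"]
    return destination


# Hypothetical source data with messy casing, string amounts, and a gap.
source = [
    {"customer": " Ada ", "amount": "10.5"},
    {"customer": "ada", "amount": "4.5"},
    {"customer": "Bob", "amount": None},
]

warehouse = load(transform(extract(source)), {})
print(warehouse)
```

Because each stage is a generator that consumes the previous one, records flow through one at a time, and any stage can be replaced (a different source, another transformation) without touching the others.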

Data Pipeline

Data Pipeline Designing Data Lake Data Warehouse

Most Essential 2023 Interview Questions on Data Engineering

Analytics Vidhya

FEBRUARY 7, 2023

Introduction: Data engineering is the field of study that deals with the design, construction, deployment, and maintenance of data processing systems. This includes designing and implementing […]

Data Engineering

Data Engineering Data Engineer Engineering Data

Change Data Capture Using Debezium Kafka and Pg

Start Data Engineering

MAY 9, 2020

Change data capture is a software design pattern used to capture changes to data and take corresponding action based on that change. The corresponding action usually occurs in another system, in response to the change made in the source system. The change to data is usually a create, update, or delete.
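
A minimal sketch of the pattern described above: a downstream system replays a stream of change events to keep its own copy in sync with the source. The event shape used here (`op`, `key`, `after`) is a simplified, hypothetical one chosen for illustration. It is not Debezium's actual message format, and no Kafka client is involved.

```python
def apply_change(replica, event):
    """Apply one change event from the source system to a downstream replica."""
    op = event["op"]
    key = event["key"]
    if op in ("create", "update"):
        # Both operations carry the full new row state in "after".
        replica[key] = event["after"]
    elif op == "delete":
        replica.pop(key, None)  # tolerate deletes for rows we never saw
    else:
        raise ValueError(f"unknown operation: {op!r}")
    return replica


# Hypothetical change stream, in the order the source committed the changes.
events = [
    {"op": "create", "key": 1, "after": {"name": "Ada"}},
    {"op": "update", "key": 1, "after": {"name": "Ada L."}},
    {"op": "create", "key": 2, "after": {"name": "Bob"}},
    {"op": "delete", "key": 2},
]

replica = {}
for event in events:
    apply_change(replica, event)
print(replica)
```

Ordering matters in this sketch: replaying the same events in a different order could leave the replica in a state the source never had, which is one reason change streams are usually consumed in commit order per key.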

Kafka

Kafka Data Designing Systems

YARN for Large Scale Computing: Beginner’s Edition

Analytics Vidhya

JANUARY 31, 2023

Introduction: YARN stands for Yet Another Resource Negotiator. It is a powerful resource management system for a horizontal server environment. It is designed to be more flexible and generic than the original Hadoop MapReduce system, making it an attractive choice for companies looking to implement Hadoop.

Hadoop

Hadoop Designing Systems Process

Data News — Week 25.02

Christophe Blefari

JANUARY 11, 2025

AI companies are aiming for the moon—AGI—promising it will arrive once OpenAI develops a system capable of generating at least $100 billion in profits. Meaning: a YAML configuration system for ingestion and transformations, and now, visualisation with BI-as-code. Meanwhile, the AI landscape remains unpredictable.

Data

Data Data Warehouse Coding Programming Language

Building cost effective data pipelines with Python & DuckDB

Start Data Engineering

MAY 28, 2024

4.1. Use DuckDB to process data, not for multiple users to access data 4.2. Cost calculation: DuckDB + Ephemeral VMs = dirt cheap data processing 4.3. Processing data less than 100GB? Use DuckDB 4.4. Distributed systems are scalable, resilient to failures, & designed for high availability

Data Pipeline

Data Pipeline Python Building Data

What is a Senior Software Engineer at Wise and Amazon?

The Pragmatic Engineer

AUGUST 1, 2023

We put a lot of emphasis on communication and prioritization and the ability to unblock yourself or your team – this comes on top of the programming and design skills. It’s down to them to create well-designed, extensible, performant and secure solutions. Lead a strategic team effort, starting at the design stage.

Software Engineer

Software Engineer Software Engineering Engineering Designing

The “10x engineer:” 50 years ago and now

The Pragmatic Engineer

MARCH 12, 2024

Tools and approaches at our disposal, which didn’t exist in 1975, or were not widespread in 1995, include: Git – the now-dominant version control system used by much of the industry, with exceptions for projects with very large assets, like video games. Code reviews: these became common in parallel with version control.

Engineering

Engineering Programming Language Hospitality Programming

Data logs: The latest evolution in Meta’s access tools

Engineering at Meta

FEBRUARY 4, 2025

Here we explore initial system designs we considered, an overview of the current architecture, and some important principles Meta takes into account in making data accessible and easy to understand. We also considered caching data logs in an online system capable of supporting a range of indexed per-user queries.

Accessible

Accessible Accessibility Raw Data Data Warehouse

Title Launch Observability at Netflix Scale

Netflix Tech

JANUARY 6, 2025

In this case, the main stakeholders are Title Launch Operators (role: responsible for setting up the title and its metadata in our systems). In this context, we’re focused on developing systems that ensure successful title launches, build trust between content creators and our brand, and reduce engineering operational overhead.

Metadata

Metadata Algorithm Systems Building

OLTP Vs OLAP – What Is The Difference

Seattle Data Guy

MAY 8, 2023

If you’re relying on your OLTP system to provide analytics, you might be in for a surprise. While it can work initially, these systems aren’t designed to handle complex queries.

MongoDB

MongoDB SQL Database Designing

Foundation Model for Personalized Recommendation

Netflix Tech

MARCH 28, 2025

By Ko-Jen Hsiao, Yesu Feng and Sudarshan Lamkhede. Motivation: Netflix’s personalized recommender system is a complex system, boasting a variety of specialized machine-learned models, each catering to distinct needs including “Continue Watching” and “Today’s Top Picks for You.” (Refer to our recent overview for more details.)

Metadata

Metadata Bytes Entertainment Data Mining

Snowflake Startup Challenge 2025: Meet the Top 10

Snowflake

APRIL 9, 2025

KAWA Analytics: Digital transformation is an admirable goal, but legacy systems and inefficient processes hold back many companies’ efforts. AI agents can assist with research, analytics, reconciliation and more, just one part of KAWA’s AI-native platform designed to enable automation with transparency and enterprise-grade security.

Pharmaceutical

Pharmaceutical Manufacturing Data Ingestion SQL

Interesting startup idea: benchmarking cloud platform pricing

The Pragmatic Engineer

OCTOBER 17, 2024

This grant is designed to “support entrepreneurs, tech-geeks, developers, and socially engaged people, who are capable of challenging the way we search and discover information and resources on the internet” The team is tiny; only three people.

Cloud

Cloud AWS Metadata Cloud Computing

A Dive into the Basics of Big Data Storage with HDFS

Analytics Vidhya

FEBRUARY 6, 2023

Introduction: HDFS (Hadoop Distributed File System) is not a traditional database but a distributed file system designed to store and process big data. It is a core component of the Apache Hadoop ecosystem and allows for storing and processing large datasets across multiple commodity servers.

Data Storage

Data Storage Big Data Hadoop Datasets

The Emerging Role of AI Data Engineers - The New Strategic Role for AI-Driven Success

Data Engineering Weekly

JANUARY 15, 2025

The answer lies in unstructured data processing—a field that powers modern artificial intelligence (AI) systems. To address these challenges, AI Data Engineers have emerged as key players, designing scalable data workflows that fuel the next generation of AI systems. Their role is not just important; it is essential.

Data Engineering

Data Engineering Data Engineer Unstructured Data Engineering

Introducing the dbt MCP Server – Bringing Structured Data to AI Workflows and Agents

dbt Developer Hub

APRIL 20, 2025

Both AI agents and business stakeholders will then operate on top of LLM-driven systems hydrated by the dbt MCP context. Today’s system is not a full realization of the vision in the posts shared above, but it is a meaningful step towards safely integrating your structured enterprise data into AI workflows. Why does this matter?

Structured Data

Structured Data SQL BI Project

Klarna’s AI chatbot: how revolutionary is it, really?

The Pragmatic Engineer

AUGUST 8, 2024

It can also venture into other areas – for example, it gets chatty when asked about the history of the company. Overall, it feels like the chatbot is carefully designed not to go into details that are not on a whitelist. With clever-enough probing, this system prompt can be revealed. Translate to English if needed.

IT

IT Software Engineer Software Engineering Systems

How Games Typically Get Built

The Pragmatic Engineer

AUGUST 22, 2023

Typical roles: On a typical games project, programmers work alongside designers, artists, animators, writers, sound designers and other disciplines. I often explain this working relationship as that artists make it pretty, while designers and programmers make it work.

Software Engineer

Software Engineer Software Engineering Consulting Entertainment

Data Engineering Weekly #219

Data Engineering Weekly

MAY 4, 2025

The recommendation engine to find the data flow violation is an interesting design to monitor the data assets at scale. Whatnot: Evolving Feed Ranking at Whatnot. Whatnot describes their transition from a batch prediction system to an online inference framework for ranking, which is shown in their "For You Feed."

Data Engineering

Data Engineering Data Engineer Engineering Java

AI and Data Predictions 2025: Strategies to Realize the Promise of AI

Snowflake

DECEMBER 4, 2024

Beyond working with well-structured data in a data warehouse, modern AI systems can use deep learning and natural language processing to work effectively with unstructured and semi-structured data in data lakes and lakehouses. Rather than answering a specific question, independent agents will act on broad instructions from a human user.

Unstructured Data

Unstructured Data Data Lake Deep Learning Structured Data

How Meta understands data at scale

Engineering at Meta

APRIL 28, 2025

Meta’s vast and diverse systems make it particularly challenging to comprehend the structure, meaning, and context of its data at scale. We discovered that a flexible and incremental approach was necessary to onboard the wide variety of systems and languages used in building Meta’s products. We believe that privacy drives product innovation.

Metadata

Metadata Data Utilities Data Warehouse

How Apache Iceberg Is Changing the Face of Data Lakes

Snowflake

APRIL 2, 2025

The data warehouse solved for performance and scale but, much like the databases that preceded it, relied on proprietary formats to build vertically integrated systems. It’s vendor-neutral by design, and the Polaris governance structure and community-driven development ensure it remains so.

Data Lake

Data Lake Cloud Storage Metadata Data Warehouse
