Building efficient data pipelines with DuckDB. 1. Introduction 2. Project demo 3. … 4.1. Use DuckDB to process data, not for multiple users to access data 4.2. Cost calculation: DuckDB + ephemeral VMs = dirt cheap data processing 4.3. Processing data less than 100GB? Use DuckDB 4.4. …
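The core idea the outline points to is running DuckDB as an embedded engine for single-job batch processing rather than as a shared query service. Below is a minimal sketch of that pattern; the file paths and column names are hypothetical, not from the original post:

```python
# Hypothetical example: use DuckDB in-process to aggregate raw Parquet files
# on an ephemeral VM, then write the result back out. Nothing is served to
# concurrent users; the database is in-memory and disappears when the job ends.
import duckdb

con = duckdb.connect()  # in-memory database, nothing to provision
con.execute("""
    COPY (
        SELECT customer_id,
               date_trunc('day', order_ts) AS order_date,
               sum(amount)                 AS daily_revenue
        FROM read_parquet('raw/orders/*.parquet')
        GROUP BY 1, 2
    ) TO 'processed/daily_revenue.parquet' (FORMAT PARQUET)
""")
con.close()
```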
Introduction; Testing your data pipeline: 1. End-to-end system testing 2. Data quality testing 3. Unit and contract testing 4. Monitoring and alerting; Conclusion; Further reading. Introduction: Testing data pipelines is different from testing other applications, like a website backend.
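Of the layers that outline names, unit and contract testing is the easiest to show in a few lines. Here is a minimal, hypothetical sketch (pandas plus pytest-style assertions; the clean_orders transform is invented for illustration):

```python
# Hypothetical unit test for a pipeline transformation: because the transform
# is a pure function of a DataFrame, it can be tested without any infrastructure.
import pandas as pd

def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with a missing order_id and normalize currency codes."""
    out = df.dropna(subset=["order_id"]).copy()
    out["currency"] = out["currency"].str.upper()
    return out

def test_clean_orders_drops_missing_ids_and_normalizes_currency():
    raw = pd.DataFrame({"order_id": [1, None, 3], "currency": ["usd", "eur", "gbp"]})
    cleaned = clean_orders(raw)
    assert cleaned["order_id"].notna().all()
    assert set(cleaned["currency"]) == {"USD", "GBP"}
```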
Why Future-Proofing Your Data Pipelines Matters. Data has become the backbone of decision-making in businesses across the globe. The ability to harness and analyze data effectively can make or break a company’s competitive edge. Resilience and adaptability are the cornerstones of a future-proof data pipeline.
Whether it’s customer transactions, IoT sensor readings, or just an endless stream of social media hot takes, you need a reliable way to get that data from point A to point B while doing something clever with it along the way. That’s where data pipeline design patterns come in. Data Mesh Pattern 8.
Adding high-quality entity resolution capabilities to enterprise applications, services, data fabrics or data pipelines can be daunting and expensive. This will help you decide whether to build an in-house entity resolution system or utilize an existing solution like the Senzing® API for entity resolution.
We're explaining the end-to-end systems the Facebook app leverages to deliver relevant content to people. At Facebook's scale, the systems built to support and overcome these challenges require extensive trade-off analyses, focused optimizations, and architecture built to allow our engineers to push for the same user and business outcomes.
by Jasmine Omeke, Obi-Ike Nwoke, Olek Gorajek. Intro: This post is for all data practitioners who are interested in learning about bootstrapping, standardization and automation of batch data pipelines at Netflix. You may remember Dataflow from the post we wrote last year titled Data pipeline asset management with Dataflow.
However, we've found that this vertical self-service model doesn't work particularly well for data pipelines, which involve wiring together many different systems into end-to-end data flows. Data pipelines power foundational parts of LinkedIn's infrastructure, including replication between data centers.
Spencer Cook, Senior Solutions Architect at Databricks, joins to unpack how enterprises are moving beyond hype and building practical AI systems using vector search, RAG, and real-time data pipelines.
Summary Any software system that survives long enough will require some form of migration or evolution. When that system is responsible for the data layer the process becomes more challenging. Sriram Panyam has been involved in several projects that required migration of large volumes of data in high traffic environments.
Summary The first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor.
Understand source data: Know what you have to work with. 2.3. Model your data: Define data models for historical analytics. 2.4. Pipeline design: Design data pipelines to populate your data models. 2.5. Data quality: Ensure you quality-check your data before usage.
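The "Data quality" step in that outline is the most concrete, so here is a minimal, hypothetical sketch of checking a staged table before downstream use (the table path, columns, and rules are invented):

```python
# Hypothetical pre-usage data quality check: fail the pipeline early with a
# clear message instead of letting bad data reach the reporting models.
import pandas as pd

def check_orders(df: pd.DataFrame) -> list[str]:
    """Return a list of data quality failures; an empty list means the data passed."""
    failures = []
    if df.empty:
        failures.append("orders table is empty")
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values found")
    if (df["amount"] < 0).any():
        failures.append("negative order amounts found")
    return failures

issues = check_orders(pd.read_parquet("staging/orders.parquet"))
if issues:
    raise ValueError("Data quality check failed: " + "; ".join(issues))
```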
In this fast-paced digital era, multiple sources like IoT devices, social media platforms, and financial systems generate data continuously and in real time. Every business wants to analyze this data in real time to stay ahead of the competition. It has the ability to […]
Delivering the right events at low latency and with a high volume is critical to Picnic’s system architecture. In our previous blog, Dima Kalashnikov explained how we configure our Internal services pipeline in the Analytics Platform. In this post, we will explain how our team automates the creation of new data pipeline deployments.
Data Pipeline Observability: A Model For Data Engineers. Eitan Chazbani, June 29, 2023. Data pipeline observability is your ability to monitor and understand the state of a data pipeline at any time. We believe the world’s data pipelines need better data observability.
In this article, I'll share how even the best AI applications can break, and how leading teams are managing reliability at scale across the ever-evolving data + AI estate. There are endless ways a data source can and does change, and it's unavoidable for owners of data pipelines and products to be occasionally surprised by it.
Introduction. 2. HTTP is a protocol commonly used for websites 2.1. APIs are a way to communicate between systems on the Internet 2.1.1. Request: Ask the Internet exactly what you want 2.1.2. … 3. API Data extraction = GET-ting data from a server 3.1. GET data 3.1.1. GET data for a specific entity
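Since the outline boils API extraction down to issuing a GET request for a specific entity, here is a minimal sketch using the requests library; the base URL, path, and entity id are placeholders, not details from the original post:

```python
# Hypothetical example of "GET-ting data from a server" for a specific entity.
import requests

BASE_URL = "https://api.example.com"  # placeholder server

def get_entity(entity_id: int) -> dict:
    """GET data for one entity and return the parsed JSON body."""
    response = requests.get(f"{BASE_URL}/entities/{entity_id}", timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of silently continuing
    return response.json()

if __name__ == "__main__":
    print(get_entity(42))
```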
Data pipelines are the backbone of your business’s data architecture. Implementing a robust and scalable pipeline ensures you can effectively manage, analyze, and organize your growing data. We’ll answer the question, “What are data pipelines?” Table of Contents: What are Data Pipelines?
In this episode Ian Schweer shares his experiences at Riot Games supporting player-focused features such as machine learning models and recommender systems that are deployed as part of the game binary. The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it.
Recognizing this shortcoming and the capabilities that could be unlocked by a robust solution, Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies. Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. When is Opaque the wrong choice?
However, leveraging AI agents like Striim's Sherlock and Sentinel, which enable encryption and masking for PII, can help ensure that data is safe even in the event a breach occurs. Systems must be capable of handling high-velocity data without bottlenecks. As you can see, there's a lot to consider in adopting real-time AI.
Data transformations as functions lead to maintainable code; track connections & configs when connecting to external systems; track pipeline progress (logging, Observer) with objects; use objects to store configurations of data systems (e.g., …); templatize data flow patterns with a Class.
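A minimal sketch of two of those ideas, transformations as plain functions and configuration held in a small object, is below; the class and column names are invented for illustration:

```python
# Hypothetical illustration: pure-function transformations plus a dataclass
# that stores the configuration of an external data system.
from dataclasses import dataclass
import pandas as pd

@dataclass
class WarehouseConfig:
    host: str
    port: int
    database: str

def remove_cancelled_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Pure transformation: easy to unit test and to reuse across pipelines."""
    return df[df["status"] != "cancelled"]

def add_order_value(df: pd.DataFrame) -> pd.DataFrame:
    """Another small, composable transformation step."""
    return df.assign(order_value=df["quantity"] * df["unit_price"])

config = WarehouseConfig(host="localhost", port=5432, database="analytics")
```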
Business success is based on how we use continuously changing data. That’s where streaming data pipelines come into play. This article explores what streaming data pipelines are, how they work, and how to build this data pipeline architecture. What is a streaming data pipeline?
With dozens of data engineers building hundreds of production jobs, controlling their performance at scale is untenable for a myriad of reasons, from technical to human. The missing link today is the establishment of a closed-loop feedback system that helps automatically drive pipeline infrastructure towards business goals.
I know the manual work you did last summer. Introduction: A few weeks ago, I wrote a post about developing a data pipeline using both on-premise and AWS tools. This post is part of my recent effort to bring more cloud-oriented data engineering posts. Adding a module to the Glue job: Our main.tf
As a data or analytics engineer, you knew where to find all the transformation logic and models because they were all in the same codebase. You probably worked closely with the colleague who built the data pipeline you were consuming. There was only one data team, two at most. How did they do it?
ERP and CRM systems are designed and built to fulfil a broad range of business processes and functions. This generalisation makes their data models complex and cryptic, and working with them requires domain expertise. As you do not want to start your development with uncertainty, you decide to go for the operational raw data directly.
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. What are the key points of comparison for that combination in relation to other possible selections?
On-premise and cloud working together to deliver a data product. Developing a data pipeline is somewhat similar to playing with Lego: you mentalize what needs to be achieved (the data requirements), choose the pieces (software, tools, platforms), and fit them together. Google Cloud.
The Machine Learning Platform (MLP) team at Netflix provides an entire ecosystem of tools around Metaflow , an open source machine learning infrastructure framework we started, to empower data scientists and machine learning practitioners to build and manage a variety of ML systems. ETL workflows), as well as downstream (e.g.
Here’s how Snowflake Cortex AI and Snowflake ML are accelerating the delivery of trusted AI solutions for the most critical generative AI applications: Natural language processing (NLP) for data pipelines: Large language models (LLMs) have transformative potential, but they often require batch inference integration into pipelines, which can be cumbersome.
AI data engineers play a critical role in developing and managing AI-powered data systems. Table of Contents What Does an AI Data Engineer Do? AI data engineers are data engineers who are responsible for developing and managing data pipelines that support AI and GenAI data products.
The answer lies in unstructured data processing—a field that powers modern artificial intelligence (AI) systems. Unlike neatly organized rows and columns in spreadsheets, unstructured data—such as text, images, videos, and audio—requires advanced processing techniques to derive meaningful insights.
Snowflake enables organizations to be data-driven by offering an expansive set of features for creating performant, scalable, and reliable data pipelines that feed dashboards, machine learning models, and applications. But before data can be transformed and served or shared, it must be ingested from source systems.
As we look towards 2025, it’s clear that data teams must evolve to meet the demands of new technologies and opportunities. In this blog post, we’ll explore key strategies that data teams should adopt to prepare for the year ahead. Tool sprawl is another hurdle that data teams must overcome.
The first phase focuses on building a data pipeline. This involves getting data from an API and storing it in a PostgreSQL database. Overview: Let’s break down the data pipeline process step by step. Data streaming: Initially, data is streamed from the API into a Kafka topic.
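A minimal sketch of that first streaming step is below, assuming the kafka-python client; the API URL and topic name are placeholders rather than details from the original post:

```python
# Hypothetical producer: fetch records from an HTTP API and publish each one
# to a Kafka topic, from which a downstream consumer can load PostgreSQL.
import json
import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def stream_once() -> None:
    """Fetch one batch of records from the API and send each to Kafka."""
    records = requests.get("https://api.example.com/users", timeout=10).json()
    for record in records:
        producer.send("users_created", value=record)
    producer.flush()  # make sure everything is actually delivered

if __name__ == "__main__":
    stream_once()
```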
Airflow Sensors are one of the most common tasks in data pipelines. If you want to build complex and robust data pipelines, you have to truly understand how Sensors work. Back to your data pipeline: imagine your DAG runs every day at midnight, but the files from sources A, B, and C never come. It depends.
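For that midnight scenario, a Sensor is what keeps the DAG from processing files that have not arrived yet. Below is a minimal, hypothetical sketch waiting on a single source; the path, schedule, and timeouts are invented, and it assumes Airflow's default filesystem connection:

```python
# Hypothetical DAG: wait for the file from source A before processing it.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="nightly_load",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # runs every day at midnight
    catchup=False,
) as dag:
    wait_for_source_a = FileSensor(
        task_id="wait_for_source_a",
        filepath="/data/incoming/source_a.csv",
        poke_interval=300,    # check every 5 minutes
        timeout=6 * 60 * 60,  # give up (and fail) after 6 hours
    )
    process = BashOperator(task_id="process", bash_command="echo processing source A")

    wait_for_source_a >> process
```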
By enabling advanced analytics and centralized document management, Digityze AI helps pharmaceutical manufacturers eliminate data silos and accelerate data sharing. KAWA Analytics: Digital transformation is an admirable goal, but legacy systems and inefficient processes hold back many companies’ efforts.
To meet this need, people who work in data engineering will focus on making systems that can handle ongoing data streams with little delay. Cloud-Native Data Engineering: These days, cloud-based systems are the best choice for data engineering infrastructure because they are flexible and can grow as needed.
Monte Carlo and Databricks double down on their partnership, helping organizations build trusted AI applications by expanding visibility into the data pipelines that fuel the Databricks Data Intelligence Platform. For too long, data teams have been flying blind when it comes to AI systems.
The Bronze layer is the initial landing zone for all incoming raw data, capturing it in its unprocessed, original form. This foundational layer is a repository for various data types, from transaction logs and sensor data to social media feeds and system logs. However, this architecture is not without its challenges.
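To make the Bronze layer concrete, here is a minimal, hypothetical PySpark sketch of landing raw data unchanged and adding only load metadata; the storage paths and the format choice (plain Parquet rather than a table format such as Delta) are assumptions for illustration:

```python
# Hypothetical Bronze-layer ingest: keep the payload exactly as received and
# append only metadata columns that describe the load itself.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze_ingest").getOrCreate()

raw = (
    spark.read.json("s3://landing/events/2024-06-01/")   # unprocessed source files
    .withColumn("_ingested_at", F.current_timestamp())   # when we loaded it
    .withColumn("_source_file", F.input_file_name())     # where it came from
)

raw.write.mode("append").format("parquet").save("s3://lake/bronze/events/")
```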
If the underlying data is incomplete, inconsistent, or delayed, even the most advanced AI models and business intelligence systems will produce unreliable insights. Many organizations struggle with: Inconsistent data formats: Different systems store data in varied structures, requiring extensive preprocessing before analysis.
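As a small illustration of that preprocessing burden, here is a hypothetical sketch in which two systems expose the same customer fields under different names and date formats, and both are normalized to one schema before analysis (all names and formats are invented):

```python
# Hypothetical normalization step: map differently-shaped source records
# onto a single schema so downstream models see consistent data.
import pandas as pd

crm = pd.DataFrame({"CustomerID": [1], "SignupDate": ["03/15/2024"]})
erp = pd.DataFrame({"customer_id": [2], "signup_date": ["2024-03-16"]})

def normalize_crm(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "customer_id": df["CustomerID"],
        "signup_date": pd.to_datetime(df["SignupDate"], format="%m/%d/%Y"),
    })

def normalize_erp(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "customer_id": df["customer_id"],
        "signup_date": pd.to_datetime(df["signup_date"], format="%Y-%m-%d"),
    })

customers = pd.concat([normalize_crm(crm), normalize_erp(erp)], ignore_index=True)
```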
Summary Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. How do you manage the personalization of the AI functionality in your system for each user/team?