Key Takeaways: Centralized visibility of data is key. Modern IT environments require comprehensive data for successful AIOps, which includes incorporating data from legacy systems like IBM i and IBM Z into ITOps platforms. Too often, those legacy systems operate in isolation.
Summary The first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor.
Our DataWorkflow Platform team introduces WorkflowGuard: a new service to govern executions, prioritize resources, and manage the lifecycle of repetitive data jobs. Check out how it improved workflow reliability and cost efficiency while bringing more observability to users.
Summary Any software system that survives long enough will require some form of migration or evolution. When that system is responsible for the data layer, the process becomes more challenging. Sriram Panyam has been involved in several projects that required migration of large volumes of data in high-traffic environments.
Document RAG preparation : Ingesting, cleaning, and chunking documents before embedding them into vector representations, enabling efficient retrieval and enhanced LLM responses in retrieval-augmented generation (RAG) systems. An efficient batch processing system scales in a cost-effective manner to handle growing volumes of unstructured data.
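As a minimal illustration of the chunking step described above, here is a sketch of fixed-size chunking with overlap. The function name and the chunk_size/overlap values are illustrative assumptions, not part of any system mentioned in the blurb:

```python
# Minimal sketch of fixed-size document chunking with overlap,
# a common preparation step before embedding for RAG.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split cleaned text into overlapping character chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = " ".join(["word"] * 300)  # stand-in for a cleaned document
chunks = chunk_text(doc, chunk_size=200, overlap=20)
print(len(chunks), len(chunks[0]))
```

Real pipelines usually chunk on token or sentence boundaries rather than raw characters, but the overlap idea is the same: it keeps context that straddles a chunk boundary retrievable.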
The answer lies in unstructured data processing—a field that powers modern artificial intelligence (AI) systems. Unlike neatly organized rows and columns in spreadsheets, unstructured data—such as text, images, videos, and audio—requires advanced processing techniques to derive meaningful insights.
Summary The life sciences industry has seen incredible growth in scale and sophistication, along with the advances in data technology that make it possible to analyze massive amounts of genomic information. Interview Introduction (see Guy’s bio below) How did you get involved in the area of data management?
Data lakes are notoriously complex. For data engineers who battle to build and scale high-quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics.
Over the coming months and years, data teams will increasingly focus on building the rich context that feeds into the dbt MCP server. Both AI agents and business stakeholders will then operate on top of LLM-driven systems hydrated by the dbt MCP context. These tools can be called by LLM systems to learn about your data and metadata.
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment.
Summary Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging.
To meet this need, people who work in data engineering will focus on making systems that can handle ongoing data streams with little delay. Cloud-Native Data Engineering These days, cloud-based systems are the best choice for data engineering infrastructure because they are flexible and can grow as needed.
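A toy sketch of the streaming idea above: events are folded into tumbling one-second windows as they arrive. This is purely illustrative (production systems would use a framework such as Flink or Kafka Streams), and all names are made up for the example:

```python
# Toy sketch of low-latency stream handling: group (timestamp, key)
# events into tumbling windows and count occurrences per key.
from collections import defaultdict

def window_counts(events: list[tuple[float, str]], size_s: float = 1.0) -> dict:
    """Assign each event to a tumbling window of size_s seconds, count by key."""
    windows: dict[int, dict] = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        windows[int(ts // size_s)][key] += 1
    return {w: dict(counts) for w, counts in windows.items()}

events = [(0.1, "click"), (0.4, "view"), (0.9, "click"), (1.2, "click")]
print(window_counts(events))
```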
Petr shares his journey from being an engineer to founding Synq, emphasizing the importance of treating datasystems with the same rigor as engineering systems. He discusses the challenges and solutions in data reliability, including the need for transparency and ownership in datasystems.
The system leverages a combination of an event-based storage model in its TimeSeries Abstraction and continuous background aggregation to calculate counts across millions of counters efficiently. [link] Grab: Metasense V2 - Enhancing, improving, and productionisation of LLM-powered data governance.
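The event-log-plus-background-aggregation pattern described above can be sketched in a few lines. This is a simplified illustration of the general technique, not the actual architecture; the names and structure are assumptions for the example:

```python
# Writes append events to a log; a periodic job folds them into a rollup
# so that reads over millions of counters stay cheap.
from collections import defaultdict

events = []  # append-only event log: (counter_id, delta)
rollup = defaultdict(int)  # continuously aggregated counts

def increment(counter_id: str, delta: int = 1) -> None:
    events.append((counter_id, delta))

def run_background_aggregation() -> int:
    """Fold pending events into the rollup; return how many were processed."""
    processed = 0
    while events:
        counter_id, delta = events.pop(0)
        rollup[counter_id] += delta
        processed += 1
    return processed

def read_count(counter_id: str) -> int:
    # A real system would merge the rollup with the unaggregated tail.
    return rollup[counter_id] + sum(d for c, d in events if c == counter_id)

increment("page_views"); increment("page_views"); increment("clicks")
print(read_count("page_views"))  # 2, served from the unaggregated tail
run_background_aggregation()
print(read_count("page_views"))  # still 2, now served from the rollup
```

The design choice here is the classic write/read trade-off: appends are O(1), and the background job amortizes aggregation cost so reads never scan the full event history.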
Instead of driving innovation, data engineers often find themselves bogged down with maintenance tasks. On average, engineers spend over half of their time maintaining existing systems rather than developing new solutions. Tool sprawl is another hurdle that data teams must overcome.
Summary Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
We're sharing how Meta built support for data logs, which provide people with additional data about how they use our products. Here we explore the initial system designs we considered, an overview of the current architecture, and some important principles Meta takes into account in making data accessible and easy to understand.
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality dataworkflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics.
To address this shortcoming, Datorios created an observability platform for Flink that brings visibility to the internals of this popular stream processing system. This episode is supported by Code Comments, an original podcast from Red Hat.
What are the other systems that feed into and rely on the Trino/Iceberg service?
What are the experimental methods that you are using to understand the opportunities and practical limits of those systems?
Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Your first 30 days are free!
Foundation Capital: A System of Agents brings Service-as-Software to life. Software is no longer simply a tool for organizing work; software becomes the worker itself, capable of understanding, executing, and improving upon traditionally human-delivered services. It's good to know about Dapr and restate.dev.
Summary Modern businesses aspire to be data driven, and technologists enjoy working through the challenge of building data systems to support that goal. Data governance is the binding force between these two parts of the organization. What are some of the misconceptions that you encounter about data governance?
Read More: Discover how to build a data pipeline in 6 steps. Data Integration Data integration involves combining data from different sources into a single, unified view. This technique is vital for ensuring consistency and accuracy across datasets, especially in organizations that rely on multiple data systems.
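The unified-view idea can be sketched with a small full outer join keyed on a shared identifier. The source systems (a CRM and a billing system) and all record contents here are hypothetical:

```python
# Minimal sketch of data integration: merge records from two source
# systems into one unified customer view, keyed by customer id.
crm = {"c1": {"name": "Ada"}, "c2": {"name": "Grace"}}
billing = {"c1": {"plan": "pro"}, "c3": {"plan": "free"}}

def unify(*sources: dict) -> dict:
    """Full outer join on customer id; later sources add or overwrite fields."""
    unified: dict[str, dict] = {}
    for source in sources:
        for cid, record in source.items():
            unified.setdefault(cid, {}).update(record)
    return unified

view = unify(crm, billing)
print(view["c1"])  # {'name': 'Ada', 'plan': 'pro'}
```

In practice this step also involves schema mapping and entity resolution, since the "same" customer rarely carries an identical key across systems.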
Summary Building a data platform is a substantial engineering endeavor. The services and systems need to be kept up to date, but so does the code that controls their behavior.
Summary A significant portion of data workflows involve storing and processing information in database engines. In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data.
Each of our input datasets needs to be carefully evaluated, and we must define specific fields, values, transformation processes, rules, and update frequencies. This project involves many datasets, including data from a leading firmographics supplier, our own products, and data from internal systems. Plan for it.
Evals are introduced to evaluate LLM responses through various techniques, including self-evaluation, using another LLM as a judge, or human evaluation to ensure the system's behavior aligns with intentions. It employs a two-tower model approach to learn query and item embeddings from user engagement data.
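The "LLM as judge" technique mentioned above can be sketched as a small eval harness. The judge here is a keyword-matching stub standing in for a real model call, and the cases, rubric, and pass threshold are all illustrative assumptions:

```python
# Sketch of an eval harness: score each (question, answer) pair with a
# judge, then pass/fail the system on the mean score.
def judge(question: str, answer: str) -> float:
    """Stub judge: a real system would prompt a second LLM for a score."""
    topic = question.split()[-1].rstrip("?")
    return 1.0 if answer and topic in answer else 0.0

def run_evals(cases: list[tuple[str, str]], threshold: float = 0.8) -> bool:
    scores = [judge(q, a) for q, a in cases]
    return sum(scores) / len(scores) >= threshold

cases = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Which planet is largest?", "Jupiter is the largest planet."),
]
print(run_evals(cases))
```

Swapping the stub for an actual LLM call (with a scoring rubric in the prompt) turns this into the judge pattern; human evaluation plugs into the same harness by replacing `judge` with a labeling step.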
What are the characteristics that distinguish a knowledge graph? What are the layers/stages of applications and data that can/should incorporate JSON-LD as the representation for records and events? What is the level of native support/compatibility that you see for JSON-LD in data systems? When is JSON-LD the wrong choice?
There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms.
A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization.
[link] BVP: Roadmap: Data 3.0 in the Lakehouse Era. BVP writes about its thesis around Data 3.0; a few exciting theses exist around the composite data stack, catalogs, and MCP. • Leverage synthetic data effectively to bootstrap evaluation processes. The article provides an excellent overview of evaluating an LLM system.
Over the decades of research and development into building these software systems, there are a number of common components that are shared across implementations.
Summary Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. Cliff Crosland co-founded Scanner to provide fast querying of high-scale log data for security auditing.
In this blog, we offer guidance for leveraging Snowflake’s capabilities around data and AI to build apps and unlock innovation. By making full use of these functionalities, teams can enable new use cases and drive new data, AI, and app initiatives to enhance the customer experience and expand their data platform.
Summary The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. What are the points of friction that teams run into when trying to build their data platform?