Meanwhile, operations teams use entity extraction on documents to automate workflows and enable metadata-driven analytical filtering. As data volumes grow and AI automation expands, cost efficiency in processing with LLMs depends on both system architecture and model flexibility.
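To make the idea of metadata-driven filtering concrete, here is a minimal sketch of rule-based entity extraction over a document, using the standard library only. The document text, field names, and the $10,000 routing rule are all illustrative assumptions, not taken from any of the systems discussed here.

```python
import re
from datetime import date

# Hypothetical document text; in practice this would come from OCR or a parser.
doc = "Invoice INV-20231104 issued 2023-11-04 to ACME Corp for $12,500."

# Pull out fields that downstream workflows can use as filterable metadata.
metadata = {
    "invoice_id": re.search(r"INV-\d+", doc).group(),
    "issued_on": date.fromisoformat(re.search(r"\d{4}-\d{2}-\d{2}", doc).group()),
    "amount_usd": float(re.search(r"\$([\d,]+)", doc).group(1).replace(",", "")),
}

# Metadata-driven filtering: route only high-value invoices for review.
if metadata["amount_usd"] > 10_000:
    print("route to manual review", metadata)
```

A production system would use an LLM or NER model instead of regexes, but the workflow shape is the same: extract structured metadata once, then filter and route on it cheaply.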
Summary: The binding element of all data work is the metadata graph generated by the workflows that produce the assets used by teams across the organization. The DataHub project was created to bring order to the scale of LinkedIn's data needs. No more scripts, just SQL.
The system leverages a combination of an event-based storage model in its TimeSeries Abstraction and continuous background aggregation to calculate counts across millions of counters efficiently. [link] Grab: Metasense V2 - Enhancing, improving, and productionising LLM-powered data governance. Boyter on Bloom Filters and SQLite.
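A minimal sketch of the counting pattern described above, in plain Python (this is an illustration of the idea, not Netflix's actual implementation): writes append immutable count events, and a background job periodically folds them into a checkpoint so reads stay cheap.

```python
from collections import defaultdict

events = []                    # append-only event log: (counter_id, delta)
checkpoint = defaultdict(int)  # aggregated counts as of the last rollup
rollup_pos = 0                 # log position already covered by the checkpoint

def increment(counter_id, delta=1):
    events.append((counter_id, delta))  # O(1) write, no read-modify-write

def rollup():
    """Background aggregation: fold new events into the checkpoint."""
    global rollup_pos
    for counter_id, delta in events[rollup_pos:]:
        checkpoint[counter_id] += delta
    rollup_pos = len(events)

def read(counter_id):
    # Checkpoint plus the small not-yet-rolled-up tail of the log.
    tail = sum(d for c, d in events[rollup_pos:] if c == counter_id)
    return checkpoint[counter_id] + tail

increment("video_plays"); increment("video_plays")
rollup()
increment("video_plays")
print(read("video_plays"))  # 3
```

The design trade-off is that writes never contend on a shared counter value; reads pay a small cost for the unaggregated tail, which the background job keeps bounded.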
Summary: The life sciences industry has seen incredible growth in scale and sophistication, along with advances in the data technology that makes it possible to analyze massive amounts of genomic information. You can observe your pipelines with built-in metadata search and column-level lineage.
Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. What are the other systems that feed into and rely on the Trino/Iceberg service?
Instead of driving innovation, data engineers often find themselves bogged down with maintenance tasks. On average, engineers spend over half of their time maintaining existing systems rather than developing new solutions. Tool sprawl is another hurdle that data teams must overcome.
We're sharing how Meta built support for data logs, which provide people with additional data about how they use our products. Here we explore the initial system designs we considered, an overview of the current architecture, and some important principles Meta takes into account in making data accessible and easy to understand.
Metadata is the information that provides context and meaning to data, ensuring it’s easily discoverable, organized, and actionable. It enhances data quality, governance, and automation, transforming raw data into valuable insights. Imagine a library with millions of books but no catalog system to organize them.
Foundation Capital: A System of Agents brings Service-as-Software to life. Software is no longer simply a tool for organizing work; software becomes the worker itself, capable of understanding, executing, and improving upon traditionally human-delivered services. It's good to know about Dapr and restate.dev.
Summary: There are extensive and valuable data sets available outside the bounds of your organization. Whether that data is public, paid, or scraped, it requires investment and upkeep to acquire and integrate it with your systems. Atlan is the metadata hub for your data ecosystem.
Summary: A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. What is the level of native support/compatibility that you see for JSON-LD in data systems?
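To illustrate what "tightly coupling metadata with raw information" looks like in JSON-LD: the @context embedded in the document maps short keys to globally unique IRIs, so the data carries its own semantics. A minimal sketch using the pyld library (the document contents here are example values):

```python
# Requires: pip install pyld
from pyld import jsonld

doc = {
    "@context": {
        "name": "http://schema.org/name",
        "homepage": {"@id": "http://schema.org/url", "@type": "@id"},
    },
    "name": "Data Engineering Podcast",
    "homepage": "https://www.dataengineeringpodcast.com/",
}

# Expansion resolves the context, leaving self-describing, unambiguous terms.
print(jsonld.expand(doc))
# Roughly:
# [{'http://schema.org/name': [{'@value': 'Data Engineering Podcast'}],
#   'http://schema.org/url': [{'@id': 'https://www.dataengineeringpodcast.com/'}]}]
```

Because every key expands to a full IRI, two systems that have never coordinated on a schema can still agree on what "name" means.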
In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in implementing these constraints in your data workflows. Atlan is the metadata hub for your data ecosystem. Missing data? Stale dashboards?
Summary: Data lineage is the roadmap for your data platform, providing visibility into all of the dependencies for any report, machine learning model, or data warehouse table that you are working with. Atlan is the metadata hub for your data ecosystem. Data lineage and metadata systems are a hot topic right now.
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. RudderStack helps you build a customer data platform on your warehouse or data lake.
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. What are the contributing factors that lead to fragmentation of visibility for data workflows at different stages?
Input: List of source tables and required processing mode. Output: Psyberg identifies new events that have occurred since the last high watermark (HWM) and records them in the session metadata table. The session metadata table can then be read to determine the pipeline input. Audit: Run various quality checks on the staged data.
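A minimal sketch of high-watermark incremental processing, loosely modeled on the description above; the table shape and field names here are illustrative assumptions, not Psyberg's actual schema.

```python
from datetime import datetime, timezone

session_metadata = []  # each run appends {"hwm": ..., "input_rows": ...}

def latest_hwm():
    """The high watermark recorded by the most recent run, or the epoch floor."""
    if session_metadata:
        return session_metadata[-1]["hwm"]
    return datetime.min.replace(tzinfo=timezone.utc)

def plan_run(source_events):
    """Identify events newer than the last HWM and record them as this run's input."""
    hwm = latest_hwm()
    new_events = [e for e in source_events if e["event_time"] > hwm]
    if new_events:
        session_metadata.append({
            "hwm": max(e["event_time"] for e in new_events),
            "input_rows": len(new_events),
        })
    return new_events  # downstream stages read this as the pipeline input

events = [{"event_time": datetime(2024, 5, 1, tzinfo=timezone.utc)},
          {"event_time": datetime(2024, 5, 2, tzinfo=timezone.utc)}]
print(len(plan_run(events)))  # 2 on the first run
print(len(plan_run(events)))  # 0 on a rerun: nothing is newer than the HWM
```

The key property is idempotence: reruns consult the session metadata rather than reprocessing everything, so each event is picked up exactly once.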
Summary: The data industry is changing rapidly, and one of the most active areas of growth is the automation of data workflows. Taking cues from the DevOps movement of the past decade, data professionals are orienting around the concept of DataOps. How does security factor into the design of robust DataOps systems?
In this episode Prukalpa Sankar shares her experiences across multiple attempts at building a system that brings everyone onto the same page, ultimately bringing her to found Atlan. What portions of the data workflow is Atlan responsible for? What components of the data stack might Atlan replace?
Data catalogs are the most expensive data integration systems you never intended to build. A data catalog as a passive web portal for displaying metadata requires significant rethinking to fit modern data workflows, not just the addition of "modern" as a prefix.
You don’t need to archive or clean data before loading. The system automatically replicates information to prevent data loss in the case of a node failure. It doesn’t belong to the master-slave paradigm; it is responsible for loading data into the cluster, describing how the data must be processed, and retrieving the output.
Change Data Capture (CDC) is a powerful technique that revolutionises data engineering by capturing and applying incremental changes to databases or data sources. It bridges gaps in data ecosystems, ensuring consistency and synchronisation across systems.
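A minimal sketch of the CDC pattern: replaying an ordered stream of change events to keep a downstream replica in sync. The event shape loosely follows the before/after convention popularized by tools like Debezium, but this is an illustration, not any tool's actual payload format.

```python
replica = {}  # primary key -> current row, our synchronized downstream copy

def apply_change(event):
    """Apply one incremental change to the replica, preserving source order."""
    op, key = event["op"], event["key"]
    if op in ("c", "u"):      # create or update: upsert the after-image
        replica[key] = event["after"]
    elif op == "d":           # delete: remove the row
        replica.pop(key, None)

changes = [
    {"op": "c", "key": 1, "after": {"id": 1, "status": "new"}},
    {"op": "u", "key": 1, "after": {"id": 1, "status": "shipped"}},
    {"op": "d", "key": 1, "after": None},
]
for event in changes:
    apply_change(event)

print(replica)  # {} -- the insert, update, and delete all replayed in order
```

Because only the deltas move, CDC avoids full-table reloads while still converging every consumer to the same state as the source.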
Due to its popularity, the number of workflows managed by the system has grown exponentially. The scheduler on-call has to closely monitor the system during non-business hours. As the usage increased, we had to vertically scale the system to keep up and were approaching AWS instance type limits.
The data warehouse is not designed to serve point requests from microservices with low latency. Therefore, we must efficiently move data from the data warehouse to a global, low-latency and highly-reliable key-value store. Figure 1 shows how we use Bulldozer to move data at Netflix.
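A minimal sketch of the pattern Bulldozer implements (the names and key scheme below are illustrative assumptions, not Netflix's API): batch-export warehouse rows into a key-value store whose keys are designed for low-latency point lookups by microservices.

```python
warehouse_rows = [
    {"member_id": 42, "recs": ["show_a", "show_b"]},
    {"member_id": 7, "recs": ["show_c"]},
]

kv_store = {}  # stand-in for a distributed key-value store

def publish(rows, key_field):
    """Move warehouse rows into the KV store, keyed for point reads."""
    for row in rows:
        # Key design matters: services look up by member_id directly,
        # never scanning, so every request is an O(1) point read.
        kv_store[f"recs:{row[key_field]}"] = row

publish(warehouse_rows, key_field="member_id")
print(kv_store["recs:42"]["recs"])  # ['show_a', 'show_b']
```

The warehouse stays the system of record for heavy analytics, while the KV copy absorbs the high-QPS, latency-sensitive traffic the warehouse was never built for.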
At its core, a table format is a sophisticated metadata layer that defines, organizes, and interprets multiple underlying data files. Table formats incorporate aspects like columns, rows, data types, and relationships, but can also include information about the structure of the data itself.
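A toy illustration of that metadata layer, far simpler than what Iceberg, Delta Lake, or Hudi actually maintain: the table metadata records the schema plus per-file statistics, which lets a query engine skip files that cannot contain matching rows. All paths and values here are made up for the example.

```python
table_metadata = {
    "schema": [
        {"name": "order_id", "type": "long"},
        {"name": "order_date", "type": "date"},
    ],
    "files": [
        {"path": "part-000.parquet", "rows": 10_000,
         "min": {"order_date": "2024-01-01"}, "max": {"order_date": "2024-01-31"}},
        {"path": "part-001.parquet", "rows": 12_000,
         "min": {"order_date": "2024-02-01"}, "max": {"order_date": "2024-02-29"}},
    ],
}

def prune(files, column, value):
    """Use per-file min/max stats to skip files that cannot match."""
    return [f["path"] for f in files
            if f["min"][column] <= value <= f["max"][column]]

# A query filtering on a February date only needs to scan the second file.
print(prune(table_metadata["files"], "order_date", "2024-02-15"))
# ['part-001.parquet']
```

This is why the table format is described as interpreting the underlying files: the engine consults metadata first and touches data files only when it must.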
Editor’s Note: The results are out for our poll on the current state of the data catalog. The highlight is that 59% of folks think data catalogs are only sometimes helpful. In their current state, these systems are inherently passive.
Is your data stuck in separate areas within your company, making it hard to use effectively? Here’s the deal: for data to truly drive your business forward, you need a reliable and scalable system to keep it moving without hiccups. In other words, you need data orchestration. So, why is data orchestration a big deal?
It aims to streamline data ingestion, processing, and analytics by automating and integrating various data workflows. It encompasses the systems, tools, and processes that enable businesses to manage their data more efficiently and effectively. Data Sources: Data sources are the backbone of any DataOps architecture.
DataOps tools should provide a comprehensive data cataloging solution that allows organizations to create a centralized repository of their data assets, complete with metadata, data lineage information, and data samples, and that integrates easily with other services and systems.
[link] Data Engineering Weekly: Data Catalog - A Broken Promise. Data catalogs are the most expensive data integration systems you never intended to build.
Multiple databases are queried, and the results are passed through a transformation server that transforms the data to comply with the business rules; the transformed data is then stored in a data store. This approach is also used for legacy systems. It means that your data store has to be capable of transforming the data.
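A minimal sketch of that flow: query multiple sources, apply a business rule in the transformation step, then land the result in the data store. The sources, the rule (store monetary amounts as integer cents), and all names are illustrative assumptions.

```python
# Stand-ins for rows queried from two separate databases.
source_a = [{"id": 1, "amount": "100.50"}]
source_b = [{"id": 2, "amount": "99.00"}]

def transform(row):
    """Business rule (example): amounts must be stored as integer cents."""
    return {"id": row["id"], "amount_cents": round(float(row["amount"]) * 100)}

# The "transformation server" step, then the load into the data store.
data_store = [transform(r) for r in source_a + source_b]
print(data_store)
# [{'id': 1, 'amount_cents': 10050}, {'id': 2, 'amount_cents': 9900}]
```

Centralizing the rule in one transformation step means every source lands in the store in a single consistent shape, regardless of how each legacy system represents it.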
This is accomplished by continuously monitoring, tracking, alerting, analyzing, and troubleshooting data workflows to reduce problems and prevent data errors or system downtime.” Our perspective: Gartner does a good job covering what’s being monitored and the workflows associated with data observability tools.
For example, a global media company struggled because they were juggling different tools like Fivetran for bringing in data, dbt for transforming it, Airflow for coordinating everything, Monte Carlo for monitoring and scanning for troubled data, and Hightouch for getting data out to other systems.
Follow Mico on LinkedIn. 3) Chad Sanderson, Chief Operator at Data Quality Camp. Chad has extensive experience in data, from hyper-growth startups to established tech supergiants to small businesses just dipping their toes into ecommerce. Seth has experience in leading corporate-wide strategies across complex corporate organizations.
The Azure Data Engineer Certification exam evaluates one's ability to design and implement data processing, security, and storage, as well as to monitor and optimize data processing and storage. Why Should You Get an Azure Data Engineer Certification?
Here’s how Prefect, a Series B startup and creator of the popular data orchestration tool, harnessed the power of data observability to preserve headcount, improve data quality, and reduce time to detection and resolution for data incidents. But Monte Carlo doesn’t stop at the “most important” tables.
From those home-made beginnings as Compass, Elasticsearch has matured into one of the leading enterprise search engines, standing among the top 10 most popular database management systems globally according to the Stack Overflow 2023 Developer Survey. But like any technology, it has its share of pros and cons.
One of the key elements of Azure Data Factory that permits data integration between various network environments is Integration Runtime. It offers the infrastructure needed to transfer data safely between cloud and on-site data storage.
Theoretically, data and analytics should be the backbones of decision-making in business. Like mitochondria power a cell, data powers a business. But for most companies, that’s not the reality: today, there are no intelligent systems that deliver data at the pace, and with the impact, leaders need to power the business.
What is data pipeline architecture? Data pipeline architecture is the process of designing how data is surfaced from its source system to the consumption layer. Data orchestration: Airflow is the most common data orchestrator used by data teams. Now Go Build Some Data Pipelines!
Data pipeline observability means having robust monitoring, logging, and alerting across all pipeline components. Metadata is the foundation of observability, providing essential insights into data pipeline health, execution status, dependencies, and performance metrics. Did yesterday’s data load complete successfully?
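That last question is exactly what run metadata answers. A minimal sketch of metadata-driven observability, where pipeline runs write a status row and an alerting check reads it back; the schema and pipeline name are illustrative assumptions.

```python
from datetime import date, timedelta

# Stand-in for a run-metadata table that every pipeline run appends to.
run_log = [
    {"pipeline": "orders_load", "run_date": date(2024, 4, 30), "status": "success"},
]

def did_load_complete(pipeline, run_date):
    """Check the metadata for a successful run of `pipeline` on `run_date`."""
    return any(r["pipeline"] == pipeline and r["run_date"] == run_date
               and r["status"] == "success" for r in run_log)

today = date(2024, 5, 2)  # fixed for the example; normally date.today()
yesterday = today - timedelta(days=1)
if not did_load_complete("orders_load", yesterday):
    print("ALERT: orders_load did not complete for", yesterday)
```

The point is that the answer comes from metadata the pipeline already emits, not from re-inspecting the data itself.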
DevOps tasks — for example, creating scheduled backups and restoring data from them. Airflow is especially useful for orchestrating Big Data workflows. Airflow is not a data processing tool by itself but rather an instrument to manage multiple components of data processing. Metadata database. Since the 2.0
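A minimal Airflow DAG sketch illustrating that orchestrator role: Airflow schedules and orders the tasks, while the actual data processing lives in the called code. The DAG id, task bodies, and schedule are example values; `schedule` requires Airflow 2.4+ (older versions use `schedule_interval`).

```python
# Requires: pip install apache-airflow
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source")

def load():
    print("write processed data to the warehouse")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_load  # dependency: load runs only after extract succeeds
```

Everything about the run — state, retries, schedule position — is tracked in Airflow's metadata database, which is what makes the orchestration observable.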
Automated Data Classification and Governance LLMs are reshaping governance practices. Grab’s Metasense , Uber’s DataK9 , and Meta’s classification systems use AI to automatically categorize vast data sets, reducing manual efforts and improving accuracy.
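A hedged sketch of what LLM-assisted classification can look like in practice; this is an illustration in the spirit of those systems, not Grab's, Uber's, or Meta's actual code. It assumes the openai package, an OPENAI_API_KEY in the environment, and uses an example model name.

```python
# Requires: pip install openai
from openai import OpenAI

client = OpenAI()

def classify_column(column_name, sample_values):
    """Ask the model whether a column contains PII, based on name and samples."""
    prompt = (
        "Classify this database column as PII or NON_PII. Reply with one word.\n"
        f"Column: {column_name}\nSamples: {sample_values}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

print(classify_column("email_address", ["a@example.com", "b@example.com"]))
# Expected: PII
```

Production systems add sampling policies, human review queues, and confidence thresholds on top of this core loop, which is where most of the accuracy gains come from.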