Meanwhile, operations teams use entity extraction on documents to automate workflows and enable metadata-driven analytical filtering. And to create significant technology and team efficiencies, organizations need to consider opportunities to integrate LLM pipelines with existing structured data workflows.
Summary The binding element of all data work is the metadata graph that is generated by all of the workflows that produce the assets used by teams across the organization. The DataHub project was created as a way to bring order to the scale of LinkedIn’s data needs. How is the governance of DataHub being managed?
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. What kinds of questions are you answering with table metadata, and what use case/team does that support? What is the comparative utility of the Iceberg REST catalog? What are the shortcomings of Trino and Iceberg?
Canva writes about its custom solution using dbt and metadata capturing to attribute costs, monitor performance, and enable data-driven decision-making, significantly enhancing its Snowflake environment management. [link] JBarti: Write Manageable Queries With The BigQuery Pipe Syntax. Our quest to simplify SQL is always an adventure.
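For a taste of what the pipe syntax looks like, here is a minimal sketch run through the official google-cloud-bigquery Python client; the project, table, and column names are made up for illustration, and the query shape follows the pipe-syntax examples in Google's documentation.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials

# Pipe syntax chains operators top to bottom instead of nesting subqueries.
query = """
FROM `my_project.sales.orders`
|> WHERE order_date >= '2024-01-01'
|> AGGREGATE SUM(amount) AS total_amount GROUP BY customer_id
|> ORDER BY total_amount DESC
|> LIMIT 10
"""

for row in client.query(query).result():
    print(row.customer_id, row.total_amount)
```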
Deploy DataOps DataOps, or Data Operations, is an approach that applies the principles of DevOps to data management. It aims to streamline and automate data workflows, enhance collaboration, and improve the agility of data teams. How effective are your current data workflows?
Metadata is the information that provides context and meaning to data, ensuring it’s easily discoverable, organized, and actionable. It enhances data quality, governance, and automation, transforming raw data into valuable insights. This is what managing data without metadata feels like. Chaos, right?
TL;DR After setting up and organizing the teams, we describe 4 topics to make data mesh a reality. Data as Code is a very strong choice: we do not want any UI, because that is a legacy of the ETL era. What you have to code is this workflow! We want to have our hands free and be totally devoted to DevOps principles.
These tools can be called by LLM systems to learn about your data and metadata. As with any AI workflow, remember to take appropriate caution when giving these tools access to production systems and data. What is the best workflow for the current iteration of the MCP server?
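As a concrete illustration of that caution, here is a minimal sketch of a read-only metadata tool, assuming the official `mcp` Python SDK's FastMCP helper; the allow-list, table names, and returned fields are hypothetical.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("metadata-server")

ALLOWED_TABLES = {"orders", "customers"}  # explicit allow-list, hypothetical

@mcp.tool()
def describe_table(table: str) -> dict:
    """Return column-level metadata for an allow-listed table."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"table {table!r} is not exposed to this server")
    # A real server would query your catalog here; the response is stubbed.
    return {"table": table, "columns": ["id", "created_at"], "owner": "data-eng"}

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```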
In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in applying these constraints to your data workflows. Atlan is the metadata hub for your data ecosystem. Missing data?
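For flavor, here is a minimal sketch of declaring such constraints, assuming Great Expectations' classic pandas-flavored API (newer releases use a different entry point); the column names are hypothetical.

```python
import pandas as pd
import great_expectations as ge

df = pd.DataFrame({"user_id": [1, 2, None], "amount": [10.0, 25.5, 3.2]})
gdf = ge.from_pandas(df)  # wraps the DataFrame with expectation methods

# Constraints declared next to the workflow that consumes the data.
not_null = gdf.expect_column_values_to_not_be_null("user_id")
in_range = gdf.expect_column_values_to_be_between("amount", min_value=0)

print(not_null.success, in_range.success)  # False, True -- the missing user_id is caught
```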
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code.
Data Engineering Weekly readers get a 15% discount by registering at the following link. [link] Gustavo Akashi: Building data pipelines effortlessly with a DAG Builder for Apache Airflow. Every code-first data workflow grew into a UI-based or YAML-based workflow.
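For reference, the code-first starting point looks roughly like this minimal Airflow DAG with stubbed task logic; the DAG ID, schedule, and callables are hypothetical, and the `schedule` argument assumes Airflow 2.4+.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull rows from the source system")

def load():
    print("write rows to the warehouse")

# One DAG, two tasks, the dependency declared in code rather than a UI.
with DAG(
    dag_id="example_pipeline",       # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task        # load runs only after extract succeeds
```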
Summary A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information.
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. RudderStack helps you build a customer data platform on your warehouse or data lake.
For each data logs table, we initiate a new worker task that fetches the relevant metadata describing how to correctly query the data. Once we know what to query for a specific table, we create a task for each partition that executes a job in Dataswarm (our data pipeline system).
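A toy sketch of that fan-out pattern, using Python's standard concurrent.futures in place of the internal systems mentioned above; every function and table name here is a hypothetical stand-in.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_table_metadata(table: str) -> dict:
    # Stand-in for looking up how to correctly query this data logs table.
    return {"table": table, "partitions": ["2024-01-01", "2024-01-02"]}

def process_partition(table: str, partition: str) -> str:
    # Stand-in for the per-partition job submitted to the pipeline system.
    return f"processed {table}/{partition}"

tables = ["user_events", "payment_events"]
with ThreadPoolExecutor() as pool:
    # One task per table fetches the metadata first...
    metas = list(pool.map(fetch_table_metadata, tables))
    # ...then one task per partition does the actual work.
    futures = [
        pool.submit(process_partition, m["table"], p)
        for m in metas
        for p in m["partitions"]
    ]
    for f in futures:
        print(f.result())
```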
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. The only thing worse than having bad data is not knowing that you have it.
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. What are the contributing factors that lead to fragmentation of visibility for data workflows at different stages?
Data lakes are notoriously complex. For data engineers who battle to build and scale high-quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics.
Input: List of source tables and required processing mode. Output: Psyberg identifies new events that have occurred since the last high watermark (HWM) and records them in the session metadata table. The session metadata table can then be read to determine the pipeline input. Audit: Run various quality checks on the staged data.
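A toy sketch of the high-watermark idea described above; the session-metadata layout and function names are hypothetical, not Psyberg's actual schema.

```python
from datetime import datetime

session_metadata = []  # stand-in for the session metadata table

def detect_new_events(events: list, last_hwm: datetime) -> datetime:
    """Record events newer than the last HWM and return the advanced HWM."""
    new = [e for e in events if e["ts"] > last_hwm]
    if not new:
        return last_hwm
    hwm = max(e["ts"] for e in new)
    # Downstream steps read this table to determine the pipeline input.
    session_metadata.append({"hwm": hwm, "new_event_count": len(new)})
    return hwm

events = [{"ts": datetime(2024, 1, 1, 10)}, {"ts": datetime(2024, 1, 1, 12)}]
hwm = detect_new_events(events, last_hwm=datetime(2024, 1, 1, 11))
print(hwm, session_metadata)  # only the 12:00 event is past the watermark
```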
Finding the data that you need is tricky, and Amundsen will help you solve that problem. And as your data grows in volume and complexity, there are foundational principles that you can follow to keep data workflows streamlined.
Summary The life sciences industry has seen incredible growth in scale and sophistication, along with the advances in data technology that make it possible to analyze massive amounts of genomic information. You can observe your pipelines with built-in metadata search and column-level lineage.
A data catalog as a passive web portal to display metadata requires significant rethinking to fit modern data workflows, not just adding "modern" as a prefix. I know that is an expensive statement to make 😊 To be fair, I'm a big fan of data catalogs, or metadata management, to be precise.
You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark.
What portions of the data workflow is Atlan responsible for? What components of the data stack might Atlan replace? How would you characterize Atlan’s position in the current data ecosystem? What makes Atlan stand out from other systems for data cataloguing, metadata management, or data governance?
Summary The data industry is changing rapidly, and one of the most active areas of growth is automation of data workflows. Taking cues from the DevOps movement of the past decade, data professionals are orienting around the concept of DataOps.
An HDFS Master Node, called a NameNode, keeps metadata with critical information about system files (like their names, locations, number of data blocks in the file, etc.) and keeps track of storage capacity, the volume of data being transferred, etc. Among the solutions facilitating data management is the Apache Hadoop ecosystem.
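A toy in-memory model of the NameNode bookkeeping described above, mapping files to blocks and blocks to DataNodes; real HDFS metadata is far richer, and the paths and node names here are illustrative only.

```python
# file path -> its blocks, and each block -> the DataNodes holding replicas
namespace = {
    "/logs/2024-01-01.log": {
        "blocks": ["blk_001", "blk_002"],
        "replicas": {"blk_001": ["dn1", "dn3"], "blk_002": ["dn2", "dn3"]},
    }
}

def locate(path: str) -> list:
    """Which DataNodes must be contacted to read this file?"""
    meta = namespace[path]
    return sorted({dn for blk in meta["blocks"] for dn in meta["replicas"][blk]})

print(locate("/logs/2024-01-01.log"))  # ['dn1', 'dn2', 'dn3']
```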
Grab’s Metasense, Uber’s DataK9, and Meta’s classification systems use AI to automatically categorize vast data sets, reducing manual efforts and improving accuracy. Beyond classification, organizations now use AI for automated metadata generation and data lineage tracking, creating more intelligent data infrastructures.
It facilitates data synchronisation, replication, real-time analytics, and event-driven processing, empowering data-driven decision-making and operational efficiency. These additional columns store metadata like timestamps, user IDs, and change types, ensuring granular change tracking and auditability.
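A toy sketch of CDC records carrying those metadata columns and being applied to a target table in commit order; the field names are hypothetical.

```python
target = {1: {"name": "Ada"}}  # target table keyed by primary key

# Each change row carries the metadata columns: change type, timestamp, user.
changes = [
    {"pk": 1, "name": "Ada L.", "_op": "UPDATE", "_ts": "2024-01-02T10:00:00Z", "_user": "svc_etl"},
    {"pk": 2, "name": "Grace",  "_op": "INSERT", "_ts": "2024-01-02T10:01:00Z", "_user": "svc_etl"},
    {"pk": 1, "name": None,     "_op": "DELETE", "_ts": "2024-01-02T10:02:00Z", "_user": "svc_etl"},
]

for c in sorted(changes, key=lambda c: c["_ts"]):  # apply in commit order
    if c["_op"] == "DELETE":
        target.pop(c["pk"], None)
    else:  # INSERT and UPDATE both upsert the row
        target[c["pk"]] = {"name": c["name"]}

print(target)  # {2: {'name': 'Grace'}}
```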
Unity Catalog is Databricks’ governance solution, which integrates with Databricks workspaces and provides a centralized platform for managing metadata, data access, and security. Improved Data Discovery: The tagging and documentation features in Unity Catalog facilitate better data discovery.
At its core, a table format is a sophisticated metadata layer that defines, organizes, and interprets multiple underlying data files. Table formats incorporate aspects like columns, rows, data types, and relationships, but can also include information about the structure of the data itself.
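A toy model of that metadata layer; real formats such as Iceberg, Delta Lake, and Hudi track much more, and these dataclass fields are illustrative only.

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    snapshot_id: int
    data_files: list  # underlying files visible at this version

@dataclass
class TableMetadata:
    schema: dict          # column name -> type
    partition_spec: list  # columns the data is partitioned by
    snapshots: list = field(default_factory=list)

    def current_files(self) -> list:
        """Readers see only the files of the latest snapshot."""
        return self.snapshots[-1].data_files if self.snapshots else []

tbl = TableMetadata(
    schema={"id": "bigint", "ts": "timestamp"},
    partition_spec=["ts_day"],
    snapshots=[Snapshot(1, ["s3://bucket/a.parquet", "s3://bucket/b.parquet"])],
)
print(tbl.current_files())
```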
Netflix Scheduler is built on top of Meson, which is a general-purpose workflow orchestration and scheduling framework to execute and manage the lifecycle of the data workflow. Bulldozer makes data warehouse tables more accessible to different microservices and reduces each individual team’s burden to build their own solutions.
SiliconANGLE theCUBE: Analyst Predictions 2023 - The Future of Data Management. By far one of the best analyses of trends in data management. The panel's 2023 predictions include: unified metadata becomes the kingmaker. The author walked through various strategies, from sync to async job submission to batch job submission.
Hex just launched an integration with dbt! It uses the dbt Cloud Metadata API to surface metadata from dbt right in Hex, letting you quickly get the context you need on things like data freshness without juggling multiple apps and browser tabs. Get started here. Things to Watch: What's missing? Spreadsheets?
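Back to the integration: under the hood the dbt Cloud Metadata API is GraphQL, so querying it directly might look roughly like the sketch below. The endpoint, job ID, and field names are assumptions meant to illustrate the shape; check the dbt Cloud docs before relying on them.

```python
import requests

DBT_CLOUD_TOKEN = "..."  # a service token; keep it out of source control

# Hypothetical query shape -- field names vary across API versions.
QUERY = """
query ($jobId: Int!) {
  models(jobId: $jobId) { uniqueId executeCompletedAt status }
}
"""

resp = requests.post(
    "https://metadata.cloud.getdbt.com/graphql",  # endpoint is an assumption
    json={"query": QUERY, "variables": {"jobId": 12345}},
    headers={"Authorization": f"Bearer {DBT_CLOUD_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for model in resp.json()["data"]["models"]:
    print(model["uniqueId"], model["executeCompletedAt"])
```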
Editor’s Note: The current state of the Data Catalog. The results are out for our poll on the current state of data catalogs. The highlight is that 59% of folks think data catalogs are sometimes helpful. We saw in the Data Catalog poll how far it has to go to be helpful and active within a data workflow.
This enables auto propagation of backfill data in multi-stage pipelines. Netflix Maestro: Maestro is the Netflix data workflow orchestration platform built to meet the current and future needs of Netflix. As we know, an Iceberg table contains a list of snapshots with a set of metadata and data files.
Cloudera provides a unified platform with multiple data apps and tools, big data management, hybrid cloud deployment flexibility, admin tools for platform provisioning and control, and a shared data experience for centralized security, governance, and metadata management. Expansion beyond core data management.
Data orchestration is the process of efficiently coordinating the movement and processing of data across multiple, disparate systems and services within a company. So, why is data orchestration a big deal? It automates and optimizes data processes, reducing manual effort and the likelihood of errors.
DataOps is a collaborative approach to data management that combines the agility of DevOps with the power of data analytics. It aims to streamline data ingestion, processing, and analytics by automating and integrating various dataworkflows.
DataOps tools should provide a comprehensive data cataloging solution that allows organizations to create a centralized repository of their data assets, complete with metadata, data lineage information, and data samples.
[link] Data Engineering Weekly: Data Catalog - A Broken Promise. Data catalogs are the most expensive data integration systems you never intended to build.
With the high growth of workflows in the past few years, increasing at more than 100% a year, the need for a scalable data workflow orchestrator has become paramount for Netflix’s business needs. A workflow instance is an execution of a workflow; similarly, an execution of a step is called a step instance.
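A toy illustration of that vocabulary: a workflow definition spawning a workflow instance that contains one step instance per step. The class and field names are hypothetical, not Maestro's actual model.

```python
from dataclasses import dataclass, field
import itertools

_run_ids = itertools.count(1)

@dataclass
class StepInstance:
    step_id: str
    state: str = "PENDING"

@dataclass
class WorkflowInstance:
    workflow_id: str
    run_id: int = field(default_factory=lambda: next(_run_ids))
    steps: list = field(default_factory=list)

# The workflow definition is just an ordered list of step names here.
definition = ["extract", "transform", "load"]
run = WorkflowInstance("daily_pipeline",
                       steps=[StepInstance(s) for s in definition])
print(run.run_id, [(s.step_id, s.state) for s in run.steps])
```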
AI-powered Monitor Recommendations that leverage the power of data profiling to suggest appropriate monitors based on rich metadata and historic patterns — greatly simplifying the process of discovering, defining, and deploying field-specific monitors.
Disadvantages of a data lake are:
- It can easily become a data swamp.
- Data has no versioning.
- The same data with incompatible schemas is a problem without versioning.
- It has no metadata associated.
- It is difficult to join the data.
A data warehouse stores processed data, mostly structured data.
The governance aspect is perhaps even more important, and businesses need to be able to understand where the data comes from. Data lineage, personally identifiable information (PII), and metadata all fall under a broad data governance banner, which is critically important in terms of what needs to be protected and mapped out.