Data Pipeline Logging Best Practices — 1. Introduction; 2. Setup & logging architecture; 3.1. Metadata: information about pipeline runs and the data flowing through your pipeline; 3.2. Obtain visibility into the code’s execution sequence using text logs; 3.3. Monitoring UI & traceability.
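As a concrete starting point, here is a minimal sketch (assuming a Python pipeline; the step names and fields are illustrative, not from the article) of text logs that carry run-level metadata so the execution sequence stays visible:

```python
import json
import logging
import uuid
from datetime import datetime, timezone

# Illustrative sketch: attach run-level metadata to every log line so the
# pipeline's execution sequence can be reconstructed from text logs alone.
logger = logging.getLogger("pipeline")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_step(run_id: str, step: str, **details) -> None:
    """Emit one structured log record for a pipeline step."""
    record = {
        "run_id": run_id,
        "step": step,
        "ts": datetime.now(timezone.utc).isoformat(),
        **details,
    }
    logger.info(json.dumps(record))

run_id = str(uuid.uuid4())  # one id per pipeline run
log_step(run_id, "extract", rows_read=10_000, source="orders_db")
log_step(run_id, "transform", rows_out=9_842)
log_step(run_id, "load", target="warehouse.orders_daily")
```

Keying every record on a shared run_id is what lets a log search or monitoring UI stitch individual steps back into one traceable run.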
by Jasmine Omeke, Obi-Ike Nwoke, and Olek Gorajek. Intro: This post is for all data practitioners who are interested in learning about the bootstrapping, standardization, and automation of batch data pipelines at Netflix. You may remember Dataflow from the post we wrote last year, titled Data pipeline asset management with Dataflow.
Summary: Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. To level up its value, a new trend of active metadata is being implemented, enabling use cases like keeping BI reports up to date, auto-scaling your warehouses, and automating data governance.
Today, we are going to apply these principles to data pipelines. The idea is to transpose these 7 principles to data pipelines, knowing that data pipelines are 100% flexible: if you have the skills, you build the pipeline you want. What does a bad data pipeline process look like?
Attributing Snowflake cost to whom it belongs — Fernando shares ideas about using metadata management to better attribute Snowflake costs. Forward thinking: Dataviz is hierarchical — Malloy, once again, provides an excellent article about a new way to see data visualisations. This is Croissant. It's inspirational.
However, we've found that this vertical self-service model doesn't work particularly well for data pipelines, which involve wiring together many different systems into end-to-end data flows. Data pipelines power foundational parts of LinkedIn's infrastructure, including replication between data centers.
Summary A significant source of friction and wasted effort in building and integrating data management systems is the fragmentation of metadata across various tools. What are the common challenges faced by engineers and data practitioners in organizing the metadata for their systems? What are the goals of the project?
Below is the entire set of steps in the data lifecycle, and each step in the lifecycle will be supported by a dedicated blog post (see Fig. 1): Data Collection – data ingestion and monitoring at the edge (whether the edge be industrial sensors or people in a vehicle showroom).
We cannot scale our expertise as fast as we can scale the Data Cloud. There are just not enough hours in a day to do all the data profiling, design, and coding required to build, deploy, manage, and troubleshoot an ever-growing set of data pipelines with transformations.
Summary The binding element of all data work is the metadata graph that is generated by all of the workflows that produce the assets used by teams across the organization. The DataHub project was created as a way to bring order to the scale of LinkedIn’s data needs. How is the governance of DataHub being managed?
Data Pipeline Observability: A Model for Data Engineers — Eitan Chazbani, June 29, 2023. Data pipeline observability is your ability to monitor and understand the state of a data pipeline at any time. We believe the world’s data pipelines need better data observability.
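To make "the state of a data pipeline at any time" concrete, here is a minimal sketch (all names hypothetical, not from the article) of the run-state signal an observability tool would collect and expose:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

# Illustrative sketch: a tiny in-memory model of pipeline run state, the
# kind of signal an observability platform would collect and surface.
class RunState(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

@dataclass
class PipelineRun:
    pipeline: str
    state: RunState = RunState.RUNNING
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    rows_processed: int = 0
    error: str | None = None

    def finish(self, rows: int) -> None:
        self.state = RunState.SUCCEEDED
        self.rows_processed = rows

    def fail(self, error: str) -> None:
        self.state = RunState.FAILED
        self.error = error

run = PipelineRun("orders_daily")
run.finish(rows=9_842)
print(run.pipeline, run.state.value, run.rows_processed)
```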
In this blog post we will put these capabilities in context and dive deeper into how the built-in, end-to-end data flow life cycle enables self-service data pipeline development. Key requirements for building data pipelines: every data pipeline starts with a business requirement.
What kinds of questions are you answering with table metadata? What use case/team does that support? What is the comparative utility of the Iceberg REST catalog? What are the shortcomings of Trino and Iceberg? What were the requirements and selection criteria that led to the selection of that combination of technologies?
Metadata is the information that provides context and meaning to data, ensuring it’s easily discoverable, organized, and actionable. It enhances data quality, governance, and automation, transforming raw data into valuable insights. This is what managing data without metadata feels like. Chaos, right?
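For illustration, a minimal sketch (all fields hypothetical) of the descriptive metadata that makes a dataset discoverable, organized, and actionable:

```python
from dataclasses import dataclass, field

# Illustrative sketch: descriptive metadata attached to a dataset so a
# catalog could index it for search by name, owner, tag, or column.
@dataclass
class DatasetMetadata:
    name: str
    owner: str
    description: str
    tags: set[str] = field(default_factory=set)
    schema: dict[str, str] = field(default_factory=dict)  # column -> type

orders = DatasetMetadata(
    name="warehouse.orders_daily",
    owner="data-eng@example.com",
    description="One row per order, refreshed daily from orders_db.",
    tags={"certified", "finance"},
    schema={"order_id": "bigint", "country": "string", "amount": "decimal"},
)
print(sorted(orders.tags))
```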
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust.
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it.
We’ll discuss batch data processing, the limitations we faced, and how Psyberg emerged as a solution. Furthermore, we’ll delve into the inner workings of Psyberg, its unique features, and how it integrates into our data pipelining workflows. This is mainly used to identify new changes since the last update.
TL;DR: After setting up and organizing the teams, we describe 4 topics to make data mesh a reality. We want interoperability for any data stored, rather than having to think about how to store the data in a specific node to optimize processing. We want to have our hands free and be totally devoted to DevOps principles.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code.
The mission of the data discovery team is twofold: 1) the data team must discover the data in the IT landscape; 2) the data team must make the data in the IT landscape discoverable to the operations team. Let’s unpack this. 1) The data discovery team must work on discovering the IT landscape.
Data teams worldwide are building data pipelines with the point solutions that make up the “modern data stack.” Pipelines built with this approach are slow and require constant manual reprogramming and updates. For a deep dive into this level of automation, take a look at our What Is Data Pipeline Automation paper.
Today’s post follows the same philosophy: fitting local and cloud pieces together to build a data pipeline. And, when it comes to data engineering solutions, it’s no different: they have databases, ETL tools, streaming platforms, and so on — a set of tools that makes our lives easier (as long as you pay for them). Not sponsored.
Now, let’s explore the state of our pipelines after incorporating Psyberg. Pipelines after Psyberg: let’s explore how different modes of Psyberg could help with a multistep data pipeline. The session metadata table can then be read to determine the pipeline input.
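As a rough illustration (this is not Psyberg's actual API; the table name, columns, and helper are hypothetical), a session metadata table can carry a per-step watermark that a downstream step reads to determine its input window:

```python
import sqlite3

# Hypothetical sketch (not Psyberg's actual API): a session metadata table
# records, per pipeline step, the high-water mark it has processed. A
# downstream step reads it to decide which data range to consume next.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE session_metadata (
           pipeline TEXT, step TEXT, processed_through TEXT)"""
)
conn.execute(
    "INSERT INTO session_metadata VALUES ('orders', 'load', '2023-06-28T00:00:00')"
)

def next_input_window(pipeline: str, step: str) -> str:
    """Return the last processed watermark; new input starts after it."""
    row = conn.execute(
        "SELECT MAX(processed_through) FROM session_metadata "
        "WHERE pipeline = ? AND step = ?",
        (pipeline, step),
    ).fetchone()
    return row[0]

print("process changes after:", next_input_window("orders", "load"))
```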
In every step, we do not just read, transform, and write data; we also do the same with the metadata. In the last part, the data security and privacy aspects were added. Every data governance policy on this topic must be readable by code so that it can act on your data platform (access management, masking, etc.).
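For example, a minimal policy-as-code sketch (the tag names and masking rule are assumptions, not from the article) in which columns tagged as PII in the metadata are masked automatically:

```python
# Hypothetical sketch: a governance policy expressed as code. Columns
# tagged "pii" in the metadata are masked before data is served.
POLICY = {"mask_tags": {"pii"}}  # assumed policy format

COLUMN_TAGS = {  # assumed metadata: column -> tags
    "email": {"pii"},
    "country": set(),
}

def apply_masking(row: dict[str, str]) -> dict[str, str]:
    """Mask values in columns whose metadata tags match the policy."""
    return {
        col: "***MASKED***" if COLUMN_TAGS.get(col, set()) & POLICY["mask_tags"] else val
        for col, val in row.items()
    }

print(apply_masking({"email": "a@example.com", "country": "FR"}))
```

Because the rule keys off metadata tags rather than hard-coded column names, the same code enforces the policy on any dataset the catalog has tagged.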
As a data or analytics engineer, you knew where to find all the transformation logic and models because they were all in the same codebase. You probably worked closely with the colleague who built the data pipeline you were consuming. There was only one data team, two at most.
Engineers from across the company came together to share best practices on everything from Data Processing Patterns to Building Reliable Data Pipelines. The result was a series of talks which we are now sharing with the rest of the Data Engineering community!
We’re excited to announce today that Snowflake has invested in Metaplane , a leading end-to-end data observability platform that helps data teams improve the quality and performance of their data. Metaplane ensures that every company can trust the data that powers their business.
In today’s data-driven world, efficient data pipelines have become the backbone of successful organizations. These pipelines ensure that data flows smoothly from various sources to its intended destinations, enabling businesses to make informed decisions and gain valuable insights.
Stemming from this analogy, “you” is the orchestrator in data orchestration, and the recipe is the data pipeline. Airflow was created in 2014 by Airbnb and has since been widely adopted by the data engineering community, primarily because it was the first orchestrator to allow authoring data pipelines programmatically.
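For instance, a minimal Airflow DAG showing pipelines-as-code; the dag_id, schedule, and task callables are illustrative, and the schedule argument assumes Airflow 2.4+:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Illustrative DAG: the names are made up, but the structure is standard
# Airflow — tasks and dependencies declared in plain Python.
def extract():
    print("pulling raw data")

def transform():
    print("cleaning and joining")

with DAG(
    dag_id="orders_daily",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task  # dependencies are just Python expressions
```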
Continuous Integration and Continuous Delivery (CI/CD) for Data Pipelines: It Is a Game-Changer with AnalyticsCreator! The need for efficient and reliable data pipelines is paramount in data science and data engineering. They transform data into a consistent format for users to consume.
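As one hedged example (a generic sketch, not AnalyticsCreator's own tooling), the CI half of this usually means running tests like the following against a pipeline change before it is merged and deployed:

```python
# Illustrative sketch of a CI check for a pipeline change: validate the
# transformation's output contract so regressions are caught pre-deploy.
def transform(rows: list[dict]) -> list[dict]:
    """The pipeline step under test: normalizes country codes."""
    return [{**r, "country": r["country"].upper()} for r in rows]

def test_transform_output_contract():
    out = transform([{"order_id": 1, "country": "fr"}])
    assert all(r["country"].isupper() for r in out)
    assert {"order_id", "country"} <= out[0].keys()

if __name__ == "__main__":
    test_transform_output_contract()  # in CI this would run under pytest
    print("contract check passed")
```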
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day.
These engineering functions are almost exclusively concerned with data pipelines, spanning ingestion, transformation, orchestration, and observation — all the way to data product delivery to the business tools and downstream applications. Pipelines need to grow faster than the cost to run them.
Experience Enterprise-Grade Apache Airflow: Astro augments Airflow with enterprise-grade features to enhance productivity, meet scalability and availability demands across your data pipelines, and more. [link] Picnic: Open-sourcing dbt-score: lint model metadata with ease!
Again, be prepared to face metadata challenges.