by Jasmine Omeke, Obi-Ike Nwoke, and Olek Gorajek. Intro: This post is for all data practitioners who are interested in learning about the bootstrapping, standardization, and automation of batch data pipelines at Netflix. You may remember Dataflow from the post we wrote last year titled Data pipeline asset management with Dataflow.
Summary Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. A variety of platforms have been developed to capture and analyze that information to great effect, but they are inherently limited in their utility due to their nature as storage systems.
We're explaining the end-to-end systems the Facebook app leverages to deliver relevant content to people. At Facebook's scale, the systems built to support and overcome these challenges require extensive trade-off analyses, focused optimizations, and architecture built to allow our engineers to push for the same user and business outcomes.
Today, we are going to apply these principles to data pipelines. The idea is to transpose these 7 principles to data pipelines, knowing that data pipelines are 100% flexible: if you have the skills, you can build the pipeline you want. What does a bad data pipeline process look like?
However, we've found that this vertical self-service model doesn't work particularly well for data pipelines, which involve wiring together many different systems into end-to-end data flows. Data pipelines power foundational parts of LinkedIn's infrastructure, including replication between data centers.
In this episode Ian Schweer shares his experiences at Riot Games supporting player-focused features such as machine learning models and recommender systems that are deployed as part of the game binary. Atlan is the metadata hub for your data ecosystem. Step off the hamster wheel and opt for an automated data pipeline like Hevo.
Summary A significant source of friction and wasted effort in building and integrating data management systems is the fragmentation of metadata across various tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems.
Summary The binding element of all data work is the metadata graph that is generated by all of the workflows that produce the assets used by teams across the organization. The DataHub project was created as a way to bring order to the scale of LinkedIn’s data needs. No more scripts, just SQL.
Below is the entire set of steps in the data lifecycle, and each step in the lifecycle will be supported by a dedicated blog post (see Fig. 1): Data Collection – data ingestion and monitoring at the edge (whether the edge be industrial sensors or people in a vehicle showroom).
Data Pipeline Observability: A Model For Data Engineers Eitan Chazbani June 29, 2023 Data pipeline observability is your ability to monitor and understand the state of a data pipeline at any time. We believe the world’s data pipelines need better data observability.
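As a minimal illustration of that idea (the function and event fields below are hypothetical, not from the article), a pipeline run can emit structured state events that an observability layer consumes to answer "what is this pipeline doing right now?":

```python
import json
import time

def emit_state(pipeline: str, stage: str, status: str, **extra):
    """Emit a structured state event; a real system would send this
    to an observability backend instead of printing to stdout."""
    event = {
        "pipeline": pipeline,
        "stage": stage,
        "status": status,  # e.g. "started", "succeeded", "failed"
        "ts": time.time(),
        **extra,
    }
    print(json.dumps(event))

# Example: instrumenting a two-stage batch run.
emit_state("daily_orders", "extract", "started")
emit_state("daily_orders", "extract", "succeeded", rows=10_432)
emit_state("daily_orders", "transform", "failed", error="null order_id")
```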
Summary Building a well managed data ecosystem for your organization requires a holistic view of all of the producers, consumers, and processors of information. The team at Metaphor are building a fully connected metadata layer to provide both technical and social intelligence about your data. No more scripts, just SQL.
In this episode Guy Yachdav, director of software engineering for ImmunAI, shares the complexities that are inherent to managing data workflows for bioinformatics. RudderStack’s smart customer data pipeline is warehouse-first. You can observe your pipelines with built-in metadata search and column-level lineage.
As we look towards 2025, it’s clear that data teams must evolve to meet the demands of evolving technology and opportunities. In this blog post, we’ll explore key strategies that data teams should adopt to prepare for the year ahead. Tool sprawl is another hurdle that data teams must overcome.
ERP and CRM systems are designed and built to fulfil a broad range of business processes and functions. This generalisation makes their data models complex and cryptic, requiring domain expertise. As you do not want to start your development with uncertainty, you decide to go for the operational raw data directly.
The system leverages a combination of an event-based storage model in its TimeSeries Abstraction and continuous background aggregation to calculate counts across millions of counters efficiently. [link] Grab: Metasense V2 - Enhancing, improving, and productionisation of LLM-powered data governance. Boyter on Bloom Filters and SQLite.
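As a rough sketch of that pattern (all names here are illustrative, not Netflix's actual API): writes append cheap count events, a background task folds them into pre-aggregated rollups, and reads combine the rollup with the not-yet-aggregated tail.

```python
from collections import defaultdict

# Append-only event log, as an event-based storage model would record it.
events: list[tuple[str, int]] = []
rollups: dict[str, int] = defaultdict(int)  # pre-aggregated counts
cursor = 0  # position up to which events have been aggregated

def increment(counter: str, delta: int = 1) -> None:
    events.append((counter, delta))  # cheap write path: just append

def background_aggregate() -> None:
    """Periodically fold new events into the rollup table."""
    global cursor
    for counter, delta in events[cursor:]:
        rollups[counter] += delta
    cursor = len(events)

def get_count(counter: str) -> int:
    """Serve reads from the rollup plus the unaggregated tail."""
    tail = sum(d for c, d in events[cursor:] if c == counter)
    return rollups[counter] + tail

increment("play_events", 3)
background_aggregate()
increment("play_events")
print(get_count("play_events"))  # 4
```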
What are the other systems that feed into and rely on the Trino/Iceberg service? What kinds of questions are you answering with table metadata? What use case/team does that support? What is the comparative utility of the Iceberg REST catalog? What are the shortcomings of Trino and Iceberg? Email hosts@dataengineeringpodcast.com with your story.
As a data or analytics engineer, you knew where to find all the transformation logic and models because they were all in the same codebase. You probably worked closely with the colleague who built the data pipeline that you were consuming. There was only one data team, two at most. How did they do it?
In this blog post we will put these capabilities in context and dive deeper into how the built-in, end-to-end data flow life cycle enables self-service data pipeline development. Key requirements for building data pipelines: Every data pipeline starts with a business requirement.
Summary At the core of every data pipeline is a workflow manager (or several). Deploying, managing, and scaling that orchestration can consume a large fraction of a data team’s energy, so it is important to pick something that provides the power and flexibility that you need.
Timestone: Netflix’s High-Throughput, Low-Latency Priority Queueing System with Built-in Support for Non-Parallelizable Workloads by Kostas Christidis Introduction Timestone is a high-throughput, low-latency priority queueing system we built in-house to support the needs of Cosmos , our media encoding platform. Over the past 2.5
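A toy, in-process sketch of Timestone's core idea, with hypothetical names: items carry a priority plus a group key, and items sharing a group key are never handed out concurrently, which is what "built-in support for non-parallelizable workloads" refers to.

```python
import heapq

queue = []         # min-heap of (priority, seq, group_key, payload)
in_flight = set()  # group keys currently being processed
seq = 0            # tiebreaker to keep FIFO order within a priority

def enqueue(priority, group_key, payload):
    global seq
    heapq.heappush(queue, (priority, seq, group_key, payload))
    seq += 1

def dequeue():
    """Pop the highest-priority item whose group is not already
    in flight, so work within a group stays serialized."""
    skipped, item = [], None
    while queue:
        cand = heapq.heappop(queue)
        if cand[2] not in in_flight:
            item = cand
            break
        skipped.append(cand)
    for s in skipped:  # restore items we skipped over
        heapq.heappush(queue, s)
    if item:
        in_flight.add(item[2])
    return item

def done(item):
    in_flight.discard(item[2])

enqueue(1, "title-42", "encode-4k")
enqueue(0, "title-42", "encode-hd")  # higher priority, same group
job = dequeue()
print(job[3])      # encode-hd runs first
print(dequeue())   # None: title-42 is busy, nothing else eligible
done(job)
```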
We’ll discuss batch data processing, the limitations we faced, and how Psyberg emerged as a solution. Furthermore, we’ll delve into the inner workings of Psyberg, its unique features, and how it integrates into our data pipelining workflows. What is late-arriving data? How does late-arriving data impact us?
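For context, late-arriving data is data whose event time falls inside a window the pipeline has already processed. A minimal sketch of detecting it with a watermark (names are illustrative, not Psyberg's actual interface):

```python
from datetime import datetime

# The pipeline has already processed everything up to this watermark.
watermark = datetime(2023, 6, 29, 12, 0)

records = [
    {"id": 1, "event_time": datetime(2023, 6, 29, 13, 5)},  # on time
    {"id": 2, "event_time": datetime(2023, 6, 29, 10, 40)}, # late!
]

on_time = [r for r in records if r["event_time"] >= watermark]
late = [r for r in records if r["event_time"] < watermark]

# Late records force either a backfill of the closed window or an
# incremental catch-up pass, which is the problem Psyberg targets.
print(f"{len(on_time)} on time, {len(late)} late-arriving")
```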
In this episode Nick King discusses how you can be intentional about data creation in your applications and services to reduce the friction and errors involved in building data products and ML applications. Atlan is the metadata hub for your data ecosystem. What are some of the unique characteristics of that information?
We're sharing how Meta built support for data logs, which provide people with additional data about how they use our products. Here we explore initial system designs we considered, an overview of the current architecture, and some important principles Meta takes into account in making data accessible and easy to understand.
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet also offers a 2-week free trial.
Metadata is the information that provides context and meaning to data, ensuring it’s easily discoverable, organized, and actionable. It enhances data quality, governance, and automation, transforming raw data into valuable insights. Imagine a library with millions of books but no catalog system to organize them.
Kafka is designed to be a black box that collects all kinds of data, so Kafka doesn't have a built-in schema or schema enforcement; this is the biggest problem when integrating with schematized systems like a lakehouse. It excels in event-driven architectures and data pipelines. Fluss is tailored for real-time analytics.
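Because Kafka brokers treat payloads as opaque bytes, any schema check has to happen client-side. A minimal hand-rolled sketch of producer-side validation follows (the schema and produce function are hypothetical; real deployments typically pair Avro or Protobuf with a schema registry):

```python
# Hypothetical required schema for an "orders" topic.
ORDER_SCHEMA = {"order_id": str, "amount": float}

def validate(record: dict) -> None:
    """Raise if the record does not match the expected schema;
    the broker itself would happily accept any bytes."""
    for field, ftype in ORDER_SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], ftype):
            raise ValueError(f"bad type for {field}")

def produce(topic: str, record: dict) -> None:
    validate(record)  # enforce the schema before it hits Kafka
    # producer.send(topic, serialize(record))  # real client call
    print(f"sent to {topic}: {record}")

produce("orders", {"order_id": "a1", "amount": 9.99})
```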
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc.,
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Struggling with broken pipelines? Missing data?
The article summarizes the recent macro trends in AI and data engineering, focusing on Vibe coding, human-in-the-loop system design, and rapid simplification of developer tooling. As these assistants evolve, they signal a future where scalable, low-latency data pipelines become essential for seamless, intelligent user experiences.
The author emphasizes the importance of mastering state management, understanding "local first" data processing (prioritizing single-node solutions before distributed systems), and leveraging an asset graph approach for data pipelines. The article also traces the evolution to Nuage 3.0.
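For readers unfamiliar with the asset graph idea: each dataset is a node whose declared upstream dependencies determine build order. A tiny sketch under that assumption (asset names invented):

```python
from graphlib import TopologicalSorter

# Each asset lists the upstream assets it depends on.
asset_graph = {
    "raw_events": [],
    "clean_events": ["raw_events"],
    "daily_agg": ["clean_events"],
    "dashboard": ["daily_agg", "clean_events"],
}

# Materialize assets in dependency order: upstreams always first.
for asset in TopologicalSorter(asset_graph).static_order():
    print(f"materializing {asset}")
```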
The Machine Learning Platform (MLP) team at Netflix provides an entire ecosystem of tools around Metaflow , an open source machine learning infrastructure framework we started, to empower data scientists and machine learning practitioners to build and manage a variety of ML systems. ETL workflows), as well as downstream (e.g.
There’s a huge gap between the data quality most companies have by default and the data quality needed for successful AI. And that gap is directly affecting the performance and reliability of AI systems everywhere. Metaplane ensures that every company can trust the data that powers their business.
Summary There are extensive and valuable data sets that are available outside the bounds of your organization. Whether that data is public, paid, or scraped, it requires investment and upkeep to acquire and integrate it with your systems. Atlan is the metadata hub for your data ecosystem.
Summary Agile methodologies have been adopted by a majority of teams for building software applications. Applying those same practices to data can prove challenging due to the number of systems that need to be included to implement a complete feature. Atlan is the metadata hub for your data ecosystem.
We are pleased to announce that Cloudera has been named a Leader in the 2022 Gartner® Magic Quadrant for Cloud Database Management Systems. Our open, interoperable platform is deployed easily in all data ecosystems, and includes unique security and governance capabilities. 4. Ready for modern data fabric architectures.
Foundation Capital: A System of Agents brings Service-as-Software to life software is no longer simply a tool for organizing work; software becomes the worker itself, capable of understanding, executing, and improving upon traditionally human-delivered services. 60+ speakers from LinkedIn, Shopify, Amazon, Lyft, Grammarly, Mistral, et al.
Summary Data analysis is a valuable exercise that is often out of reach of non-technical users as a result of the complexity of data systems. Atlan is the metadata hub for your data ecosystem. Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code.
Now, let’s explore the state of our pipelines after incorporating Psyberg. Pipelines After Psyberg: Let’s explore how different modes of Psyberg could help with a multistep data pipeline. The session metadata table can then be read to determine the pipeline input.
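As a toy illustration of that last step (the table layout and column names are invented, not Psyberg's actual schema), a downstream step can read the session metadata row to derive the exact input window it should process:

```python
# Hypothetical session metadata rows written by earlier pipeline steps.
session_metadata = [
    {"session_id": "s-101", "low_watermark": "2023-06-28T00:00",
     "high_watermark": "2023-06-29T00:00", "mode": "stateless"},
]

def pipeline_input_for(session_id: str) -> tuple[str, str]:
    """Read the session metadata table to decide which slice of
    data the downstream step should process."""
    row = next(r for r in session_metadata
               if r["session_id"] == session_id)
    return row["low_watermark"], row["high_watermark"]

lo, hi = pipeline_input_for("s-101")
print(f"processing events with timestamps in [{lo}, {hi})")
```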
[link] Alireza Sadeghi: Open Source Data Engineering Landscape 2025 This article comprehensively overviews the 2025 open-source data engineering landscape, highlighting key trends, active projects, and emerging technologies. I wonder if these systems will keep expanding capabilities until they eventually collapse under their own weight.
Engineers from across the company came together to share best practices on everything from Data Processing Patterns to Building Reliable Data Pipelines. The result was a series of talks which we are now sharing with the rest of the Data Engineering community!
In order to reduce the friction involved in aggregating disparate data sets that share geographic similarities, the Unfolded team built a platform that supports working across raster, vector, and tabular data in a single system. Atlan is the metadata hub for your data ecosystem.
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day.