Summary Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. To level up its value, a new trend of active metadata is being implemented, enabling use cases like keeping BI reports up to date, auto-scaling your warehouses, and automating data governance.
Summary A significant source of friction and wasted effort in building and integrating data management systems is the fragmentation of metadata across various tools. What are the capabilities that a centralized and holistic view of a platform’s metadata can enable?
The first blog introduced a mock connected vehicle manufacturing company, The Electric Car Company (ECC), to illustrate the manufacturing data path through the data lifecycle. Having completed the Data Collection step in the previous blog, ECC’s next step in the data lifecycle is Data Enrichment.
Summary The binding element of all data work is the metadata graph generated by the workflows that produce the assets used by teams across the organization. The DataHub project was created as a way to bring order to LinkedIn’s data needs at scale. How is the governance of DataHub being managed?
Data Pipeline Observability: A Model For Data Engineers Eitan Chazbani June 29, 2023 Data pipeline observability is your ability to monitor and understand the state of a data pipeline at any time. We believe the world’s data pipelines need better data observability.
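To make that concrete, here is a minimal sketch of one observability check: alerting when a table's freshest record falls outside an expected window. The table name, threshold, and query helper are illustrative assumptions, not any vendor's API.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness threshold for the example table.
FRESHNESS_THRESHOLD = timedelta(hours=2)

def fetch_max_timestamp(table: str) -> datetime:
    # Stand-in for a warehouse query such as SELECT MAX(updated_at) FROM <table>;
    # returns a fake timestamp so the sketch runs end to end.
    return datetime.now(timezone.utc) - timedelta(hours=3)

def check_freshness(table: str) -> bool:
    # A pipeline is "observable" when checks like this run continuously
    # and surface state without manual digging.
    lag = datetime.now(timezone.utc) - fetch_max_timestamp(table)
    if lag > FRESHNESS_THRESHOLD:
        print(f"ALERT: {table} is stale by {lag - FRESHNESS_THRESHOLD}")
        return False
    return True

check_freshness("orders")  # hypothetical table name
```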
Netflix: Netflix’s Distributed Counter Abstraction Netflix writes about scalable Distributed Counter abstractions for accurately counting events across its global services with millisecond latency. Separately, due to the platform's diverse user base and workloads, Canva faced challenges maintaining visibility into Snowflake usage and costs.
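For intuition, here is a toy version of one common distributed-counting idea: sharding increments so no single key becomes a write hotspot. This is a generic illustration under my own assumptions, not Netflix's actual design; a dict stands in for a distributed store.

```python
import random
from collections import defaultdict

NUM_SHARDS = 16  # assumption: 16 shards per logical counter
shards: dict[str, int] = defaultdict(int)  # stand-in for a distributed store

def increment(counter: str, delta: int = 1) -> None:
    # Each write lands on a random shard, spreading contention.
    shards[f"{counter}:{random.randrange(NUM_SHARDS)}"] += delta

def read(counter: str) -> int:
    # Reads pay the cost of summing the shards.
    return sum(shards[f"{counter}:{i}"] for i in range(NUM_SHARDS))

for _ in range(1000):
    increment("video_plays")
print(read("video_plays"))  # 1000
```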
Today’s post follows the same philosophy: fitting local and cloud pieces together to build a data pipeline. And when it comes to data engineering solutions, it’s no different: they have databases, ETL tools, streaming platforms, and so on — a set of tools that makes our life easier (as long as you pay for them). Not sponsored.
We’ll discuss batch data processing, the limitations we faced, and how Psyberg emerged as a solution. Furthermore, we’ll delve into the inner workings of Psyberg, its unique features, and how it integrates into our data pipelining workflows. It also becomes inefficient as the data scale increases.
In this blog post we will put these capabilities in context and dive deeper into how the built-in, end-to-end data flow life cycle enables self-service data pipeline development. Key requirements for building data pipelines: every data pipeline starts with a business requirement.
The challenges around memory, data size, and runtime are exciting to read. Sampling is an obvious strategy for data size, but the layered approach and dynamic inclusion of dependencies are some key techniques I learned from the case study. Passes include app-brain-date networking, birds of a feather, post-event parties, etc.
Kafka is designed for streaming events, but Fluss is designed for streaming analytics. Architecture Difference The first difference is the Data Model. It excels in event-driven architectures and data pipelines. It maintains metadata, manages tablet allocation, lists nodes, and handles permissions.
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc.,
At every step, we do not just read, transform, and write data; we also do the same with the metadata. The last part added is data security and privacy: every data governance policy on this topic must be readable by code so that it can act on your data platform (access management, masking, etc.)
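As a sketch of what "policy read by code" might look like, here masking rules live in metadata and a small function enforces them before data is exposed; the policy structure and column names are invented for illustration.

```python
import hashlib

# Hypothetical governance metadata: column -> masking rule.
MASKING_POLICY = {"email": "hash", "ssn": "redact"}

def apply_policy(row: dict) -> dict:
    # Enforce the policy on a single record before exposing it.
    masked = dict(row)
    for column, rule in MASKING_POLICY.items():
        if column not in masked:
            continue
        if rule == "redact":
            masked[column] = "***"
        elif rule == "hash":
            masked[column] = hashlib.sha256(str(masked[column]).encode()).hexdigest()[:12]
    return masked

print(apply_policy({"email": "a@b.com", "ssn": "123-45-6789", "city": "Paris"}))
```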
Metadata is the information that provides context and meaning to data, ensuring it’s easily discoverable, organized, and actionable. It enhances data quality, governance, and automation, transforming raw data into valuable insights. Managing data without metadata? Chaos, right?
TL;DR After setting up and organizing the teams, we describe four topics to make data mesh a reality. We want interoperability for any stored data, rather than having to think about how to store the data in a specific node to optimize processing. He/she is managing triggers and needs to check conditions (event type?)
Now, let’s explore the state of our pipelines after incorporating Psyberg. Pipelines After Psyberg Let’s explore how different modes of Psyberg could help with a multistep data pipeline. The session metadata table can then be read to determine the pipeline input.
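A rough sketch of that idea, under my own assumptions rather than Psyberg's actual code: each run records a session row, and the next run reads the session table to decide its input window.

```python
from dataclasses import dataclass

@dataclass
class Session:
    pipeline: str
    processed_through: str  # e.g. an hour partition like "2023-06-29T05"

def resolve_input_window(sessions: list[Session], pipeline: str, now_hour: str) -> tuple[str, str]:
    # Process everything between the last recorded session and now.
    done = [s.processed_through for s in sessions if s.pipeline == pipeline]
    start = max(done) if done else "1970-01-01T00"
    return start, now_hour

sessions = [Session("orders_etl", "2023-06-28T23"), Session("orders_etl", "2023-06-29T05")]
print(resolve_input_window(sessions, "orders_etl", "2023-06-29T10"))  # ('2023-06-29T05', '2023-06-29T10')
```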
Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. What are the most interesting, innovative, or unexpected ways that you have seen column-aware data modeling used?
Editor’s Note: Data Council 2025, Apr 22-24, Oakland, CA Data Council has always been one of my favorite events to connect with and learn from the data engineering community. Data Council 2025 is set for April 22-24 in Oakland, CA. Let me know in the comments.
At our recent Snowday event, we announced a wave of Snowflake product innovations for easier application development, new AI and LLM capabilities, better cost management and more. If you missed the event or need a refresher on what was presented, watch any Snowday session on demand. Learn more about Iceberg Tables here.
Developing event-driven pipelines is going to be a lot easier - Meet Functions!
I won’t bore you with the importance of data quality in this blog. Instead, let’s examine the current data pipeline architecture and ask why data quality is expensive. Rather than looking at the implementation of data quality frameworks, let’s examine the architectural patterns of the data pipeline.
Introducing Apache Airflow® 3.0: be among the first to see Airflow 3.0 at the live event on April 23rd; you won't want to miss it! This article highlights their growing complexity, from multimodal interaction to enterprise adoption, underscoring the data and infrastructure challenges beneath the surface.
Application Logic: Application logic refers to the type of data processing and can be anything from analytical or operational systems to data pipelines that ingest data inputs, apply transformations based on some business logic, and produce data outputs.
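In miniature, that ingest-transform-output shape looks something like this; the records and the business rule are made up for the example.

```python
def ingest() -> list[dict]:
    # Stand-in for reading from a source system.
    return [{"order_id": 1, "amount": 120.0}, {"order_id": 2, "amount": 45.0}]

def transform(rows: list[dict]) -> list[dict]:
    # Business logic (hypothetical): flag high-value orders.
    return [{**r, "high_value": r["amount"] > 100} for r in rows]

def output(rows: list[dict]) -> None:
    for r in rows:
        print(r)  # in practice: write to a table, topic, or file

output(transform(ingest()))
```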
Summary Data lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic understanding of your data quality, where errors are occurring, or how a report was constructed, you need to track the lineage of the data from beginning to end.
Summary Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform.
RudderStack helps you build a customer data platform on your warehouse or data lake.
Let’s discuss how to convert events from an event-driven microservice architecture into relational tables in a warehouse like Snowflake. So our solution was to start using an intentional contract: Events. What are Events? Events are facts about what happened within your service.
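One way such a contract might look, sketched with invented field names: a typed event that flattens one-to-one into a warehouse row.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ProfileUpdated:
    # The contract: every field here is a fact the producing service guarantees.
    event_id: str
    user_id: str
    occurred_at: str
    new_email: str

def to_row(event: ProfileUpdated) -> dict:
    # One event becomes one relational row; the table mirrors the contract.
    return asdict(event)

evt = ProfileUpdated("e-1", "u-42", datetime.now(timezone.utc).isoformat(), "new@example.com")
print(json.dumps(to_row(evt)))
```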
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! For someone who wants to get started with Dagster, can you describe a typical workflow for writing a data pipeline?
Stateless Data Processing: As the name suggests, one should use this pattern in scenarios where the columns in the target table solely depend on the content of the incoming events, irrespective of their order of occurrence. A missed event in such a scenario would result in incorrect analysis due to a wrong derived state.
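A tiny example of the stateless pattern, with invented columns: each output row is a pure function of a single event, so processing order does not change the result.

```python
def project(event: dict) -> dict:
    # Target columns depend only on this one event's content.
    return {"user_id": event["user_id"], "country": event.get("country", "unknown")}

events = [{"user_id": "a", "country": "FR"}, {"user_id": "b"}]
forward = [project(e) for e in events]
backward = [project(e) for e in reversed(events)]
assert sorted(map(str, forward)) == sorted(map(str, backward))  # order-independent
```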
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. RudderStack’s smart customer data pipeline is warehouse-first.
Summary At the core of every data pipeline is a workflow manager (or several). Deploying, managing, and scaling that orchestration can consume a large fraction of a data team’s energy, so it is important to pick something that provides the power and flexibility that you need.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. You can observe your pipelines with built-in metadata search and column-level lineage.
This leads us to event streaming microservices patterns. Now that the profile change event is published, it can be received by the quote service. In fact, schemas are more than just a contract between two event streaming microservices.
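To illustrate the pattern (with an in-memory bus standing in for a real broker like Kafka, and invented service names): the profile service publishes an event, and the quote service reacts to it.

```python
from typing import Callable

# In-memory stand-in for a topic-based broker.
subscribers: dict[str, list[Callable[[dict], None]]] = {}

def subscribe(topic: str, handler: Callable[[dict], None]) -> None:
    subscribers.setdefault(topic, []).append(handler)

def publish(topic: str, event: dict) -> None:
    for handler in subscribers.get(topic, []):
        handler(event)

def quote_service_handler(event: dict) -> None:
    print(f"quote service re-rating customer {event['customer_id']}")

subscribe("profile-changes", quote_service_handler)
publish("profile-changes", {"customer_id": "c-7", "field": "address"})
```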
Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. What is the workflow for someone getting Sifflet integrated into their data stack?
The terms ‘data orchestration’ and ‘data pipeline orchestration’ are often used interchangeably, yet they diverge significantly in function and scope. Data orchestration refers to a wide collection of methods and tools that coordinate any and all types of data-related computing tasks.
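For a concrete picture of pipeline orchestration specifically, here is a minimal sketch of an Airflow DAG (assuming Airflow 2.4+; the DAG name and task callables are placeholders).

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling source data")  # placeholder task body

def load():
    print("writing to the warehouse")  # placeholder task body

with DAG(
    dag_id="orders_pipeline",       # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",              # "schedule" requires Airflow 2.4+
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # pipeline orchestration: ordering + scheduling
```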
Today’s episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. You can observe your pipelines with built-in metadata search and column-level lineage.
Rather than manually defining all of the mappings ahead of time, we can rely on the power of graph databases and some strategic metadata to allow connections to occur as the data becomes available. If you are struggling to maintain a tangle of data pipelines, then you might find some new ideas for reducing your workload.
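A hedged sketch of that idea using networkx: datasets are nodes, and edges appear automatically wherever two datasets declare a shared metadata key, rather than being mapped by hand. The dataset names and keys are invented.

```python
import networkx as nx

# Dataset -> declared metadata keys (illustrative).
datasets = {
    "orders": {"customer_id"},
    "payments": {"customer_id", "invoice_id"},
    "invoices": {"invoice_id"},
}

g = nx.Graph()
g.add_nodes_from(datasets)
for a in datasets:
    for b in datasets:
        if a < b and datasets[a] & datasets[b]:
            # The connection emerges from shared metadata, not a manual mapping.
            g.add_edge(a, b, shared=datasets[a] & datasets[b])

print(list(g.edges(data=True)))
```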
CSP was recently recognized as a leader in the 2022 GigaOm Radar for Streaming Data Platforms report. The DevOps/app dev team wants to know how data flows between such entities and understand the key performance metrics (KPMs) of these entities.