Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.
These stages propagate through various systems, including function-based systems that load, process, and propagate data through stacks of function calls in different programming languages (e.g., Hack, C++, Python). For simplicity, we will demonstrate these for the web, the data warehouse, and AI, per the diagram below.
Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. These patterns include both centralized storage patterns like the data warehouse, data lake, and data lakehouse, and distributed patterns such as the data mesh.
Summary: A significant source of friction and wasted effort in building and integrating data management systems is the fragmentation of metadata across various tools. Start trusting your data with Monte Carlo today! Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads?
Summary: Managing a data warehouse can be challenging, especially when trying to maintain a common set of patterns. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council.
By Anupom Syam. Background: At Netflix, our current data warehouse contains hundreds of petabytes of data stored in AWS S3, and each day we ingest and create additional petabytes. Some of the optimizations are prerequisites for a high-performance data warehouse.
Summary: The binding element of all data work is the metadata graph that is generated by all of the workflows that produce the assets used by teams across the organization. The DataHub project was created as a way to bring order to the scale of LinkedIn’s data needs. No more scripts, just SQL.
Snowflake was founded in 2012 around its data warehouse product, which is still its core offering, and Databricks was founded in 2013 by the academic researchers who co-created Spark, which became Apache Spark in 2014. It adds metadata, read, write, and transaction support that lets you treat a Parquet file as a table.
[link] Netflix: Netflix’s Distributed Counter Abstraction. Netflix writes about a scalable Distributed Counter abstraction for accurately counting events across its global services with millisecond latency. The service offers configurable counter types optimized for various use cases, with a unified Control Plane configuration.
This ecosystem includes: Catalogs: services that manage metadata about Iceberg tables; Compute Engines: tools that query and process data stored in Iceberg tables (e.g., Trino, Spark, Snowflake, DuckDB); Maintenance Processes: operations that optimize Iceberg tables, such as compacting small files and managing metadata.
When functions are “pure” — meaning they have no side effects — they can be written, tested, reasoned about, and debugged in isolation, without the need to understand external context or the history of events surrounding their execution. But how do we model this in a functional data warehouse without mutating data?
Data modeling is changing. Typical data modeling techniques, like the star schema, which defined our approach to data modeling for the analytics workloads typically associated with data warehouses, are less relevant than they once were.
Today’s customers have a growing need for faster end-to-end data ingestion to meet the expected speed of insights and overall business demand. This ‘need for speed’ drives a rethink of how to build a more modern data warehouse solution, one that balances speed with platform cost management, performance, and reliability.
Since the value of data quickly drops over time, organizations need a way to analyze data as it is generated. To avoid disruptions to operational databases, companies typically replicate data to data warehouses for analysis.
Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. What are the most interesting, innovative, or unexpected ways that you have seen column-aware data modeling used?
TL;DR: After setting up and organizing the teams, we describe four topics to make data mesh a reality. With this third platform generation, you get more real-time data analytics and lower costs, because this infrastructure is easier to manage in the cloud thanks to managed services. What you have to code is this workflow!
The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools.
The data warehouse is the foundation of the modern data stack, so it caught our attention when we saw Convoy head of data Chad Sanderson declare “the data warehouse is broken” on LinkedIn. Treating data like an API. Immutable data warehouses have challenges too.
Acryl [link]: The modern data stack needs a reimagined metadata management platform.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Interview Introduction: How did you get involved in the area of data management?
Our investments in a lakeless data warehouse, modern analytics platform, and strong master data practices have made data a core strategic capability. The Turning Point: Year 3. At Picnic we had a data warehouse from the start, from the very first order. The challenges were multi-dimensional (pun intended).
In this post we will define data quality at a high level and explore our motivation to achieve better data quality. We will then introduce our in-house product, Verity, and showcase how it serves as a central platform for ensuring data quality in our Hive data warehouse. What and Where is Data Quality?
Like the staging environment, the fronting Kafka cluster receives all events without validation. A streaming consumer, often implemented in a stream processing framework like Flink or Spark, consumes the events from the fronting Kafka cluster and runs them through data contract validation. Event routers typically don’t alter the payload.
Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
How does SaasGlue manage metadata propagation throughout the execution graph?
In this article, Chad Sanderson, Head of Product, Data Platform at Convoy and creator of Data Quality Camp, introduces a new application of data contracts: in your data warehouse. In the last couple of posts, I’ve focused on implementing data contracts in production services.
Let’s discuss how to convert events from an event-driven microservice architecture into relational tables in a warehouse like Snowflake. We use Snowflake as our data warehouse, where we build dashboards both for internal use and for customers. So our solution was to start using an intentional contract: Events.
They are working through organizational design challenges while also establishing foundational data management capabilities like metadata management and data governance that will allow them to offer trusted data to the business in a timely and efficient manner for analytics and AI.
They explain how to think about your data systems in a holistic and maintainable fashion, the security challenges that threaten to derail your efforts, and the power of using metadata as the foundation of everything that you do.
Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows.
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code.
Cloudera and Accenture demonstrate the strength of their relationship with an accelerator called the Smart Data Transition Toolkit for migrating legacy data warehouses into Cloudera Data Platform. Accenture’s Smart Data Transition Toolkit. Are you looking for your data warehouse to support the hybrid multi-cloud?
“Data Lake vs Data Warehouse = Load First, Think Later vs Think First, Load Later.” The terms data lake and data warehouse come up frequently when it comes to storing large volumes of data. Data Warehouse Architecture. What is a Data Lake?
The first blog introduced a mock connected vehicle manufacturing company, The Electric Car Company (ECC), to illustrate the manufacturing data path through the data lifecycle. Having completed the Data Collection step in the previous blog, ECC’s next step in the data lifecycle is Data Enrichment.
In truth, the synergy between batch and streaming pipelines is essential for tackling the diverse challenges posed to your data platform at scale. The key to seamlessly addressing these challenges lies, unsurprisingly, in data orchestration. Their robust core offering seamlessly integrates data warehouses with data-hungry applications.
While cloud-native, point-solution data warehouse services may serve your immediate business needs, there are dangers to the corporation as a whole when you do your own IT this way. Cloudera Data Warehouse (CDW) is here to save the day! CDW is an integrated data warehouse service within Cloudera Data Platform (CDP).
How do you structure the log events and metadata to provide detail and context for data applications?
It offers users a data integration tool that organizes data from many sources, formats it, and stores it in a single repository such as a data lake or data warehouse. Glue uses ETL jobs to extract data from various AWS cloud services and integrate it into data warehouses and lakes.
However, for all of our uncertified data, which remained the majority of our offline data, we lacked visibility into its quality and didn’t have clear mechanisms for up-leveling it. How could we scale the hard-fought wins and best practices of Midas across our entire data warehouse?
It captures incremental changes from transactional databases or other sources and efficiently loads them into data warehouses or data lakes. Moreover, it facilitates the implementation of microservices architectures and event-driven systems, automating reactions to data changes without manual intervention.
A fundamental requirement for any lasting data system is that it should scale along with the growth of the business applications it wishes to serve. NMDB is built to be a highly scalable, multi-tenant, media metadata system that can serve a high volume of write/read throughput as well as support near real-time queries.