Data Lake and Metadata - Data Engineering Digest

How Apache Iceberg Is Changing the Face of Data Lakes

Snowflake

APRIL 2, 2025

Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.

Data Lake

Data Lake Cloud Storage Metadata Data Warehouse

Level Up Your Data Platform With Active Metadata

Data Engineering Podcast

JUNE 19, 2022

Summary Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. In order to level up their value a new trend of active metadata is being implemented, allowing use cases like keeping BI reports up to date, auto-scaling your warehouses, and automated data governance.

Metadata

Metadata MongoDB MySQL Scala

Announcing New Innovations for Data Warehouse, Data Lake, and Data Lakehouse in the Data Cloud

Snowflake

NOVEMBER 2, 2023

Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. These patterns include both centralized storage patterns like data warehouse , data lake and data lakehouse , and distributed patterns such as data mesh.

Data Lake

Data Lake Data Warehouse Cloud Unstructured Data

Webinars

How to Achieve High-Accuracy Results When Using LLMs

MORE WEBINARS

Eliminate Friction In Your Data Platform Through Unified Metadata Using OpenMetadata

Data Engineering Podcast

NOVEMBER 10, 2021

Summary A significant source of friction and wasted effort in building and integrating data management systems is the fragmentation of metadata across various tools. Start trusting your data with Monte Carlo today! What are the capabilities that a centralized and holistic view of a platform’s metadata can enable?

Metadata

Metadata Data Warehouse Data Lake BI

Being Data Driven At Stripe With Trino And Iceberg

Data Engineering Podcast

JUNE 16, 2024

In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform. What are the other systems that feed into and rely on the Trino/Iceberg service?

Data Lake

Data Lake High Quality Data Metadata Machine Learning

Presto Powered Cloud Data Lakes At Speed Made Easy With Ahana

Data Engineering Podcast

SEPTEMBER 1, 2021

Summary The Presto project has become the de facto option for building scalable open source analytics in SQL for the data lake. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. lets you identify data quality issues and their root causes from a single dashboard.

Data Lake

Data Lake Cloud AWS SQL

Cloudera and Snowflake Partner to Deliver the Most Comprehensive Open Data Lakehouse

Cloudera

OCTOBER 23, 2024

In August, we wrote about how in a future where distributed data architectures are inevitable, unifying and managing operational and business metadata is critical to successfully maximizing the value of data, analytics, and AI.

Metadata

Metadata BI Data Lake Business Intelligence

Why Open Table Format Architecture is Essential for Modern Data Systems

phData: Data Engineering

NOVEMBER 8, 2024

First, we create an Iceberg table in Snowflake and then insert some data. Then, we add another column called HASHKEY , add more data, and locate the S3 file containing metadata for the iceberg table. In the screenshot below, we can see that the metadata file for the Iceberg table retains the snapshot history.

Architecture

Architecture Systems Data Lake Google Cloud

Open, Interoperable Storage with Iceberg Tables, Now Generally Available

Snowflake

JUNE 21, 2024

Snowflake is now making it even easier for customers to bring the platform’s usability, performance, governance and many workloads to more data with Iceberg tables (now generally available), unlocking full storage interoperability. Iceberg tables provide compute engine interoperability over a single copy of data.

Data Lake

Data Lake BI Business Intelligence Metadata

Simplifying Data Architecture and Security to Accelerate Value

Snowflake

NOVEMBER 11, 2024

Data stewards can also set up Request for Access (private preview) by setting a new visibility property on objects along with contact details so the right person can easily be reached to grant access. Support for auto-refresh and Iceberg metadata generation is coming soon to Delta Lake Direct.

Data Architecture

Data Architecture Architecture Data Lake Kafka

AI and Data Predictions 2025: Strategies to Realize the Promise of AI

Snowflake

DECEMBER 4, 2024

Beyond working with well-structured data in a data warehouse, modern AI systems can use deep learning and natural language processing to work effectively with unstructured and semi-structured data in data lakes and lakehouses.

Unstructured Data

Unstructured Data Data Lake Deep Learning Structured Data

The View Below The Waterline Of Apache Iceberg And How It Fits In Your Data Lakehouse

Data Engineering Podcast

FEBRUARY 19, 2023

Because of their complete ownership of your data they constrain the possibilities of what data you can store and how it can be used. Can you describe what Iceberg is and its position in the data lake/lakehouse ecosystem? Acryl]([link] The modern data stack needs a reimagined metadata management platform.

IT

IT Data Lake Metadata Data Warehouse

Reflecting On The Past 6 Years Of Data Engineering

Data Engineering Podcast

FEBRUARY 5, 2023

. -> Airflow and Luigi -> Dagster, Prefect, Lyft, etc.

Data Engineering

Data Engineering Data Engineer Engineering PostgreSQL

Toward a Data Mesh (part 2) : Architecture & Technologies

François Nguyen

MARCH 22, 2021

TL;DR After setting up and organizing the teams, we are describing 4 topics to make data mesh a reality. With this 3rd platform generation, you have more real time data analytics and a cost reduction because it is easier to manage this infrastructure in the cloud thanks to managed services. What you have to code is this workflow !

Technology

Technology Architecture Google Cloud Metadata

Addressing The Challenges Of Component Integration In Data Platform Architectures

Data Engineering Podcast

NOVEMBER 26, 2023

Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics.

Architecture

Architecture Data Lake High Quality Data SQL

Cloud Native Data Orchestration For Machine Learning And Data Engineering With Flyte

Data Engineering Podcast

MAY 22, 2022

Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl RudderStack helps you build a customer data platform on your warehouse or data lake. Stop struggling to speed up your data lake.

Machine Learning

Machine Learning Data Engineering Data Engineer Cloud

Snowflake and S3 Data Lake

Cloudyard

DECEMBER 13, 2022

Read Time: 4 Minute, 23 Second During this post we will discuss how AWS S3 service and Snowflake integration can be used as Data Lake in current organizations. How customer has migrated On Premises EDW to Snowflake to leverage snowflake Data Lake capabilities.

Data Lake

Data Lake AWS Metadata Data

Making Sense Of The Technical And Organizational Considerations Of Data Contracts

Data Engineering Podcast

DECEMBER 18, 2022

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Missing data? Atlan is the metadata hub for your data ecosystem. Missing data? Stale dashboards?

Metadata

Metadata Business Intelligence Data Lake BI

Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI

Data Engineering Podcast

DECEMBER 29, 2022

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Missing data? Atlan is the metadata hub for your data ecosystem. Missing data? Stale dashboards?

Management

Management Metadata Business Intelligence Data Lake

Data Lake vs. Data Warehouse vs. Data Lakehouse

Sync Computing

NOVEMBER 7, 2024

While data warehouses are still in use, they are limited in use-cases as they only support structured data. Data lakes add support for semi-structured and unstructured data, and data lakehouses add further flexibility with better governance in a true hybrid solution built from the ground-up.

Data Lake

Data Lake Data Warehouse Business Intelligence Unstructured Data

Beyond Kafka: Conversation with Jark Wu on Fluss - Streaming Storage for Real-Time Analytics

Data Engineering Weekly

FEBRUARY 18, 2025

Fluss uses Lakehouse as a tiered storage, and data will be converted and tiered into data lakes periodically; Fluss only retains a small portion of recent data. So you only need to store one copy of data for your streaming and Lakehouse. The TabletServer stores data and provides I/O services directly to users.

Kafka

Kafka Lambda Architecture SQL Architecture

Top Data Lake Vendors (Quick Reference Guide)

Monte Carlo

APRIL 24, 2023

Data lakes are useful, flexible data storage repositories that enable many types of data to be stored in its rawest state. Traditionally, after being stored in a data lake, raw data was then often moved to various destinations like a data warehouse for further processing, analysis, and consumption.

Data Lake

Data Lake Google Cloud Data Warehouse AWS

What Are the Best Data Modeling Methodologies & Processes for My Data Lake?

phData: Data Engineering

SEPTEMBER 19, 2023

With the amount of data companies are using growing to unprecedented levels, organizations are grappling with the challenge of efficiently managing and deriving insights from these vast volumes of structured and unstructured data. What is a Data Lake? Consistency of data throughout the data lake.

Data Lake

Data Lake Process Metadata Data Warehouse

How Column-Aware Development Tooling Yields Better Data Models

Data Engineering Podcast

JUNE 17, 2023

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management RudderStack helps you build a customer data platform on your warehouse or data lake. What are the most interesting, innovative, or unexpected ways that you have seen column-aware data modeling used?

Data Lake

Data Lake Machine Learning Metadata Data Architecture

Data Lakes vs. Data Warehouses

Grouparoo

JANUARY 11, 2022

This article looks at the options available for storing and processing big data, which is too large for conventional databases to handle. There are two main options available, a data lake and a data warehouse. What is a Data Warehouse? What is a Data Lake?

Data Lake

Data Lake Data Warehouse Unstructured Data Raw Data

Build an Open Data Lakehouse with Iceberg Tables, Now in Public Preview

Snowflake

DECEMBER 4, 2023

With this public preview, those external catalog options are either “GLUE”, where Snowflake can retrieve table metadata snapshots from AWS Glue Data Catalog, or “OBJECT_STORE”, where Snowflake retrieves metadata snapshots directly from the specified cloud storage location. With these three options, which one should you use?

Building

Building Metadata Cloud Storage AWS

Data Engineering Weekly #209

Data Engineering Weekly

FEBRUARY 23, 2025

[link] Alireza Sadeghi: Open Source Data Engineering Landscape 2025 This article comprehensively overviews the 2025 open-source data engineering landscape, highlighting key trends, active projects, and emerging technologies.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Change Data Capture (CDC): What it is and How it Works

Striim

MARCH 21, 2025

Change Data Capture (CDC) has emerged as an ideal solution for near real-time movement of data from relational databases (like SQL Server or Oracle) to data warehouses, data lakes or other databases. Data can be extracted using database queries (batch-based) or Change Data Capture (near-real-time).

IT

IT Data Lake Data Warehouse Relational Database

5 Reasons Data Discovery Platforms Are Best For Data Lakes

Monte Carlo

APRIL 1, 2021

Over the past few years, data lakes have emerged as a must-have for the modern data stack. But while the technologies powering our access and analysis of data have matured, the mechanics behind understanding this data in a distributed environment have lagged behind. Data discovery tools and platforms can help.

Data Lake

Data Lake Data Warehouse Unstructured Data Government

Data Warehouse vs Data Lake vs Data Lakehouse: Definitions, Similarities, and Differences

Monte Carlo

AUGUST 25, 2023

That’s why it’s essential for teams to choose the right architecture for the storage layer of their data stack. But, the options for data storage are evolving quickly. Different vendors offering data warehouses, data lakes, and now data lakehouses all offer their own distinct advantages and disadvantages for data teams to consider.

Data Lake

Data Lake Data Warehouse Unstructured Data Raw Data

Data Lake Explained: A Comprehensive Guide to Its Architecture and Use Cases

AltexSoft

AUGUST 29, 2023

In 2010, a transformative concept took root in the realm of data storage and analytics — a data lake. The term was coined by James Dixon , Back-End Java, Data, and Business Intelligence Engineer, and it started a new era in how organizations could store, manage, and analyze their data. What is a data lake?

Data Lake

Data Lake Architecture IT Amazon Web Services

Are Apache Iceberg Tables Right For Your Data Lake? 6 Reasons Why.

Monte Carlo

NOVEMBER 14, 2023

Databricks announced that Delta tables metadata will also be compatible with the Iceberg format, and Snowflake has also been moving aggressively to integrate with Iceberg. How Apache Iceberg tables structure metadata. Is your data lake a good fit for Iceberg? I think it’s safe to say it’s getting pretty cold in here.

Data Lake

Data Lake Metadata Data Warehouse SQL

Building A Data Governance Bridge Between Cloud And Datacenters For The Enterprise At Privacera

Data Engineering Podcast

MARCH 27, 2022

Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl RudderStack helps you build a customer data platform on your warehouse or data lake.

Data Governance

Data Governance Government Cloud Building

Tame The Entropy In Your Data Stack And Prevent Failures With Sifflet

Data Engineering Podcast

NOVEMBER 20, 2022

Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. RudderStack helps you build a customer data platform on your warehouse or data lake. What is the workflow for someone getting Sifflet integrated into their data stack?

Data Lake

Data Lake Data Ingestion MongoDB Google Cloud

2024 Governance Trends for Data Leaders

phData: Data Engineering

NOVEMBER 1, 2024

Chief Technology Officer, Information Technology Industry Organizations have spent the past decade accumulating, maintaining, and securing data lakes/warehouses/fabrics that will now be expected to drive AI/LLM use cases. The technology for metadata management, data quality management, etc., No problem!

Government

Government Data Governance Finance Metadata

Data Lake vs Data Warehouse - Working Together in the Cloud

ProjectPro

AUGUST 11, 2021

“Data Lake vs Data Warehouse = Load First, Think Later vs Think First, Load Later” The terms data lake and data warehouse are frequently stumbled upon when it comes to storing large volumes of data. Data Warehouse Architecture What is a Data lake?

Data Lake

Data Lake Data Warehouse Cloud Hadoop

Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52

Data Engineering Podcast

OCTOBER 14, 2018

The Hive format is also built with the assumptions of a local filesystem which results in painful edge cases when leveraging cloud object storage for a data lake. One of the complicated problems in data modeling is managing table partitions. What are the unique challenges posed by using S3 as the basis for a data lake?

Data Lake

Data Lake Big Data Cloud Hadoop

Iceberg Is An Implementation Detail

dbt Developer Hub

OCTOBER 3, 2024

These formats are changing the way data is stored and metadata accessed. Apache Iceberg is a high-performance open table format developed for modern data lakes. Iceberg Data Catalog - an open-source metadata management system that tracks the schema, partition, and versions of Iceberg tables.

Metadata

Metadata Data Lake Data Storage Accessible

Data Engineering Weekly #179

Data Engineering Weekly

JULY 7, 2024

Learn More → Notion: Building and scaling Notion’s data lake Notion writes about scaling the data lake by bringing critical data ingestion operations in-house. Hudi seems to be a de facto choice for CDC data lake features. Notion migrated the insert heavy workload from Snowflake to Hudi.

Data Engineering

Data Engineering Data Engineer Engineering Data Lake

Apache Ozone and Dense Data Nodes

Cloudera

APRIL 22, 2021

Apache Ozone is one of the major innovations introduced in CDP, which provides the next generation storage architecture for Big Data applications, where data blocks are organized in storage containers for larger scale and to handle small objects. Collects and aggregates metadata from components and present cluster state.

Pipeline-centric

Pipeline-centric Data Lake Hadoop Metadata

Building Spark Lineage For Data Lakes

Monte Carlo

MAY 31, 2022

Metadata from the data warehouse/lake and from the BI tool of record can then be used to map the dependencies between the tables and dashboards. Integrating with it is the holy grail of Spark lineage because it contains all the information needed for how data moves through the data lake and how everything is connected.

Data Lake

Data Lake Building Scala Metadata

An Exploration Of Tobias' Experience In Building A Data Lakehouse From Scratch

Data Engineering Podcast

DECEMBER 25, 2022

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Missing data? Again, be prepared to have metadata challenges especially. Struggling with broken pipelines? Stale dashboards?

Building

Building Metadata Business Intelligence Data Lake

Unifying Iceberg Tables on Snowflake

Snowflake

AUGUST 31, 2023

Catalog Integration: Our newly developed Catalog Integration feature allows you to seamlessly plug Snowflake into other Iceberg catalogs tracking table metadata. Since 2021, Snowflake has had External Tables for the purpose of read-only querying external data lakes.

Metadata

Metadata AWS Data Lake Datasets

Understanding The Role Of The Chief Data Officer

Data Engineering Podcast

AUGUST 21, 2022

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. RudderStack helps you build a customer data platform on your warehouse or data lake.

Metadata

Metadata MongoDB MySQL Data Lake

How Apache Iceberg Is Changing the Face of Data Lakes

Level Up Your Data Platform With Active Metadata

Webinars

Trending Sources

Announcing New Innovations for Data Warehouse, Data Lake, and Data Lakehouse in the Data Cloud

Webinars

Eliminate Friction In Your Data Platform Through Unified Metadata Using OpenMetadata

Being Data Driven At Stripe With Trino And Iceberg

Presto Powered Cloud Data Lakes At Speed Made Easy With Ahana

Cloudera and Snowflake Partner to Deliver the Most Comprehensive Open Data Lakehouse

Why Open Table Format Architecture is Essential for Modern Data Systems

Open, Interoperable Storage with Iceberg Tables, Now Generally Available

Simplifying Data Architecture and Security to Accelerate Value

AI and Data Predictions 2025: Strategies to Realize the Promise of AI

The View Below The Waterline Of Apache Iceberg And How It Fits In Your Data Lakehouse

Reflecting On The Past 6 Years Of Data Engineering

Toward a Data Mesh (part 2) : Architecture & Technologies

Addressing The Challenges Of Component Integration In Data Platform Architectures

Cloud Native Data Orchestration For Machine Learning And Data Engineering With Flyte

Snowflake and S3 Data Lake

Making Sense Of The Technical And Organizational Considerations Of Data Contracts

Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI

Data Lake vs. Data Warehouse vs. Data Lakehouse

Beyond Kafka: Conversation with Jark Wu on Fluss - Streaming Storage for Real-Time Analytics

Top Data Lake Vendors (Quick Reference Guide)

What Are the Best Data Modeling Methodologies & Processes for My Data Lake?

How Column-Aware Development Tooling Yields Better Data Models

Data Lakes vs. Data Warehouses

Build an Open Data Lakehouse with Iceberg Tables, Now in Public Preview

Data Engineering Weekly #209

Change Data Capture (CDC): What it is and How it Works

5 Reasons Data Discovery Platforms Are Best For Data Lakes

Data Warehouse vs Data Lake vs Data Lakehouse: Definitions, Similarities, and Differences

Data Lake Explained: A Comprehensive Guide to Its Architecture and Use Cases

Are Apache Iceberg Tables Right For Your Data Lake? 6 Reasons Why.

Building A Data Governance Bridge Between Cloud And Datacenters For The Enterprise At Privacera

Tame The Entropy In Your Data Stack And Prevent Failures With Sifflet

2024 Governance Trends for Data Leaders

Data Lake vs Data Warehouse - Working Together in the Cloud

Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52

Iceberg Is An Implementation Detail

Data Engineering Weekly #179

Apache Ozone and Dense Data Nodes

Building Spark Lineage For Data Lakes

An Exploration Of Tobias' Experience In Building A Data Lakehouse From Scratch

Unifying Iceberg Tables on Snowflake

Understanding The Role Of The Chief Data Officer

Stay Connected