Data Lake, Metadata and Systems - Data Engineering Digest

How Apache Iceberg Is Changing the Face of Data Lakes

Snowflake

APRIL 2, 2025

Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.

Data Lake

Data Lake Cloud Storage Metadata Data Warehouse

Why Open Table Format Architecture is Essential for Modern Data Systems

phData: Data Engineering

NOVEMBER 8, 2024

The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms.

Architecture

Architecture Systems Data Lake Google Cloud

Level Up Your Data Platform With Active Metadata

Data Engineering Podcast

JUNE 19, 2022

Summary: Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. A variety of platforms have been developed to capture and analyze that information to great effect, but they are inherently limited in their utility due to their nature as storage systems.

Metadata

Metadata MongoDB MySQL Scala

Announcing New Innovations for Data Warehouse, Data Lake, and Data Lakehouse in the Data Cloud

Snowflake

NOVEMBER 2, 2023

Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. These patterns include both centralized storage patterns like data warehouse, data lake and data lakehouse, and distributed patterns such as data mesh.

Data Lake

Data Lake Data Warehouse Cloud Unstructured Data

Eliminate Friction In Your Data Platform Through Unified Metadata Using OpenMetadata

Data Engineering Podcast

NOVEMBER 10, 2021

Summary: A significant source of friction and wasted effort in building and integrating data management systems is the fragmentation of metadata across various tools. Start trusting your data with Monte Carlo today! Supercharge your business teams with customer data using Hightouch for Reverse ETL today.

Metadata

Metadata Data Warehouse Data Lake BI

AI and Data Predictions 2025: Strategies to Realize the Promise of AI

Snowflake

DECEMBER 4, 2024

Beyond working with well-structured data in a data warehouse, modern AI systems can use deep learning and natural language processing to work effectively with unstructured and semi-structured data in data lakes and lakehouses.

Unstructured Data

Unstructured Data Data Lake Deep Learning Structured Data

Being Data Driven At Stripe With Trino And Iceberg

Data Engineering Podcast

JUNE 16, 2024

In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform. What are the other systems that feed into and rely on the Trino/Iceberg service?

Data Lake

Data Lake High Quality Data Metadata Machine Learning

Cloudera and Snowflake Partner to Deliver the Most Comprehensive Open Data Lakehouse

Cloudera

OCTOBER 23, 2024

In August, we wrote about how in a future where distributed data architectures are inevitable, unifying and managing operational and business metadata is critical to successfully maximizing the value of data, analytics, and AI.

Metadata

Metadata BI Data Lake Business Intelligence

Presto Powered Cloud Data Lakes At Speed Made Easy With Ahana

Data Engineering Podcast

SEPTEMBER 1, 2021

Summary: The Presto project has become the de facto option for building scalable open source analytics in SQL for the data lake.

Data Lake

Data Lake Cloud AWS SQL

Simplifying Data Architecture and Security to Accelerate Value

Snowflake

NOVEMBER 11, 2024

This reduces the overall complexity of getting streaming data ready to use: Simply create external access integration with your existing Kafka solution. SnowConvert is an easy-to-use code conversion tool that accelerates legacy relational database management system (RDBMS) migrations to Snowflake.

Data Architecture

Data Architecture Architecture Data Lake Kafka

Open, Interoperable Storage with Iceberg Tables, Now Generally Available

Snowflake

JUNE 21, 2024

Snowflake is now making it even easier for customers to bring the platform’s usability, performance, governance and many workloads to more data with Iceberg tables (now generally available), unlocking full storage interoperability. Iceberg tables provide compute engine interoperability over a single copy of data.

Data Lake

Data Lake BI Business Intelligence Metadata

Building A System Of Record For Your Organization's Data Ecosystem At Metaphor

Data Engineering Podcast

DECEMBER 19, 2021

Summary: Building a well-managed data ecosystem for your organization requires a holistic view of all of the producers, consumers, and processors of information. The team at Metaphor are building a fully connected metadata layer to provide both technical and social intelligence about your data. No more scripts, just SQL.

Systems

Systems Building Metadata Data Warehouse

Build Confidence In Your Data Platform With Schema Compatibility Reports That Span Systems And Domains Using Schemata

Data Engineering Podcast

SEPTEMBER 11, 2022

Summary: Data engineering systems are complex and interconnected, with myriad and often opaque chains of dependencies. To turn this into a tractable problem, one approach is to define and enforce contracts between producers and consumers of data. Atlan is the metadata hub for your data ecosystem.

Systems

Systems Metadata Building MongoDB

Beyond Kafka: Conversation with Jark Wu on Fluss - Streaming Storage for Real-Time Analytics

Data Engineering Weekly

FEBRUARY 18, 2025

Kafka is designed to be a black box to collect all kinds of data, so Kafka doesn't have built-in schema and schema enforcement; this is the biggest problem when integrating with schematized systems like Lakehouse. So you only need to store one copy of data for your streaming and Lakehouse. When to use Fluss vs Apache Pinot?

Kafka

Kafka Lambda Architecture SQL Architecture

A Data Mesh Implementation: Expediting Value Extraction from ERP/CRM Systems

Towards Data Science

FEBRUARY 6, 2024

ERP and CRM systems are designed and built to fulfil a broad range of business processes and functions. This generalisation makes their data models complex and cryptic, and working with them requires domain expertise. As you do not want to start your development with uncertainty, you decide to go for the operational raw data directly.

Systems

Systems Raw Data Metadata Data Cleanse

Gain Visibility Into Your Entire Machine Learning System Using Data Logging With WhyLogs

Data Engineering Podcast

APRIL 24, 2022

WhyLogs is a powerful library for flexibly instrumenting all of your data systems to understand the entire lifecycle of your data from source to productionized model. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies.

Machine Learning

Machine Learning Systems Data Lake Java

Data Engineering Weekly #215

Data Engineering Weekly

APRIL 6, 2025

The article summarizes the recent macro trends in AI and data engineering, focusing on Vibe coding, human-in-the-loop system design, and rapid simplification of developer tooling. Kudos to the Grab team for building a docs-as-code system. The Grab blog delights me since I have tried to do this many times.

Data Engineer

Data Engineer Data Engineering Engineering Kafka

Cloud Native Data Orchestration For Machine Learning And Data Engineering With Flyte

Data Engineering Podcast

MAY 22, 2022

Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl RudderStack helps you build a customer data platform on your warehouse or data lake. Stop struggling to speed up your data lake.

Machine Learning

Machine Learning Data Engineer Data Engineering Cloud

Making Sense Of The Technical And Organizational Considerations Of Data Contracts

Data Engineering Podcast

DECEMBER 18, 2022

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Missing data? Struggling with broken pipelines? Stale dashboards?

Metadata

Metadata Business Intelligence Data Lake BI

Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI

Data Engineering Podcast

DECEMBER 29, 2022

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Missing data? Struggling with broken pipelines? Stale dashboards?

Management

Management Metadata Business Intelligence Data Lake

Snowflake and S3 Data Lake

Cloudyard

DECEMBER 13, 2022

In this post we discuss how integrating the AWS S3 service with Snowflake can serve as a data lake in current organizations, and how a customer migrated an on-premises EDW to Snowflake to leverage Snowflake's data lake capabilities. Create an S3 bucket to hold the table data.

Data Lake

Data Lake AWS Metadata Data

Change Data Capture (CDC): What it is and How it Works

Striim

MARCH 21, 2025

Change Data Capture (CDC) has emerged as an ideal solution for near real-time movement of data from relational databases (like SQL Server or Oracle) to data warehouses, data lakes or other databases. Data can be extracted using database queries (batch-based) or Change Data Capture (near-real-time).

IT

IT Data Lake Data Warehouse Relational Database
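
The batch-based extraction mentioned in the CDC entry above can be sketched roughly as follows. This is a minimal illustration, not Striim's method: it assumes a hypothetical `customers` table with an integer `updated_at` column used as a high-water mark, and it uses an in-memory SQLite database as a stand-in for a relational source.

```python
import sqlite3


def extract_changes(conn, last_seen):
    """Return rows modified after `last_seen` and the new high-water mark.

    Hypothetical schema: customers(id, name, updated_at).
    """
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    # Keep the old mark when nothing changed since the last poll.
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark


# In-memory database standing in for the source system (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 105)],
)

# First poll: everything is new relative to a mark of 0.
first_batch, mark = extract_changes(conn, last_seen=0)

# A later update bumps updated_at, so the next poll picks up only that row.
conn.execute(
    "UPDATE customers SET name = ?, updated_at = ? WHERE id = ?",
    ("Ada L.", 110, 1),
)
second_batch, mark2 = extract_changes(conn, mark)
```

Polling of this kind does not see deleted rows or intermediate values between two polls, which is one reason near-real-time CDC is usually built on the database's change log rather than on repeated queries.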

Data Engineering Weekly #209

Data Engineering Weekly

FEBRUARY 23, 2025

Alireza Sadeghi: Open Source Data Engineering Landscape 2025. This article gives a comprehensive overview of the 2025 open-source data engineering landscape, highlighting key trends, active projects, and emerging technologies. I wonder if these systems will expand into more capabilities and eventually fall under their own weight.

Data Engineer

Data Engineer Data Engineering Engineering Kafka

How Column-Aware Development Tooling Yields Better Data Models

Data Engineering Podcast

JUNE 17, 2023

In data systems, one of the core architectural exercises is data modeling, which can have significant impacts on what is and is not possible for downstream use cases. Incorporating column-level lineage in the data modeling process encourages a more robust and well-informed design.

Data Lake

Data Lake Machine Learning Metadata Data Architecture

Supporting Diverse ML Systems at Netflix

Netflix Tech

MARCH 7, 2024

The Machine Learning Platform (MLP) team at Netflix provides an entire ecosystem of tools around Metaflow, an open source machine learning infrastructure framework we started, to empower data scientists and machine learning practitioners to build and manage a variety of ML systems.

Systems

Systems Media Machine Learning Data Warehouse

Data Lake vs. Data Warehouse vs. Data Lakehouse

Sync Computing

NOVEMBER 7, 2024

While data warehouses are still in use, their use cases are limited because they only support structured data. Data lakes add support for semi-structured and unstructured data, and data lakehouses add further flexibility with better governance in a true hybrid solution built from the ground up.

Data Lake

Data Lake Data Warehouse Business Intelligence Unstructured Data

2024 Governance Trends for Data Leaders

phData: Data Engineering

NOVEMBER 1, 2024

Strong data governance also lays the foundation for better model performance, cost efficiency, and improved data quality, which directly contributes to regulatory compliance and more secure AI systems. The technology for metadata management, data quality management, etc., is fairly advanced.

Government

Government Data Governance Finance Metadata

Cloudera Named a Leader in the 2022 Gartner® Magic Quadrant™ for Cloud Database Management Systems (DBMS)

Cloudera

DECEMBER 16, 2022

We are pleased to announce that Cloudera has been named a Leader in the 2022 Gartner® Magic Quadrant for Cloud Database Management Systems. Cloudera has long had the capabilities of a data lakehouse, if not the label. 4-Ready for modern data fabric architectures.

Database

Database Cloud Systems Management

What Are the Best Data Modeling Methodologies & Processes for My Data Lake?

phData: Data Engineering

SEPTEMBER 19, 2023

With the amount of data companies are using growing to unprecedented levels, organizations are grappling with the challenge of efficiently managing and deriving insights from these vast volumes of structured and unstructured data. What is a Data Lake? Consistency of data throughout the data lake.

Data Lake

Data Lake Process Metadata Data Warehouse

Building A Data Governance Bridge Between Cloud And Datacenters For The Enterprise At Privacera

Data Engineering Podcast

MARCH 27, 2022

Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl RudderStack helps you build a customer data platform on your warehouse or data lake. Can you explain how the Privacera platform is architected?

Data Governance

Data Governance Government Cloud Building

Data Engineering Weekly #179

Data Engineering Weekly

JULY 7, 2024

Notion: Building and scaling Notion’s data lake. Notion writes about scaling the data lake by bringing critical data ingestion operations in-house. Hudi seems to be a de facto choice for CDC data lake features. Notion migrated the insert-heavy workload from Snowflake to Hudi.

Data Engineer

Data Engineer Data Engineering Engineering Data Lake

Data Lakes vs. Data Warehouses

Grouparoo

JANUARY 11, 2022

This article looks at the options available for storing and processing big data, which is too large for conventional databases to handle. There are two main options available, a data lake and a data warehouse. What is a Data Warehouse? What is a Data Lake?

Data Lake

Data Lake Data Warehouse Unstructured Data Raw Data

Iceberg Is An Implementation Detail

dbt Developer Hub

OCTOBER 3, 2024

These formats are changing the way data is stored and metadata accessed. Apache Iceberg is a high-performance open table format developed for modern data lakes. Iceberg Data Catalog - an open-source metadata management system that tracks the schema, partition, and versions of Iceberg tables.

Metadata

Metadata Data Lake Data Storage Accessibility

Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52

Data Engineering Podcast

OCTOBER 14, 2018

Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The Hive format is also built with the assumptions of a local filesystem, which results in painful edge cases when leveraging cloud object storage for a data lake.

Data Lake

Data Lake Big Data Cloud Hadoop

Apache Ozone and Dense Data Nodes

Cloudera

APRIL 22, 2021

This CVD is built using Cloudera Data Platform Private Cloud Base 7.1.5 on Cisco UCS S3260 M5 Rack Server with Apache Ozone as the distributed file system for CDP. This has been a major architectural enhancement in how Apache Ozone manages data at scale in a data lake. Data Generation at Scale.

Pipeline-centric

Pipeline-centric Data Lake Hadoop Big Data

A Primer On Enterprise Data Curation with Todd Walter - Episode 49

Data Engineering Podcast

SEPTEMBER 23, 2018

Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. How do you define data curation? How does the size and maturity of a company affect the ways that they architect and interact with their data systems?

Data Lake

Data Lake Data Warehouse Data Architecture Architecture

Data Lake Explained: A Comprehensive Guide to Its Architecture and Use Cases

AltexSoft

AUGUST 29, 2023

In 2010, a transformative concept took root in the realm of data storage and analytics: a data lake. The term was coined by James Dixon, Back-End Java, Data, and Business Intelligence Engineer, and it started a new era in how organizations could store, manage, and analyze their data. What is a data lake?

Data Lake

Data Lake Architecture IT Amazon Web Services

Data Lake vs Data Warehouse - Working Together in the Cloud

ProjectPro

AUGUST 11, 2021

“Data Lake vs Data Warehouse = Load First, Think Later vs Think First, Load Later” The terms data lake and data warehouse are frequently stumbled upon when it comes to storing large volumes of data. Data Warehouse Architecture What is a Data lake?

Data Lake

Data Lake Data Warehouse Cloud Hadoop

Unifying Iceberg Tables on Snowflake

Snowflake

AUGUST 31, 2023

Because of its leading ecosystem of diverse adopters, contributors and commercial offerings, Iceberg helps prevent storage lock-in and eliminates the need to move or copy tables between different systems, which often translates to lower compute and storage costs for your overall data stack.

Metadata

Metadata AWS Data Lake Datasets

Business Intelligence In The Palm Of Your Hand With Zing Data

Data Engineering Podcast

DECEMBER 4, 2022

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day.

Business Intelligence

Business Intelligence Metadata BI MongoDB

Supporting And Expanding The Arrow Ecosystem For Fast And Efficient Data Processing At Voltron Data

Data Engineering Podcast

NOVEMBER 27, 2022

This has led to inefficiencies in how data is stored, accessed, and shared across process and system boundaries. The Arrow project is designed to eliminate wasted effort in translating between languages, and Voltron Data was created to help grow and support its technology and community. Missing data? Stale dashboards?

Data Process

Data Process Process Metadata Business Intelligence

Bring Order To The Chaos Of Your Unstructured Data Assets With Unstruk

Data Engineering Podcast

JUNE 17, 2021

Summary: Working with unstructured data has typically been a motivation for a data lake. Kirk Marple has spent years working with data systems and the media industry, which inspired him to build a platform for automatically organizing your unstructured assets to make them more valuable. No more scripts, just SQL.

Unstructured Data

Unstructured Data Data Warehouse Metadata Media

An Exploration Of Tobias' Experience In Building A Data Lakehouse From Scratch

Data Engineering Podcast

DECEMBER 25, 2022

Summary: Five years of hosting the Data Engineering Podcast has provided Tobias Macey with a wealth of insight into the work of building and operating data systems at a variety of scales and for myriad purposes. Atlan is the metadata hub for your data ecosystem. Missing data? Struggling with broken pipelines?

Building

Building Metadata Business Intelligence Data Lake

The Future of the Data Lakehouse – Open

Cloudera

JUNE 18, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission critical large scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes.

Data Lake

Data Lake Data Warehouse BI SQL
