Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.
Want to process petabyte-scale data at real-time streaming ingestion rates, build data pipelines 10 times faster with 99.999% reliability, and see a 20x improvement in query performance over traditional data lakes? Enter the world of Databricks Delta Lake. Delta Lake is a game-changer for big data.
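As a minimal sketch of what working with Delta Lake looks like (assuming a local PySpark session with the delta-spark package installed; the table path and schema are illustrative, not from any specific pipeline):

```python
# Minimal Delta Lake sketch. Assumes pyspark and delta-spark are installed;
# the path and schema below are illustrative placeholders.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Writes are ACID and versioned, which is where the reliability claims
# come from: a failed job never leaves half-written files visible.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/users")

# Read the current version back; older versions remain queryable
# via the versionAsOf read option (time travel).
spark.read.format("delta").load("/tmp/delta/users").show()
```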
This guide is your roadmap to building a data lake from scratch. We'll break down the fundamentals, walk you through the architecture, and share actionable steps to set up a robust and scalable data lake. What is a data lake?
“Data lake vs. data warehouse = load first, think later vs. think first, load later.” The terms data lake and data warehouse come up frequently when it comes to storing large volumes of data.
Summary Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. To level up its value, a new trend of active metadata is being adopted, enabling use cases like keeping BI reports up to date, auto-scaling your warehouses, and automating data governance.
Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. These patterns include both centralized storage patterns like the data warehouse, data lake, and data lakehouse, and distributed patterns such as data mesh.
For example, Finaccel, a leading tech company in Indonesia, leverages AWS Glue to easily load, process, and transform its enterprise data for further processing. Another leading European company, Claranet, has adopted Glue to migrate its data loads from its existing on-premises solution to the cloud. How does AWS Glue work?
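At a high level, a Glue job reads a table from the Data Catalog, transforms a DynamicFrame, and writes the result back out. A rough skeleton of such a PySpark job follows (the database, table, and S3 path are hypothetical placeholders; the GlueContext and Job APIs are from the standard aws-glue-libs runtime):

```python
# Skeleton of an AWS Glue PySpark job. Runs inside the Glue job runtime;
# the database, table, and output path are hypothetical placeholders.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that a Glue crawler registered in the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Rename/cast a couple of fields, then write the result as Parquet.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "string", "amount", "double")])
glue_context.write_dynamic_frame.from_options(
    frame=mapped, connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet")
job.commit()
```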
This growth is due to the increasing adoption of cloud-based data integration solutions such as Azure Data Factory. If you have heard about cloud computing, you will have heard about Microsoft Azure, one of the leading cloud service providers in the world alongside AWS and Google Cloud.
Summary The Presto project has become the de facto option for building scalable open source SQL analytics on the data lake. In recent months the community has focused its efforts on making it the fastest possible option for running analytics in the cloud, alongside the growing ecosystem of open table and lake technologies (Hudi, Delta Lake, Iceberg, Nessie, LakeFS, etc.).
Acryl Data provides DataHub as an easy-to-consume SaaS product that has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl. RudderStack helps you build a customer data platform on your warehouse or data lake. Stop struggling to speed up your data lake.
Explore what Apache Iceberg is, what makes it different, and why it's quickly becoming the new standard for data lake analytics. Data lakes were born from a vision to democratize data, enabling more people, tools, and applications to access a wider range of data.
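A central piece of that story is Iceberg's metadata layer, which tracks schema, partitioning, and snapshots independently of the files themselves. As a hedged illustration, here is a small pyiceberg sketch of inspecting it (the REST catalog URI and table name are placeholders):

```python
# Hedged pyiceberg sketch: inspecting an Iceberg table's metadata layer.
# The REST catalog URI and table name are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("demo", type="rest", uri="http://localhost:8181")
table = catalog.load_table("analytics.events")

print(table.schema())            # column definitions
print(table.spec())              # partition spec (hidden partitioning)
print(table.current_snapshot())  # pointer into the snapshot history

# Scans are planned from metadata, so engines can skip irrelevant files
# without listing object storage.
arrow_table = table.scan(limit=10).to_arrow()
```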
A survey by The Data Warehousing Institute (TDWI) found that AWS Glue and Azure Data Factory are the most popular cloud ETL tools, with 69% and 67% of respondents, respectively, reporting that they use them. Azure Data Factory and AWS Glue are powerful tools for data engineers who want to perform ETL on big data in the cloud.
Data stewards can also set up Request for Access (private preview) by setting a new visibility property on objects along with contact details so the right person can easily be reached to grant access. Support for auto-refresh and Iceberg metadata generation is coming soon to Delta Lake Direct.
In August, we wrote about how in a future where distributed data architectures are inevitable, unifying and managing operational and business metadata is critical to successfully maximizing the value of data, analytics, and AI. They are free to choose the infrastructure best suited for each workload.
First, we create an Iceberg table in Snowflake and then insert some data. Then, we add another column called HASHKEY, add more data, and locate the S3 file containing metadata for the Iceberg table. In the screenshot below, we can see that the metadata file for the Iceberg table retains the snapshot history.
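For readers following along, those steps might look roughly like this with the snowflake-connector-python driver. This is a sketch, not a verified script: the credentials, external volume name, and table schema are assumptions, and the HASHKEY column mirrors the walkthrough above.

```python
# Sketch of the walkthrough above via snowflake-connector-python.
# Credentials, the external volume, and the schema are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="demo_db", schema="public")
cur = conn.cursor()

# 1. Create a Snowflake-managed Iceberg table backed by an external
#    volume, then insert some rows.
cur.execute("""
    CREATE OR REPLACE ICEBERG TABLE customers (id INT, name STRING)
      CATALOG = 'SNOWFLAKE'
      EXTERNAL_VOLUME = 'my_s3_volume'
      BASE_LOCATION = 'customers/'
""")
cur.execute("INSERT INTO customers VALUES (1, 'alice'), (2, 'bob')")

# 2. Add the HASHKEY column and more data; each commit writes a new
#    metadata file to S3 while retaining the prior snapshots.
cur.execute("ALTER ICEBERG TABLE customers ADD COLUMN hashkey STRING")
cur.execute("INSERT INTO customers VALUES (3, 'carol', 'a1b2c3')")
cur.close()
conn.close()
```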
Summary Data governance is a practice that requires a high degree of flexibility and collaboration at the organizational and technical levels. The growing prominence of cloud and hybrid environments in data management adds additional stress to an already complex endeavor.
Unlock the power of scalable cloud storage with Azure Blob Storage! This Azure Blob Storage tutorial offers everything you need to get started with this scalable cloud storage solution. By 2030, the global cloud storage market is projected to be worth USD 490.8 billion, growing at a CAGR of 24.8%.
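As a first hands-on step, here is a minimal sketch using the azure-storage-blob Python SDK (the connection string, container, and blob names are placeholders):

```python
# Minimal Azure Blob Storage sketch using the azure-storage-blob SDK.
# The connection string, container, and blob names are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("demo-container")
container.create_container()

# Upload a small blob, then read it back.
container.upload_blob(name="hello.txt", data=b"hello, blob storage")
data = container.download_blob("hello.txt").readall()
print(data)  # b'hello, blob storage'
```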
Cloudera's open data lakehouse, powered by Apache Iceberg, solves the real-world big data challenges mentioned above by providing a unified, curated, shareable, and interoperable data lake that is accessible by a wide array of Iceberg-compatible compute engines and tools. Follow the steps below to set up Cloudera.
Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses. Go to dataengineeringpodcast.com/materialize.
Snowflake is now making it even easier for customers to bring the platform’s usability, performance, governance and many workloads to more data with Iceberg tables (now generally available), unlocking full storage interoperability. Iceberg tables provide compute engine interoperability over a single copy of data.
Summary Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because they take complete ownership of your data, they constrain what data you can store and how it can be used.
TL;DR After setting up and organizing the teams, we describe four topics for making data mesh a reality, illustrated with our technical choices and the services we use on Google Cloud Platform. Data as code is a very strong choice: we do not want any UI, because UIs are a legacy of the ETL period.
CDP Public Cloud is now available on Google Cloud. The addition of support for Google Cloud enables Cloudera to deliver on its promise to offer its enterprise data platform at a global scale. CDP Public Cloud is already available on Amazon Web Services and Microsoft Azure.
As the demand for big data grows, an increasing number of businesses are turning to cloud data warehouses. The cloud's flexibility and scalability make it well suited to today's colossal data volumes. Launched in 2014, Snowflake is one of the most popular cloud data solutions on the market.
Amazon S3 Amazon Simple Storage Service, or Amazon S3, is an object store that commonly serves as the foundation of a data lake, able to store any volume of data from anywhere on the internet. It is an incredibly scalable, quick, and affordable option: S3 automatically stores objects redundantly across multiple Availability Zones, and data engineers can additionally replicate buckets across Regions.
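For orientation, a minimal boto3 sketch of writing and reading an object in an S3-backed lake (the bucket and key are hypothetical; credentials come from the standard AWS credential chain):

```python
# Minimal boto3 sketch: writing and reading an object in S3.
# Bucket and key names are placeholders; credentials are resolved from
# the standard AWS chain (env vars, ~/.aws config, or an IAM role).
import boto3

s3 = boto3.client("s3")
s3.put_object(Bucket="my-data-lake-bucket",
              Key="raw/events/2024/01/events.json",
              Body=b'{"event": "signup", "user": 42}')

obj = s3.get_object(Bucket="my-data-lake-bucket",
                    Key="raw/events/2024/01/events.json")
print(obj["Body"].read())
```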
What are the core components of Microsoft Fabric architecture? The architecture of Microsoft Fabric is based on several essential elements that work together to simplify data processes: 1. OneLake: OneLake provides a centralized data repository and is the fundamental storage layer of Microsoft Fabric.
Want to put your cloud computing skills to the test? Dive into these innovative cloud computing projects for big data professionals and learn to master the cloud! Cloud computing has revolutionized how we store, process, and analyze big data, making it an essential skill for professionals in data science and big data.
Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast-changing data. Data lakes are notoriously complex.
The Hive format is also built on the assumptions of a local filesystem, which results in painful edge cases when leveraging cloud object storage for a data lake. One of the complicated problems in data modeling is managing table partitions. What are the unique challenges posed by using S3 as the basis for a data lake?
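A hedged sketch of why: Hive-style partitioning encodes partition values in the directory layout itself, and operations that are cheap on a local filesystem, like listing and renaming directories, are slow and non-atomic on S3. The path and columns below are illustrative:

```python
# Illustrative PySpark sketch of Hive-style partitioning: each partition
# value becomes a directory (e.g. .../event_date=2024-01-01/part-*.parquet).
# The path and columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-partitions").getOrCreate()
df = spark.createDataFrame(
    [("2024-01-01", 1), ("2024-01-02", 2)], ["event_date", "clicks"])

# On a local filesystem, directory listings and renames are cheap and
# atomic; on S3 they are neither, which is the pain point table formats
# like Iceberg address by tracking data files in metadata instead.
df.write.partitionBy("event_date").parquet("/tmp/lake/clicks")
spark.read.parquet("/tmp/lake/clicks").show()
```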
With this public preview, those external catalog options are either “GLUE”, where Snowflake can retrieve table metadata snapshots from AWS Glue Data Catalog, or “OBJECT_STORE”, where Snowflake retrieves metadata snapshots directly from the specified cloud storage location. Now, Snowflake can make changes to the table.
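As a hedged sketch of how the "GLUE" option is wired up (the integration name, role ARN, catalog ID, and external volume are placeholders; the statements follow Snowflake's documented CREATE CATALOG INTEGRATION pattern, but verify the exact option names against current docs):

```python
# Sketch: pointing Snowflake at an externally managed Iceberg table whose
# metadata lives in AWS Glue. All names, the ARN, and the catalog ID are
# placeholders; check Snowflake's current docs before relying on this.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user",
                                   password="...")
cur = conn.cursor()
cur.execute("""
    CREATE CATALOG INTEGRATION glue_int
      CATALOG_SOURCE = GLUE
      CATALOG_NAMESPACE = 'analytics'
      TABLE_FORMAT = ICEBERG
      GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue'
      GLUE_CATALOG_ID = '123456789012'
      ENABLED = TRUE
""")
# The table's data and metadata stay in object storage; Snowflake reads
# metadata snapshots through the catalog integration.
cur.execute("""
    CREATE ICEBERG TABLE orders
      EXTERNAL_VOLUME = 'my_s3_volume'
      CATALOG = 'glue_int'
      CATALOG_TABLE_NAME = 'orders'
""")
```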
Organizations today are looking to glean insights from a host of sources, ranging from systems of record to cloud warehouses, with structured and unstructured data from both Hadoop and non-Hadoop sources. Data lakes allow enterprises to centralize all sorts of information and gain a competitive edge in the market.
Databricks: Overview. Azure Synapse is a limitless analytics service that combines big data analytics, data integration, and enterprise data warehousing into a single unified platform. Databricks architecture, on the other hand, is not entirely a data warehouse.
How do you control data privacy and protect against data breaches when the data is spread across so many different systems? How do you optimize your enterprise-wide infrastructure (mostly cloud) and application expenditures? In CDP, an “Environment” is a logical subset of your cloud provider account.
Amazon Web Services, or AWS, remains among the top cloud computing platforms, with a 34% market share as of 2022. Millions of organizations that want to be data-driven choose AWS as their cloud services partner. With AWS cloud services, web applications may be deployed quickly without further coding or server infrastructure.
According to the survey, big data (35 percent), cloud computing (39 percent), operating systems (33 percent), and the Internet of Things (31 percent) are all expected to be impacted by open source in the near term. These statistics suggest big data is set to get bigger with the evolution of open-source projects.
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Missing data? Struggling with broken pipelines? Stale dashboards?
Organizations also need a better understanding of how LLMs are trained, especially with external vendors or public cloud environments. In sectors like legal services, safeguarding client data from being used in public apps or external training models is critical.
Customers can now seamlessly automate migration to Cloudera's hybrid data platform, Cloudera Data Platform (CDP), and dynamically auto-scale cloud services through Cloudera Data Engineering (CDE) integration with Modak Nabu. Cloud speed and scale: customers using Modak Nabu with CDP today have deployed data lakes and…
Many Cloudera customers are making the transition from being completely on-prem to the cloud, by either backing up their data in the cloud or running multi-functional analytics on CDP Public Cloud in AWS or Azure. CDP Data Lake cluster versions: CM 7.4.0, Runtime 7.2.8.
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription.
The cloud has given us hope: with public clouds at our disposal, we now have virtually infinite resources. But they come at a different cost. Using the cloud means we may be creating yet another series of silos, which also creates unmeasurable new risks in the security and traceability of our data. A solution:
Fluss is a compelling new project in the realm of real-time data processing. I spoke with Jark Wu, who leads the Fluss and Flink SQL team at Alibaba Cloud, to understand its origins and potential. So you only need to store one copy of the data for both your streaming and your lakehouse workloads. The fourth difference is the lakehouse architecture.
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. How has the move to the cloud for data warehousing/data platforms influenced the practice of data modeling?
Data Engineers and Data Scientists require efficient methods for managing large databases, which is why centralized data warehouses are in high demand. Cloud computing has made it easier for businesses to move their data to the cloud for better scalability, performance, solid integrations, and affordable pricing.