Data storage has been evolving, from databases to data warehouses to expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.
First, we create an Iceberg table in Snowflake and then insert some data. Then, we add another column called HASHKEY, add more data, and locate the S3 file containing metadata for the Iceberg table. In the screenshot below, we can see that the metadata file for the Iceberg table retains the snapshot history.
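To make the flow concrete, here is a minimal sketch of those steps driven through the snowflake-connector-python package; the table, column names, and external volume below are assumptions for illustration, not the article's actual objects.

```python
# Illustrative only: replays the steps described above with made-up names.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",  # placeholders
)
cur = conn.cursor()

# 1. Create a Snowflake-managed Iceberg table backed by S3 and insert data.
cur.execute("""
    CREATE ICEBERG TABLE demo_iceberg (id INT, name STRING)
      CATALOG = 'SNOWFLAKE'
      EXTERNAL_VOLUME = 'my_s3_volume'
      BASE_LOCATION = 'demo_iceberg/'
""")
cur.execute("INSERT INTO demo_iceberg VALUES (1, 'first row')")

# 2. Evolve the schema with a HASHKEY column and add more data; each
#    commit writes a new snapshot to the table's metadata files in S3.
cur.execute("ALTER ICEBERG TABLE demo_iceberg ADD COLUMN hashkey STRING")
cur.execute("INSERT INTO demo_iceberg VALUES (2, 'second row', 'abc123')")
```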
With this public preview, those external catalog options are either “GLUE”, where Snowflake can retrieve table metadata snapshots from AWS Glue Data Catalog, or “OBJECT_STORE”, where Snowflake retrieves metadata snapshots directly from the specified cloud storage location.
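As a hedged sketch, the two options map to catalog integrations roughly as follows; the namespace, role ARN, account ID, and integration names are placeholders.

```python
# Hypothetical catalog integrations for the two external catalog options.
import snowflake.connector

cur = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",  # placeholders
).cursor()

# Option 1: GLUE -- Snowflake reads table metadata from AWS Glue Data Catalog.
cur.execute("""
    CREATE CATALOG INTEGRATION glue_catalog_int
      CATALOG_SOURCE = GLUE
      CATALOG_NAMESPACE = 'my_glue_db'
      TABLE_FORMAT = ICEBERG
      GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my_glue_role'
      GLUE_CATALOG_ID = '123456789012'
      ENABLED = TRUE
""")

# Option 2: OBJECT_STORE -- Snowflake reads metadata snapshots directly
# from the cloud storage location named when the table is created.
cur.execute("""
    CREATE CATALOG INTEGRATION object_store_int
      CATALOG_SOURCE = OBJECT_STORE
      TABLE_FORMAT = ICEBERG
      ENABLED = TRUE
""")
```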
Data lakes are useful, flexible data storage repositories that enable many types of data to be stored in their rawest state. Traditionally, after being stored in a data lake, raw data was often moved to various destinations, such as a data warehouse, for further processing, analysis, and consumption.
With CDW (Cloudera Data Warehouse), as an integrated service of CDP (Cloudera Data Platform), your line of business gets the immediate resources needed for faster application launches and expedited data access, all while protecting the company’s multi-year investment in centralized data management, security, and governance. Separate storage. Separate compute.
With the addition of Google Cloud, we deliver on our vision of providing a hybrid and multi-cloud architecture to support our customers’ analytics needs regardless of deployment platform. You could then use an existing pipeline to run analytics on the prepared data in BigQuery.
This blog post outlines detailed step-by-step instructions to perform Hive replication from an on-prem CDH cluster to a CDP Public Cloud Data Lake. The CDP Data Lake cluster version is CM 7.4.0. Configure the required ports to enable connectivity from CDH to CDP Public Cloud (see the docs for details).
That’s why it’s essential for teams to choose the right architecture for the storage layer of their data stack. But the options for data storage are evolving quickly. So let’s get to the bottom of the big question: what kind of data storage layer will provide the strongest foundation for your data platform?
“Data Lake vs Data Warehouse = Load First, Think Later vs Think First, Load Later.” The terms data lake and data warehouse come up frequently when it comes to storing large volumes of data.
Mark: The first element in the process is the link between the source data and the entry point into the data platform. At Ramsey International (RI), we refer to that layer in the architecture as the foundation, but others call it a staging area, raw zone, or even a source data lake.
Each workspace is associated with a collection of cloud resources. In the case of CDP Public Cloud, this includes virtual networking constructs and the data lake, as provided by a combination of Cloudera Shared Data Experience (SDX) and the underlying cloud storage. Figure 03: lineage.yaml.
Data architecture is the organization and design of how data is collected, transformed, integrated, stored, and used by a company. Metadata management skills: metadata management unlocks the value of a company’s data, and it’s the data architect’s task to ensure metadata principles apply to all the data a business has.
Second, the rise of data lakes catalyzed the transition from ETL to ELT and paved the way for niche paradigms such as Reverse ETL and Zero-ETL. Still, these methods have been overshadowed by EtLT, the predominant approach reshaping today’s data landscape.
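To illustrate the distinction, here is a toy Python sketch of the EtLT pattern, under the assumption that the small “t” is a cheap in-flight cleanup step (for example, masking PII) while the heavy modeling stays in the warehouse; all names and data are invented.

```python
import json

def extract(path):
    """E: read raw JSON-lines records from a source file (path is illustrative)."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

def light_transform(record):
    """t: cheap, stateless cleanup applied in flight -- mask PII and
    drop junk keys -- leaving heavy modeling for the warehouse."""
    if "email" in record:
        record["email"] = "***"
    record.pop("_debug", None)
    return record

def load(records, staging):
    """L: land the lightly cleaned records in a staging area (a list
    here, standing in for a warehouse stage)."""
    staging.extend(records)

staging_area = []
load((light_transform(r) for r in extract("events.jsonl")), staging_area)
# T: the final, heavy transformation then runs inside the warehouse (SQL).
```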
Data-in-motion is predominantly about streaming data, so enterprises typically have two distinct, almost binary, ways of looking at data. The governance aspect is perhaps even more important, and businesses need to be able to understand where the data comes from.
Supported Data Stores and Formats Azure Data Factory and Azure Synapse Analytics support a vast array of data stores for the Copy activity. NoSQL Stores: NoSQL databases such as Cassandra and MongoDB (including MongoDB Atlas) are supported as source systems, making it easy to integrate unstructured data.
a runtime environment (sandbox) for classic business intelligence (BI), advanced analysis of large volumes of data, predictive maintenance, and data discovery and exploration; a store for raw data; a tool for large-scale data integration; and a suitable technology to implement data lake architecture.
Then the Yelp dataset, downloaded in JSON format, is loaded via the Cloud SDK into Cloud Storage, which is then connected to Cloud Composer. Cloud Composer and Pub/Sub outputs feed an Apache Beam pipeline that runs on Google Dataflow. Upload it to Azure Data Lake Storage manually.
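A minimal, hypothetical Apache Beam pipeline for the flow just described: read the Yelp JSON lines from Cloud Storage, parse them, and write a cleaned copy back. The bucket paths and field names are assumptions.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Pass --runner=DataflowRunner (plus project/region) to run on Dataflow;
# by default this executes locally with the DirectRunner.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/yelp/reviews.jsonl")
        | "Parse" >> beam.Map(json.loads)
        | "Project" >> beam.Map(lambda r: {"business_id": r.get("business_id"),
                                           "stars": r.get("stars")})
        | "Serialize" >> beam.Map(json.dumps)
        | "Write" >> beam.io.WriteToText("gs://my-bucket/yelp/cleaned")
    )
```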
Cloud: Technology advancements, information security threats, faster internet speeds, and a push to prevent data loss have all contributed to the move toward cloud-native storage and processing. It is the most feasible option when data volumes are huge, and it is useful for making instant backups.
But with modern cloud storage solutions and clever techniques like log compaction (where obsolete entries are removed), this is becoming less and less of an issue. The benefits of log-based approaches often far outweigh the storage costs. Both persistent staging and data lakes involve storing large amounts of raw data.
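A toy illustration of the log compaction idea: given an append-only change log, keep only the latest entry per key, the way compacted logs shed obsolete records. The keys and values are made up.

```python
# An append-only change log of (key, value) entries; None is a tombstone.
log = [
    ("user:1", {"name": "Ada"}),
    ("user:2", {"name": "Grace"}),
    ("user:1", {"name": "Ada Lovelace"}),   # supersedes the first entry
    ("user:2", None),                        # tombstone: delete user:2
]

def compact(entries):
    """Keep only the most recent value per key, dropping deleted keys."""
    latest = {}
    for key, value in entries:               # later entries win
        latest[key] = value
    return {k: v for k, v in latest.items() if v is not None}

print(compact(log))  # {'user:1': {'name': 'Ada Lovelace'}}
```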
Unstructured data, on the other hand, is unpredictable and has no fixed schema, making it more challenging to analyze. Without a fixed schema, the data can vary in structure and organization. There are several widely used unstructured data storage solutions, such as data lakes (e.g., Hadoop, Apache Spark).
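A small, hypothetical PySpark sketch of the schema-on-read approach a data lake enables: Spark infers a schema from raw JSON files at read time instead of requiring one up front. The path and field name are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Each file may carry different fields; Spark unions them into one
# inferred schema at read time rather than enforcing one on write.
raw = spark.read.json("s3a://my-lake/raw/events/")
raw.printSchema()  # inspect the structure Spark inferred
raw.select("event_type").where("event_type IS NOT NULL").show(5)
```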
With companies moving their data platforms to the cloud, cloud-native solutions (data warehouse vs data lake, or even a data lakehouse) have taken over the market, offering more accessible and affordable options for storing data than many on-premises solutions.
ADF leverages compute services like Azure HDInsight, Spark, Azure Data Lake Analytics, or Machine Learning to process and analyze the data according to defined requirements. Publish: Transformed data is then published either back to on-premises sources like SQL Server or kept in cloud storage.
What is Databricks? Databricks is an analytics platform with a unified set of tools for data engineering, data management, data science, and machine learning. It combines the best elements of a data warehouse, a centralized repository for structured data, and a data lake used to host large amounts of raw data.
The world of data management is undergoing a rapid transformation. The rise of cloud storage, coupled with the increasing demand for real-time analytics, has led to the emergence of the Data Lakehouse. This paradigm combines the flexibility of data lakes with the performance and reliability of data warehouses.
Built-in Data Governance: Data quality checks, CI/CD pipeline, the ability to run integration testing before pushing into production, access controls, and lineage tracking will be integrated directly into the development workflow, ensuring that data governance is not an afterthought.
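As a hedged sketch of what such an in-workflow quality gate might look like, here is a simple Python check that could run in a CI/CD step before data is promoted; the table, columns, and rules are illustrative, not any specific product's API.

```python
import pandas as pd

def check_orders(df: pd.DataFrame) -> list[str]:
    """Return a list of failed expectations for a hypothetical orders table."""
    failures = []
    if df["order_id"].duplicated().any():
        failures.append("order_id is not unique")
    if df["amount"].lt(0).any():
        failures.append("amount contains negative values")
    if df["customer_id"].isna().any():
        failures.append("customer_id has nulls")
    return failures

# In CI, a failed check blocks the pipeline before production promotion.
orders = pd.DataFrame({"order_id": [1, 2], "amount": [9.5, 12.0],
                       "customer_id": ["a", "b"]})
assert not check_orders(orders), "data quality gate failed"
```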