Cloud Storage and Data Lake - Data Engineering Digest

How Apache Iceberg Is Changing the Face of Data Lakes

Snowflake

APRIL 2, 2025

Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.

Data Lake Cloud Storage Metadata Data Warehouse

Setting up Data Lake on GCP using Cloud Storage and BigQuery

Analytics Vidhya

FEBRUARY 25, 2023

A data lake is a centralized and scalable repository storing structured and unstructured data. The need for a data lake arises from the growing volume, variety, and velocity of data that companies need to manage and analyze.

Cloud Storage Data Lake Cloud Unstructured Data

Enabling Security for Hadoop Data Lake on Google Cloud Storage

Uber Engineering

OCTOBER 27, 2024

Ready to boost your Hadoop Data Lake security on GCP? Our latest blog dives into enabling security for Uber’s modernized batch data lake on Google Cloud Storage!

Cloud Storage Google Cloud Data Lake Hadoop

Using Trino And Iceberg As The Foundation Of Your Data Lakehouse

Data Engineering Podcast

FEBRUARY 18, 2024

Summary: A data lakehouse is intended to combine the benefits of data lakes (cost-effective, scalable storage and compute) and data warehouses (user-friendly SQL interface). Data lakes are notoriously complex.

Data Lake High Quality Data Data Warehouse Google Cloud

Enabling Multi-User Fine-Grained Access Control for Cloud Storage in CDP

Cloudera

SEPTEMBER 10, 2021

Shared Data Experience (SDX) on Cloudera Data Platform (CDP) enables centralized data access control and audit for workloads in the Enterprise Data Cloud. The public cloud (CDP-PC) editions default to using cloud storage (S3 for AWS, ADLS-gen2 for Azure).

Cloud Storage Accessibility Accessible Cloud

Why Open Table Format Architecture is Essential for Modern Data Systems

phData: Data Engineering

NOVEMBER 8, 2024

Note: Cloud data warehouses like Snowflake and BigQuery already have a default time travel feature. However, this feature becomes an absolute must-have if you are operating your analytics on top of your data lake or lakehouse. It can also be integrated into major data platforms like Snowflake.

Architecture Systems Data Lake Google Cloud

Microsoft Fabric vs. Snowflake: Key Differences You Need to Know

Edureka

APRIL 22, 2025

Microsoft Fabric incorporates elements from several Microsoft products working together, like Power BI, Azure Synapse Analytics, Data Factory, and OneLake, into a single SaaS experience. No matter the workload, Fabric stores all data on OneLake, a single, unified data lake built on the Delta Lake model.

BI Pipeline-centric Data Lake Google Cloud

Drug Launch Case Study: Amazing Efficiency Using DataOps

DataKitchen

DECEMBER 9, 2024

They opted for Snowflake, a cloud-native data platform ideal for SQL-based analysis. The team landed the data in a Data Lake implemented with cloud storage buckets and then loaded it into Snowflake, enabling fast access and smooth integrations with analytical tools.

Pharmaceutical Data Lake Cloud Storage Project

Cloudera announces support for Azure’s next-generation Data Lake Store

Cloudera

FEBRUARY 14, 2019

The Cloudera platform delivers a one-stop shop that allows you to store any kind of data, process and analyze it in many different ways in a single environment, and integrate with the rest of your data infrastructure. But working with cloud storage has often been a compromise. As a Hadoop developer, I loved that!

Data Lake Hadoop Cloud Storage Cloud

Top Data Lake Vendors (Quick Reference Guide)

Monte Carlo

APRIL 24, 2023

Data lakes are useful, flexible data storage repositories that enable many types of data to be stored in their rawest state. Traditionally, after being stored in a data lake, raw data was then often moved to various destinations like a data warehouse for further processing, analysis, and consumption.

Data Lake Google Cloud Data Warehouse AWS

Build an Open Data Lakehouse with Iceberg Tables, Now in Public Preview

Snowflake

DECEMBER 4, 2023

With this public preview, those external catalog options are either “GLUE”, where Snowflake can retrieve table metadata snapshots from AWS Glue Data Catalog, or “OBJECT_STORE”, where Snowflake retrieves metadata snapshots directly from the specified cloud storage location. With these three options, which one should you use?

Building Metadata Cloud Storage AWS

Data Lake vs. Data Warehouse: Differences and Similarities

U-Next

SEPTEMBER 7, 2022

The terms “Data Warehouse” and “Data Lake” may have confused you, and you may have some questions. Structuring data refers to converting unstructured data into tables and defining data types and relationships based on a schema. What is Data Lake? Athena on AWS.

Data Lake Data Warehouse Unstructured Data Amazon Web Services
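
The excerpt above describes structuring as converting raw data into tables with data types defined by a schema. A minimal Python sketch of that idea follows; the `SCHEMA` mapping, the `structure` helper, the column names, and the `key=value;key=value` input format are all hypothetical and chosen only for illustration, not taken from the article.

```python
from datetime import date

# Hypothetical schema: column name -> function that converts raw text to a typed value.
SCHEMA = {
    "order_id": int,
    "customer": str,
    "amount": float,
    "order_date": date.fromisoformat,
}


def structure(raw_line: str, schema=SCHEMA) -> dict:
    """Parse one 'key=value;key=value' line into a typed row that matches the schema."""
    fields = dict(part.split("=", 1) for part in raw_line.strip().split(";") if part)
    row = {}
    for column, convert in schema.items():
        if column not in fields:
            raise ValueError(f"missing column: {column}")
        # Conversion fails loudly if the raw text does not fit the declared type.
        row[column] = convert(fields[column].strip())
    return row


raw = "order_id=17;customer=Acme;amount=249.50;order_date=2022-09-07"
row = structure(raw)
print(row)
```

Every value in `row` now has the type the schema declares, so the row could be appended to a table; a line that lacks a column or carries a badly typed value raises an error instead of being stored.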

Fivetran Supports the Automation of the Modern Data Lake on Amazon S3

phData: Data Engineering

APRIL 4, 2023

Fivetran today announced support for Amazon Simple Storage Service (Amazon S3) with Apache Iceberg data lake format. Amazon S3 is an object storage service from Amazon Web Services (AWS) that offers industry-leading scalability, data availability, security, and performance.

Data Lake Amazon Web Services Data Cleanse Data Warehouse

Get Your Analytics Insights Instantly – Without Abandoning Central IT

Cloudera

JANUARY 21, 2021

Of high value to existing customers, Cloudera’s Data Warehouse service has a unique, separated architecture. Separate storage. Cloudera’s Data Warehouse service allows raw data to be stored in the cloud storage of your choice (S3, ADLSg2). Proprietary file formats mean no one else is invited in!

IT Data Lake Data Warehouse Cloud Storage

Data Lake vs Data Warehouse - Working Together in the Cloud

ProjectPro

AUGUST 11, 2021

“Data Lake vs Data Warehouse = Load First, Think Later vs Think First, Load Later.” The terms data lake and data warehouse are frequently encountered when it comes to storing large volumes of data. Data Warehouse Architecture. What is a Data lake?

Data Lake Data Warehouse Cloud Hadoop
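
The "Load First, Think Later vs Think First, Load Later" contrast in the excerpt above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular product works: the `Lake` and `Warehouse` classes, the `conforms` helper, and the two-column `SCHEMA` are hypothetical names invented for this example.

```python
import json

# Hypothetical schema: required columns and their types.
SCHEMA = {"user_id": int, "event": str}


def conforms(record: dict) -> bool:
    """True when the record has every schema column with the expected type."""
    return all(isinstance(record.get(col), typ) for col, typ in SCHEMA.items())


class Lake:
    """Load first, think later: store raw text, apply the schema when reading."""

    def __init__(self):
        self.raw = []

    def load(self, line: str) -> None:
        self.raw.append(line)  # nothing is validated at write time

    def read(self) -> list:
        parsed = []
        for line in self.raw:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # unparseable lines are skipped at read time
            if isinstance(record, dict) and conforms(record):
                parsed.append(record)
        return parsed


class Warehouse:
    """Think first, load later: reject anything that does not fit the schema."""

    def __init__(self):
        self.rows = []

    def load(self, line: str) -> None:
        record = json.loads(line)
        if not (isinstance(record, dict) and conforms(record)):
            raise ValueError(f"record does not match schema: {line!r}")
        self.rows.append(record)


lines = ['{"user_id": 1, "event": "login"}', '{"user_id": "oops"}']
lake, warehouse = Lake(), Warehouse()
rejected = 0
for line in lines:
    lake.load(line)
    try:
        warehouse.load(line)
    except ValueError:
        rejected += 1
```

The lake accepts both lines and only filters out the malformed one when `read()` is called, while the warehouse refuses the malformed line at load time. The trade-off is where the validation cost and the risk of bad data land: at query time or at ingestion time.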

Discover And De-Clutter Your Unstructured Data With Aparavi

Data Engineering Podcast

JUNE 12, 2022

Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl. RudderStack helps you build a customer data platform on your warehouse or data lake. What are the mechanisms that you use for categorizing data assets?

Unstructured Data MongoDB MySQL Scala

Data Warehouse vs Data Lake vs Data Lakehouse: Definitions, Similarities, and Differences

Monte Carlo

AUGUST 25, 2023

That’s why it’s essential for teams to choose the right architecture for the storage layer of their data stack. But, the options for data storage are evolving quickly. So let’s get to the bottom of the big question: what kind of data storage layer will provide the strongest foundation for your data platform?

Data Lake Data Warehouse Unstructured Data Raw Data

Cloudera Data Platform extends Hybrid Cloud vision support by supporting Google Cloud

Cloudera

MARCH 31, 2021

With the addition of Google Cloud, we deliver on our vision of providing a hybrid and multi-cloud architecture to support our customers’ analytics needs regardless of deployment platform. You could then use an existing pipeline to run analytics on the prepared data in BigQuery.

Google Cloud Cloud Amazon Web Services Cloud Storage

Migrate Hive data from CDH to CDP public cloud

Cloudera

JUNE 25, 2021

This blog post outlines detailed step-by-step instructions to perform Hive Replication from an on-prem CDH cluster to a CDP Public Cloud Data Lake. CDP Data Lake cluster versions – CM 7.4.0, Configure the required ports to enable connectivity from CDH to CDP Public Cloud (see docs for details).

Cloud Data Lake Cloud Storage Metadata

A Serverless Query Engine from Spare Parts

Towards Data Science

APRIL 26, 2023

An open-source implementation of a Data Lake with DuckDB and AWS Lambdas. In this post we will show how to build a simple end-to-end application in the cloud on a serverless infrastructure. The idea is to start from a Data Lake where our data are stored.

Engineering Data Lake AWS BI

Open Source Object Storage For All Of Your Data

Data Engineering Podcast

SEPTEMBER 22, 2019

Summary: Object storage is quickly becoming the unifying layer for data-intensive applications and analytics. Modern, cloud-oriented data warehouses and data lakes both rely on the durability and ease of use that it provides. How do you approach project governance and sustainability?

AWS Google Cloud Cloud Storage Data Lake

Demystifying Modern Data Platforms

Cloudera

SEPTEMBER 15, 2022

Mark: The first element in the process is the link between the source data and the entry point into the data platform. At Ramsey International (RI), we refer to that layer in the architecture as the foundation, but others call it a staging area, raw zone, or even a source data lake.

Data Lake Analytics Application Cloud Storage Architecture

Aaand the New NiFi Champion is…

Cloudera

JUNE 5, 2023

RK built some simple flows to pull streaming data into Google Cloud Storage and Snowflake. Many developers use DataFlow to filter/enrich streams and ingest into cloud data lakes and warehouses where the ability to process and route anywhere makes DataFlow very effective.

Google Cloud Cloud Storage Data Lake Data Pipeline

Data Engineering Weekly #184

Data Engineering Weekly

AUGUST 11, 2024

Uber: Enabling Security for Hadoop Data Lake on Google Cloud Storage. Uber writes about securing a Hadoop-based data lake on Google Cloud Platform (GCP) by replacing HDFS with Google Cloud Storage (GCS) while maintaining existing security models like Kerberos-based authentication.

Data Engineering Data Engineer Google Cloud Engineering

Access control for Azure ADLS cloud object storage

Cloudera

SEPTEMBER 15, 2020

Cloudera Data Platform 7.2.1 introduces fine-grained authorization for access to Azure Data Lake Storage using Apache Ranger policies. Cloudera and Microsoft have been working together closely on this integration, which greatly simplifies the security administration of access to ADLS-Gen2 cloud storage.

Accessible Accessibility Cloud Cloud Storage

How to Build a 5-Layer Data Stack

Monte Carlo

JULY 19, 2023

In this article, we’ll present you with the Five Layer Data Stack—a model for platform development consisting of five critical tools that will not only allow you to maximize impact but empower you to grow with the needs of your organization. Before you can model the data for your stakeholders, you need a place to collect and store it.

Building Business Intelligence Cloud Storage BI

The Guide to Common Data Engineer Design Patterns

Monte Carlo

FEBRUARY 25, 2025

They make data workflows more resilient and easier to manage when things inevitably go sideways. This guide tackles the big decisions every data engineer faces: Should you clean your data before or after loading it? Data lake or warehouse? Data Lakes vs. Data Warehouses: Where Should Your Data Live?

Designing Data Engineering Data Engineer Engineering

How to Build a 5-Layer Data Stack

Towards Data Science

JULY 21, 2023

In this article, we’ll present you with the Five Layer Data Stack — a model for platform development consisting of five critical tools that will not only allow you to maximize impact but empower you to grow with the needs of your organization. Before you can model the data for your stakeholders, you need a place to collect and store it.

Building Business Intelligence BI Cloud Storage

Consulting Case Study: Job Market Analysis

WeCloudData

OCTOBER 19, 2021

Conclusion: WeCloudData helped a client build a flexible data pipeline to address the needs of multiple business units requiring different sets, views, and timelines of job market data.

Consulting Raw Data Data Lake Data Pipeline

How Much Data Do We Need? Balancing Machine Learning with Security Considerations

Towards Data Science

DECEMBER 15, 2023

Even the best of us sometimes demonize the parts of our organization whose primary goals are in the privacy and security area and conflict with our wishes to splash around in the data lake. In reality, data scientists are not always the heroes and IT and security teams are not the villains. You’re using the data, of course!

Machine Learning Data Science Data Security Data Storage

Snowflake: Amazon S3-compatible Storage with Cloudflare

Cloudyard

AUGUST 22, 2023

With this feature, you can efficiently manage, govern, and analyze your data irrespective of its storage location, ensuring optimal data management. This feature significantly contributes to enhancing your data management capabilities within the Snowflake ecosystem. “Zero egress fees mean zero vendor lock-in.”

Bytes Data Lake Cloud Storage Cloud

Moving Past ETL and ELT: Understanding the EtLT Approach

Ascend.io

AUGUST 31, 2023

Secondly, the rise of data lakes catalyzed the transition from ETL to ELT and paved the way for niche paradigms such as Reverse ETL and Zero-ETL. Still, these methods have been overshadowed by EtLT, the predominant approach reshaping today’s data landscape.

Data Lake Data Warehouse ETL Tools Data Pipeline

Unstructured Data: Examples, Tools, Techniques, and Best Practices

AltexSoft

MAY 12, 2023

Unstructured data, on the other hand, is unpredictable and has no fixed schema, making it more challenging to analyze. Without a fixed schema, the data can vary in structure and organization. There are several widely used unstructured data storage solutions such as data lakes (e.g., Hadoop, Apache Spark).

Unstructured Data NoSQL Hadoop Data Lake

What is Azure Data Factory – Here’s Everything You Need to Know

Edureka

JULY 3, 2024

ADF leverages compute services like Azure HDInsight, Spark, Azure Data Lake Analytics, or Machine Learning to process and analyze the data according to defined requirements. Publish: Transformed data is then published either back to on-premises sources like SQL Server or kept in cloud storage.

Pipeline-centric Data Lake Database-centric Data Pipeline

Most important Data Engineering Concepts and Tools for Data Scientists

DareData

JANUARY 30, 2023

Data lakes: These are large-scale data storage systems that are designed to store and process large amounts of raw, unstructured data. Examples of technologies able to aggregate data in data lake format include Amazon S3 or Azure Data Lake. Stanford's Relational Databases and SQL.

Data Engineering Data Engineer NoSQL Engineering

Apache Hadoop 3.0.0 is Generally Available!

Cloudera

DECEMBER 14, 2017

Improved support for cloud storage systems like S3 (with S3Guard), Microsoft Azure Data Lake, and Aliyun OSS. YARN Timeline Service v2, which improves the scalability, reliability, and usability of the existing Timeline Service. See the Apache Hadoop 3.0.0 documentation for a full rundown of the changes.

Hadoop Cloud Storage Data Lake Software Engineering

The Good and the Bad of Databricks Lakehouse Platform

AltexSoft

MARCH 30, 2023

What is Databricks? Databricks is an analytics platform with a unified set of tools for data engineering, data management, data science, and machine learning. It combines the best elements of a data warehouse, a centralized repository for structured data, and a data lake used to host large amounts of raw data.

Scala Data Lake Machine Learning BI

Of Muffins and Machine Learning Models

Cloudera

FEBRUARY 16, 2022

Each workspace is associated with a collection of cloud resources. In the case of CDP Public Cloud, this includes virtual networking constructs and the data lake as provided by a combination of a Cloudera Shared Data Experience (SDX) and the underlying cloud storage.

Machine Learning Algorithm Government Metadata

Azure Synapse vs Databricks: 2023 Comparison Guide

Knowledge Hut

SEPTEMBER 26, 2023

Key connectivity features include: Data Ingestion: Databricks supports data ingestion from a variety of sources, including data lakes, databases, streaming platforms, and cloud storage. This flexibility allows organizations to ingest data from virtually anywhere.

Data Lake Database-centric Machine Learning Pipeline-centric

Ingestion of Healthcare Pricing Transparency Data Files Natively on Snowflake

Snowflake

FEBRUARY 23, 2023

Snowflake’s solution to ingesting very large healthcare pricing transparency data files. In the above solution approach, the pricing transparency JSON file is hosted in a cloud storage bucket and is referenced through an external stage on Snowflake.

Healthcare Hospitality Insurance Cloud Storage

Rethinking Data Marts in the Cloud

Cloudera

OCTOBER 26, 2017

Organizations find they have much more agility with analytics in the cloud and can operate at a lower cost point than has been possible with legacy on-premises solutions. Generally, instances for transient clusters need only minimal local disk space, since data processing runs directly on the data in the cloud storage.

Cloud BI Cloud Storage Business Intelligence

The Advantages Of Live Data-Streaming In The Competitive Financial Services Sector (Part I)

Cloudera

AUGUST 21, 2020

Data-in-motion is predominantly about streaming data, so enterprises typically have two different, or binary, ways of looking at data.

Banking Kafka Cloud Storage Government

15 Sample GCP Projects Ideas for Beginners to Practice in 2023

ProjectPro

OCTOBER 6, 2021

Also, Cloud Endpoints are used, which help speed up the development, making smoother API calls for mobile app development. Data Lake using Google Cloud Platform: What is a Data Lake? A data lake is a centralized area or repository for data storage.

Google Cloud Project Data Lake Healthcare
