Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. These patterns include both centralized storage patterns like the data warehouse, data lake, and data lakehouse, and distributed patterns such as data mesh.
For organizations considering moving from a legacy data warehouse to Snowflake, looking to learn more about how the AI Data Cloud can support legacy Hadoop use cases, or assessing new options because their current cloud data warehouse just isn't scaling anymore, it helps to see how others have done it.
Ready to boost your Hadoop data lake security on GCP? Our latest blog dives into enabling security for Uber's modernized batch data lake on Google Cloud Storage!
Data lakes are notoriously complex. For data engineers who battle to build and scale high-quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs, ranging from AI to data applications to complete analytics.
It incorporates elements from several Microsoft products working together, like Power BI, Azure Synapse Analytics, Data Factory, and OneLake, into a single SaaS experience. No matter the workload, Fabric stores all data on OneLake, a single, unified data lake built on the Delta Lake model.
Summary: Data lakehouse architectures are gaining popularity due to the flexibility and cost-effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. What is involved in integrating Nessie into a given data stack?
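As a hedged sketch of what that integration involves, the following wires Nessie in as an Iceberg catalog for Spark, following Nessie's documented Spark configuration keys; the endpoint, warehouse path, and table names are placeholders, not from the post:

```python
# Sketch: registering Nessie as an Iceberg catalog in Spark.
# Endpoint, warehouse path, and names below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("nessie-catalog")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl",
            "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")  # git-style branch
    .config("spark.sql.catalog.nessie.warehouse", "s3a://example/warehouse/")
    .getOrCreate()
)

# Tables created under this catalog get Nessie's versioned, branchable view.
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.demo")
spark.sql("CREATE TABLE IF NOT EXISTS nessie.demo.events (id BIGINT) USING iceberg")
```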
While data warehouses are still in use, they are limited in their use cases, as they only support structured data. Data lakes add support for semi-structured and unstructured data, and data lakehouses add further flexibility with better governance in a true hybrid solution built from the ground up.
Building a data lake for reporting, analytics, and machine learning needs has become general practice. Data lakes allow us to ingest data from multiple sources in their raw formats in real time. This enables us to scale to any data size and to save time in defining schemas and transformations up front.
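A minimal sketch of what that schema-on-read ingestion can look like in practice; the paths, field names, and batch format here are hypothetical, not from the post:

```python
# Schema-on-read sketch: land raw events as-is, defer schema decisions.
import json
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd

RAW_ZONE = Path("lake/raw/events")

def land_raw(events: list[dict]) -> Path:
    """Write raw JSON lines partitioned by ingestion date; no schema enforced."""
    partition = RAW_ZONE / f"dt={datetime.now(timezone.utc):%Y-%m-%d}"
    partition.mkdir(parents=True, exist_ok=True)
    out = partition / "batch.jsonl"
    with out.open("a") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")
    return out

# A schema is inferred only when the data is read, not when it is written,
# so records with different shapes can still land side by side.
path = land_raw([{"user": "a", "amount": 3.5}, {"user": "b", "clicked": True}])
df = pd.read_json(path, lines=True)
print(df.dtypes)
```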
With our consumers' ever-growing appetite for data, we recently revisited how we could load data into Redshift more efficiently. Starting point: our method of loading batch data into Redshift had been effective for years, but we continually sought improvements.
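As a hedged illustration of the batch-load pattern the post revisits, here is what a parallel COPY from S3 into Redshift can look like; the cluster, table, bucket, and IAM role are placeholders, not the authors' setup:

```python
# Sketch of a batch load into Redshift via COPY, which is generally far
# faster than row-by-row INSERTs because Redshift pulls files from S3 in
# parallel. All identifiers below are hypothetical.
import psycopg2

COPY_SQL = """
    COPY analytics.events
    FROM 's3://example-bucket/batches/2024-01-01/manifest.json'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
    FORMAT AS PARQUET
    MANIFEST;
"""

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="prod", user="loader", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute(COPY_SQL)  # commits on clean exit from the context manager
conn.close()
```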
A Drug Launch Case Study in the Amazing Efficiency of a Data Team Using DataOps: How a Small Team Powered the Multi-Billion Dollar Acquisition of a Pharma Startup. When launching a groundbreaking pharmaceutical product, the stakes and the rewards couldn't be higher. It is necessary to have more than a data lake and a database.
Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms. In this blog, we will discuss: What is the Open Table Format (OTF)? It can also be integrated into major data platforms like Snowflake. Contact phData today!
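As a rough sketch of what an OTF table looks like in practice, here is an Apache Iceberg table created through Spark SQL; the catalog name "lake" and the table schema are assumptions for illustration, and the session is assumed to already be configured with an Iceberg catalog:

```python
# Sketch: creating an Iceberg (open table format) table with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("otf-demo").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.orders (
        order_id BIGINT,
        customer_id BIGINT,
        amount DECIMAL(10, 2),
        order_ts TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(order_ts))  -- hidden partitioning via a transform
""")

# The same table can then be read by any engine that speaks Iceberg
# (Spark, Trino, Snowflake, etc.), which is the point of an open format.
spark.table("lake.sales.orders").printSchema()
```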
[link] Alireza Sadeghi: Open Source Data Engineering Landscape 2025. This article provides a comprehensive overview of the 2025 open-source data engineering landscape, highlighting key trends, active projects, and emerging technologies. I found the blog to be a comprehensive roadmap for data engineering in 2025.
Introduction: Storage accounts play a vital role in a medallion architecture for establishing an enterprise data lake. They act as a centralized repository, enabling seamless data exchange between producers and consumers. This setup empowers consumers to perform data science tasks and build machine learning (ML) models.
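A hedged sketch of how medallion layers commonly map onto storage-account paths, with data promoted from raw to curated; the account and container names are hypothetical:

```python
# Medallion layout sketch: one container (or path) per refinement layer
# on an ADLS Gen2 storage account. Names below are placeholders.
MEDALLION_LAYERS = {
    "bronze": "abfss://bronze@examplelake.dfs.core.windows.net/",  # raw, as ingested
    "silver": "abfss://silver@examplelake.dfs.core.windows.net/",  # cleaned, conformed
    "gold":   "abfss://gold@examplelake.dfs.core.windows.net/",    # curated, BI/ML ready
}

def layer_path(layer: str, dataset: str) -> str:
    """Resolve where a dataset lives for a given refinement stage."""
    return MEDALLION_LAYERS[layer] + dataset

print(layer_path("silver", "sales/orders"))
```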
Learn how we build data lake infrastructures and help organizations all around the world achieve their data goals. In today's data-driven world, organizations are faced with the challenge of managing and processing large volumes of data efficiently.
With the amount of data companies use growing to unprecedented levels, organizations are grappling with the challenge of efficiently managing and deriving insights from vast volumes of structured and unstructured data. What is a data lake? Consistency of data throughout the data lake.
Snowpark Magic: Auto-Create Tables from S3 Folders. In modern data lakes, it's common for departments like Finance, Marketing, and Sales to continuously drop data files into their respective folders within an S3 bucket. And, more importantly, how do you avoid reloading already-processed files?
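A rough sketch of the two pieces such a pipeline needs in Snowflake: schema inference to auto-create the table from the files themselves, and COPY INTO's per-file load metadata to skip already-processed files. This is not the post's actual code; the stage, file format, table, and connection values are placeholders:

```python
# Sketch: auto-create a table from staged files, then load idempotently.
from snowflake.snowpark import Session

session = Session.builder.configs({  # placeholder credentials
    "account": "xy12345", "user": "loader", "password": "...",
    "warehouse": "LOAD_WH", "database": "RAW", "schema": "DEPTS",
}).create()

# INFER_SCHEMA + CREATE TABLE ... USING TEMPLATE derives the table's
# columns from the staged Parquet files. Assumes a named file format
# 'my_parquet_fmt' and stage '@dept_stage' already exist.
session.sql("""
    CREATE TABLE IF NOT EXISTS finance_raw USING TEMPLATE (
        SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
        FROM TABLE(INFER_SCHEMA(
            LOCATION => '@dept_stage/finance/',
            FILE_FORMAT => 'my_parquet_fmt'))
    )
""").collect()

# COPY INTO keeps per-file load metadata (for 64 days), so re-running this
# statement skips files that were already loaded unless FORCE = TRUE.
session.sql("""
    COPY INTO finance_raw
    FROM @dept_stage/finance/
    FILE_FORMAT = (FORMAT_NAME = 'my_parquet_fmt')
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""").collect()
```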
Notion: Building and Scaling Notion's Data Lake. Notion writes about scaling its data lake by bringing critical data ingestion operations in-house. Hudi seems to be the de facto choice for CDC data lake features: Notion migrated its insert-heavy workload from Snowflake to Hudi.
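For context on why Hudi fits CDC workloads, here is a hedged sketch of a Hudi upsert in PySpark, where change records are merged by key instead of appended; the table name, keys, and paths are hypothetical, not Notion's actual configuration:

```python
# Sketch of a Hudi upsert: incoming CDC records are merged into the table
# by record key, with the precombine field breaking ties between versions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-cdc").getOrCreate()
changes = spark.read.parquet("s3a://example/cdc/blocks/")  # one CDC batch

hudi_options = {
    "hoodie.table.name": "blocks",
    "hoodie.datasource.write.recordkey.field": "block_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",  # merge, don't append
}
changes.write.format("hudi").options(**hudi_options) \
    .mode("append").save("s3a://example/lake/blocks/")
```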
One of the most important innovations in data management is open table formats, specifically Apache Iceberg, which fundamentally transforms the way data teams manage operational metadata in the data lake. Try Cloudera's open data lakehouse on AWS free for 5 days here, or try Snowflake free for 30 days here.
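To make that concrete, Iceberg exposes its operational metadata as queryable tables; the following is an illustrative sketch against a hypothetical table, assuming an Iceberg-enabled Spark session:

```python
# Sketch: inspecting Iceberg's operational metadata with plain SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-meta").getOrCreate()

# Every commit is a snapshot; this is what time travel and rollback use.
spark.sql("SELECT snapshot_id, committed_at, operation "
          "FROM lake.sales.orders.snapshots").show()

# Time travel: read the table as of an earlier point in time.
spark.sql("SELECT * FROM lake.sales.orders "
          "TIMESTAMP AS OF '2024-01-01 00:00:00'").show()
```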
CDF-PC is a cloud-native universal data distribution service powered by Apache NiFi on Kubernetes, allowing developers to connect to any data source anywhere, with any structure, process it, and deliver it to any destination. This blog aims to answer two questions: What is a universal data distribution service?
The Dominance of Lakehouses and Mutation Support: Lakehouses have become a standard pattern in data infrastructure, combining the best features of data lakes and warehouses. Unlike data lakes, which are predominantly append-only, lakehouses support data mutation natively (e.g., fed by log-based or trigger-based change capture).
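An illustrative example of native mutation on a lakehouse table, here using Delta Lake's MERGE in PySpark; the paths and join key are hypothetical:

```python
# Sketch: row-level MERGE on a Delta table, the kind of in-place mutation
# that append-only data lake files don't support.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-merge").getOrCreate()

target = DeltaTable.forPath(spark, "s3a://example/lake/users/")
updates = spark.read.parquet("s3a://example/staging/user_changes/")

(target.alias("t")
    .merge(updates.alias("s"), "t.user_id = s.user_id")
    .whenMatchedUpdateAll()      # mutate existing rows in place
    .whenNotMatchedInsertAll()   # insert genuinely new rows
    .execute())
```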
In today's data-driven world, data lakes have emerged as the data architecture of choice for storing and analyzing large volumes of data. However, implementing a successful data lake requires diligent planning and design, as it can quickly become a data swamp that adds no value.
[link] QuantumBlack: Solving data quality for gen AI applications. Unstructured data processing is a top priority for enterprises that want to harness the power of GenAI. It brings challenges in data processing and quality, but what data quality actually means for unstructured data is a top question for every organization.
This blog post explores how Snowflake can help with this challenge. But what if security teams didn't have to make tradeoffs? How Snowflake works: an open-architecture deployment with a modern security data lake, and best-of-breed applications from Snowflake, can keep costs down while improving an organization's security posture.
Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse-ETL pipelines.
The Rise of Data Observability: Data observability has become increasingly critical as companies seek greater visibility into their data processes. This growing demand has found a natural synergy with the rise of the data lake. As a result, monitoring data in real time was often an afterthought.
The Grab blog delights me, since I have tried to do this many times. The approach bridges the data and software engineering gap, offering a practical blueprint for scaling trustworthy data systems. Unlike code, documentation never (or rarely) goes through a review process.
Cloudera customers run some of the biggest data lakes on earth. These lakes power mission-critical, large-scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes.
What are some of the foundational skills and knowledge that are necessary for effective modeling of data warehouses? How has the era of data lakes, unstructured/semi-structured data, and non-relational storage engines impacted the state of the art in data modeling?
As mentioned in my previous blog on the topic, the recent shift to remote working has seen an increase in conversations around how data is managed. Without meeting GxP compliance, the Merck KGaA team could not run the enterprise data lake needed to store, curate, or process the data required to inform business decisions.
Apache Ozone is one of the major innovations introduced in CDP, providing the next-generation storage architecture for big data applications, in which data blocks are organized in storage containers for larger scale and better handling of small objects. Cloudera will publish separate blog posts with the results of performance benchmarks.
As organizations mature their data infrastructure and accumulate more data than ever before in their data lakes, open and reliable table formats become essential.
In the previous blog post in this series, we walked through the steps for leveraging deep learning in your Cloudera Machine Learning (CML) projects. Data ingestion: the raw data is in a series of CSV files. We will first convert this to Parquet format, as most data lakes exist as object stores full of Parquet files.
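That CSV-to-Parquet conversion step might look like the following sketch, using pandas with pyarrow installed for Parquet support; the directory names are hypothetical:

```python
# Sketch: convert a directory of CSV files to Parquet, one file at a time.
from pathlib import Path

import pandas as pd

csv_dir, parquet_dir = Path("data/raw"), Path("data/parquet")
parquet_dir.mkdir(parents=True, exist_ok=True)

for csv_file in sorted(csv_dir.glob("*.csv")):
    df = pd.read_csv(csv_file)
    # Parquet is columnar and compressed, which is why object-store data
    # lakes favor it over CSV for analytical reads.
    df.to_parquet(parquet_dir / f"{csv_file.stem}.parquet", index=False)
```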
Modern data architectures deliver key functionality in terms of flexibility and scalability of data management. This form of architecture can handle data in all forms (structured, semi-structured, and unstructured), blending capabilities from data warehouses and data lakes into data lakehouses.
In the first blog of the Universal Data Distribution blog series, we discussed the emerging need within enterprise organizations to take control of their data flows. Controlling distribution while also allowing the freedom and flexibility to deliver the data to different services is more critical than ever.
Every enterprise is trying to collect and analyze data to get better insights into its business. Whether it is log files, sensor metrics, or other unstructured data, most enterprises manage and deliver data to the data lake and leverage various applications like ETL tools, search engines, and databases for analysis.
In addition to AKS and the load balancers mentioned above, this includes VNET, Data Lake Storage, PostgreSQL Azure Database, and more. By default, Azure Data Lake Storage, PostgreSQL Database, and Virtual Machines are accessible over public endpoints. Additional aspects of a private CDW environment on Azure.
They no longer need to ask a small subset of the organization to provide them with information; rather, they have the tooling, systems, and capabilities to get the data they need. Data democratization has been a topic of conversation for the last few years, but mostly centered around data warehousing and data lakes.
This blog post outlines detailed step-by-step instructions to perform Hive replication from an on-prem CDH cluster to a CDP Public Cloud Data Lake. CDP Data Lake cluster versions: CM 7.4.0. Pre-check: Data Lake cluster.
Cloudera customers run some of the biggest data lakes on earth. These lakes power mission-critical, large-scale data analytics and AI use cases, including enterprise data warehouses. Learn more about the Cloudera Open Data Lakehouse here.
Imagine being in charge of creating an intelligent data universe where collaboration, analytics, and artificial intelligence all work together harmoniously. Expertise in Microsoft Fabric is in high demand at companies including Microsoft, Accenture, AWS, and Deloitte. Are you prepared to influence the data-driven future?