Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.
Want to process petabyte-scale data at real-time streaming ingestion rates, build data pipelines 10 times faster with 99.999% reliability, and see a 20x improvement in query performance compared to traditional data lakes? Enter the world of Databricks Delta Lake. Delta Lake is a game-changer for big data.
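To make the Delta Lake claims concrete, here is a minimal sketch of writing and reading a Delta table with the `deltalake` (delta-rs) Python package; the table path and column names are illustrative assumptions, not taken from the excerpt above.

```python
# Minimal sketch: writing and reading a Delta Lake table with the
# `deltalake` (delta-rs) package. Table path and schema are
# illustrative assumptions.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

events = pd.DataFrame({
    "event_id": [1, 2, 3],
    "event_type": ["click", "view", "click"],
})

# Each write is an ACID transaction recorded in the Delta log.
write_deltalake("/tmp/events_delta", events, mode="append")

# Reading back: the Delta log guarantees a consistent snapshot.
dt = DeltaTable("/tmp/events_delta")
print(dt.version())    # current table version
print(dt.to_pandas())  # materialize the snapshot as a DataFrame
```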
Now, businesses are looking for different types of data storage to store and manage their data effectively. Organizations can collect millions of data points, but without an effective way to store that data, those efforts […] The post A Comprehensive Guide to Data Lake vs. Data Warehouse appeared first on Analytics Vidhya.
Introduction A data lake is a centralized and scalable repository storing structured and unstructured data. The need for a data lake arises from the growing volume, variety, and velocity of data companies need to manage and analyze.
That's where data lakes come in. This guide is your roadmap to building a data lake from scratch. We'll break down the fundamentals, walk you through the architecture, and share actionable steps to set up a robust and scalable data lake. Table of Contents: What is a Data Lake?
Many organizations are struggling to store, manage, and analyze data due to its exponential growth. To address these issues, cloud-based data lakes allow organizations to gather any form of data, structured or unstructured, and make it accessible for use across various applications.
In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology, and how you can start using it today to build a real-time data lake without all of the headache. What is the impact of continuous data flows on DAGs and the orchestration of transforms? RudderStack also supports real-time use cases.
Image by Rachel Claire on Pexels. Ever wanted, or been asked, to build an open-source data lake offloading data for analytics? Didn't know the difference between a data lakehouse and a data warehouse? Asked yourself what components and features that would include?
Summary Databases come in a variety of formats for different use cases. The default association with the term "database" is relational engines, but non-relational engines are also used quite widely. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication.
Did you know that the global data lakes market will likely grow at a CAGR of 29.9% and reach USD 17.60 billion by 2026? Modern businesses are more likely to make data-driven decisions. Organizations are generating a massive volume of data due to the rise in digitalization. What is Azure Data Lake?
“Data Lake vs Data Warehouse = Load First, Think Later vs Think First, Load Later.” The terms data lake and data warehouse frequently come up when it comes to storing large volumes of data. Data Warehouse Architecture. What is a Data Lake?
Summary A significant portion of data workflows involve storing and processing information in database engines. In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data. Data lakes are notoriously complex.
Microsoft offers Azure Data Lake, a cloud-based data storage and analytics solution. It is capable of effectively handling enormous amounts of structured and unstructured data. Therefore, it is a popular choice for organizations that need to process and analyze big data files.
Summary Building a database engine requires a substantial amount of engineering effort and time investment. In this episode the guest explains how he used the combination of Apache Arrow, Flight, DataFusion, and Parquet to lay the foundation of the newest version of his time-series database. Data lakes are notoriously complex.
The demand for higher data velocity, with faster access and analysis of data as it's created and modified, without waiting for slow, time-consuming bulk movement, became critical to business agility. This turned into data lakes and data lakehouses. Poor data quality turned Hadoop into a data swamp, and what sounds better than a data swamp?
A few months ago, I uploaded a video where I discussed data warehouses, data lakes, and transactional databases. However, the world of data management is evolving rapidly, especially with the resurgence of AI and machine learning.
Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast-changing data. Materialize is the only true SQL streaming database built from the ground up to meet the needs of modern data products. With Materialize, you can!
In this article, we will explore the evolution of Iceberg, its key features like ACID transactions, partition evolution, and time travel, and how it integrates with modern data lakes. We'll also dive into […] The post How to Use Apache Iceberg Tables? appeared first on Analytics Vidhya.
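As a hedged sketch of the features that excerpt names, the snippet below creates an Iceberg table through PySpark with a local Hadoop catalog. The catalog name ("local"), warehouse path, table name, and the pinned runtime package version are all assumptions for illustration; the runtime artifact must match your Spark and Scala versions.

```python
# Hedged sketch: Apache Iceberg via PySpark with a local Hadoop catalog.
# Catalog name, warehouse path, table, and package version are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS local.db.orders "
          "(id BIGINT, amount DOUBLE) USING iceberg")
spark.sql("INSERT INTO local.db.orders VALUES (1, 9.99), (2, 19.99)")  # ACID write

# Every committed write produces a snapshot, which is what enables
# time travel; the snapshots metadata table lists them.
spark.sql("SELECT * FROM local.db.orders.snapshots").show()
```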
One question that puzzled me, though, was how tools like the Debezium CDC connectors can read changes from MySQL and PostgreSQL databases. [Change Data Capture (CDC) system example diagram, created using Lucidchart.] What is Change Data Capture? MySQL binlog: MySQL uses a binary log to record changes to the database.
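To illustrate the binlog-reading idea, here is a hedged sketch using the python-mysql-replication library, which speaks the same replication protocol Debezium's MySQL connector uses; the connection settings and server_id are placeholder assumptions.

```python
# Hedged sketch: tailing the MySQL binlog like a CDC reader, using
# python-mysql-replication. Connection settings are placeholders.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent,
)

MYSQL_SETTINGS = {"host": "127.0.0.1", "port": 3306,
                  "user": "repl", "passwd": "repl"}

stream = BinLogStreamReader(
    connection_settings=MYSQL_SETTINGS,
    server_id=100,                 # must be unique among replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,                 # tail the log like a replica (Ctrl-C to stop)
)

try:
    for event in stream:           # each event is one committed row change batch
        for row in event.rows:
            print(event.schema, event.table, row)
finally:
    stream.close()
```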
Data Access API over Data Lake Tables Without the Complexity: build a robust GraphQL API service on top of your S3 data lake files with DuckDB and Go. (Photo by Joshua Sortino on Unsplash.) This data might be primarily used for internal reporting, but might also be valuable for other services in our organization.
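The article uses Go; as a sketch of just the query layer in Python instead, DuckDB's httpfs extension can scan Parquet files directly on S3. The bucket, prefix, and column names below are illustrative assumptions, and credentials would be configured via DuckDB's s3 settings or a CREATE SECRET in newer versions.

```python
# Sketch of the DuckDB query layer over S3 Parquet (the article's API
# layer in Go is omitted). Bucket and columns are assumptions.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region='us-east-1';")  # plus s3_access_key_id etc. as needed

rows = con.execute("""
    SELECT customer_id, SUM(amount) AS total
    FROM read_parquet('s3://my-lake/sales/*.parquet')
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""").fetchall()
print(rows)
```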
Unify transactional and analytical workloads in Snowflake for greater simplicity Many businesses must maintain two separate databases: one to handle transactional workloads and another for analytical workloads.
It offers a simple and efficient solution for data processing in organizations. It offers users a data integration tool that organizes data from many sources, formats it, and stores it in a single repository, such as data lakes, data warehouses, etc.
Before it migrated to Snowflake in 2022, WHOOP was using a catalog of tools — Amazon Redshift for SQL queries and BI tooling, Dremio for a data lake, PostgreSQL databases and others — that had ultimately become expensive to manage and difficult to maintain, let alone scale.
Azure Data Factory is a cloud-based, fully managed, serverless ETL and data integration service offered by Microsoft Azure for automating data movement from its native place to, say, a data lake or data warehouse, using ETL (extract-transform-load) or ELT (extract-load-transform).
What if your data lake could do more than just store information—what if it could think like a database? As data lakehouses evolve, they transform how enterprises manage, store, and analyze their data.
Managing the operational concerns for your database can be complex and expensive, especially if you need to scale to large volumes of data, high traffic, or geographically distributed usage. No more shipping and praying: you can now know exactly what will change in your database!
RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
Summary Many of the events, ideas, and objects that we try to represent through data have a high degree of connectivity in the real world. TigerGraph is a leading database that offers a highly scalable and performant native graph engine for powering graph analytics and machine learning. Start trusting your data with Monte Carlo today!
Summary Databases are the core of most applications, but they are often treated as inscrutable black boxes. When an application is slow, there is a good probability that the database needs some attention. Materialize is the only true SQL streaming database built from the ground up to meet the needs of modern data products.
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment.
The benefits it offers range from data management and manipulation to machine learning tools on the GCP platform. GCP offers 90 services that span computation, storage, databases, networking, operations, development, data analytics, machine learning, and artificial intelligence, to name a few.
Anyone who’s been roaming around the forest of data engineering has probably run into many of the newish tools that have been growing rapidly around the concepts of data warehouses, data lakes, and lakehouses… the merging of the old relational database functionality with TB- and PB-level cloud-based file storage systems.
Summary When you think about selecting a database engine for your project, you typically consider options focused on serving multiple concurrent users. Sometimes what you really need is an embedded database that is blazing fast for single-user workloads. Can you describe what DuckDB is and the story behind it?
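A minimal sketch of that embedded, in-process idea: DuckDB can query a pandas DataFrame living in the same Python process, with no server to run. The DataFrame contents here are illustrative.

```python
# Minimal sketch: DuckDB as an embedded analytical engine querying an
# in-memory pandas DataFrame (a "replacement scan" on the local variable).
import duckdb
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"], "sales": [10, 20, 5]})

# DuckDB resolves `df` from the local Python scope.
result = duckdb.sql("SELECT city, SUM(sales) AS total FROM df GROUP BY city")
print(result.to_df())
```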
Note: Cloud data warehouses like Snowflake and BigQuery already have a default time travel feature. However, this feature becomes an absolute must-have if you are operating your analytics on top of your data lake or lakehouse. It can also be integrated into major data platforms like Snowflake. Contact phData today!
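As a hedged sketch of what lakehouse time travel looks like in practice, the `deltalake` package lets you pin a Delta table to an earlier version; the path reuses the illustrative table from the earlier sketch, and the version number is an assumption.

```python
# Hedged sketch: time travel on a Delta table by pinning a version.
# Path and version number are illustrative assumptions.
from deltalake import DeltaTable

dt = DeltaTable("/tmp/events_delta")              # table from the earlier sketch
print("current version:", dt.version())

old = DeltaTable("/tmp/events_delta", version=0)  # read the table as of version 0
print(old.to_pandas())
```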
Summary Databases and analytics architectures have gone through several generational shifts. A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization.
For machine learning applications, relational models require additional processing to be directly useful, which is why there has been a growth in the use of vector databases. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services.
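To show what a vector database does at its core, here is an illustrative sketch of nearest-neighbor search over embeddings, done brute-force with NumPy and cosine similarity; real engines add approximate indexes (e.g., HNSW) to make this scale. The dimensions and counts are arbitrary assumptions.

```python
# Illustrative sketch: brute-force cosine-similarity search, the core
# operation a vector database accelerates with specialized indexes.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 128))  # 1000 stored vectors
query = rng.normal(size=128)               # one query vector

# Normalize so the dot product equals cosine similarity.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
query /= np.linalg.norm(query)

scores = embeddings @ query                # cosine similarity to every vector
top5 = np.argsort(scores)[::-1][:5]        # indices of the best matches
print(top5, scores[top5])
```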
Summary The database market has seen unprecedented activity in recent years, with new options addressing a variety of needs being introduced on a nearly constant basis. Despite that, there are a handful of databases that continue to be adopted due to their proven reliability and robust features.
If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold. Data lakes are notoriously complex.
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.
I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications, and the impact on team dynamics. Data lakes are notoriously complex.