Disclaimer: Throughout this post, I discuss a variety of complex technologies but avoid trying to explain how they work. The goal of this post is to understand how data integrity best practices have been embraced time and time again, no matter the underlying technology. Then came Big Data and Hadoop!
In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology, and how you can start using it today to build a real-time data lake without all of the headache. Stream processing technologies have been around for about a decade. What do you have planned for the future of Estuary?
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data.
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. Data lakes are notoriously complex. Go to dataengineeringpodcast.com/dagster today to get started.
Summary A data lakehouse is intended to combine the benefits of data lakes (cost-effective, scalable storage and compute) and data warehouses (a user-friendly SQL interface). Data lakes are notoriously complex. To start, can you share your definition of what constitutes a "Data Lakehouse"?
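Since this definition recurs throughout the listing below, a concrete illustration may help. The following is a minimal sketch of the lakehouse idea, assuming a hypothetical S3 bucket of Parquet files and using DuckDB purely for illustration: warehouse-style SQL running directly over open-format files in lake storage.

```python
# A minimal sketch of the lakehouse idea: SQL (the warehouse interface)
# running directly over open-format files in object storage (the data lake).
# The bucket path is hypothetical and credentials/config are omitted.
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")  # extension that enables reading s3:// paths
con.sql("LOAD httpfs")

# Query Parquet files in place, with no load step into a proprietary warehouse.
result = con.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM read_parquet('s3://example-bucket/sales/*.parquet')
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")
print(result)
```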
In this episode DeVaris Brown discusses the types of applications that are possible when teams don't have to manage the complex infrastructure necessary to support continuous data flows. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.
Summary Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data there has been a struggle to merge fast, incremental updates with large, historical analysis.
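To make that tension concrete, here is a hedged sketch of the operation that is hard to do efficiently at lake scale: folding a small batch of incremental updates into a large historical table. Table and column names are hypothetical, and DuckDB's INSERT ... ON CONFLICT stands in for the MERGE support that lake table formats provide.

```python
# A sketch of folding a stream of incremental updates into a historical table.
# All names are hypothetical; DuckDB's upsert syntax is used for illustration.
import duckdb

con = duckdb.connect()
con.sql("""
    CREATE TABLE readings (
        device_id INTEGER PRIMARY KEY,
        value DOUBLE,
        updated_at TIMESTAMP
    )
""")
con.sql("INSERT INTO readings VALUES (1, 10.0, '2024-01-01'), (2, 20.0, '2024-01-01')")

# A late-arriving micro-batch: one update to an existing key, one new key.
con.sql("""
    INSERT INTO readings VALUES (2, 25.0, '2024-01-02'), (3, 30.0, '2024-01-02')
    ON CONFLICT (device_id) DO UPDATE
    SET value = excluded.value, updated_at = excluded.updated_at
""")
print(con.sql("SELECT * FROM readings ORDER BY device_id"))
```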
Summary Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. Data lakes are notoriously complex.
Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. Use cases change, needs change, technology changes – and therefore data infrastructure should be able to scale and evolve with change.
Summary The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access.
Summary Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is now a composition of multiple systems that need to be properly configured to work in concert.
Summary Building and maintaining a data lake is a choose-your-own-adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allow for generating significant value, but they can also lead to anti-patterns and inconsistent quality in your analytics.
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. Can you describe what SQLMesh is and the story behind it? DataOps is a term that has been co-opted and overloaded.
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. Data lakes in various forms have been gaining significant popularity as a unified interface to an organization's analytics. When is Fabric the wrong choice?
Summary One of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. In this episode Ori Rafael shares his experiences from Upsolver and building scalable stream processing for integrating and analyzing data, and what the tradeoffs are when coming from a batch-oriented mindset.
Summary Generative AI has rapidly transformed everything in the technology sector. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. Your first 30 days are free!
Summary Data lakes are gaining popularity due to their flexibility and reduced cost of storage. Along with the benefits there are some additional complexities to consider, including how to safely integrate new data sources or test out changes to existing pipelines. What do you have planned for the future of the project?
What if your data lake could do more than just store information? What if it could think like a database? As data lakehouses evolve, they transform how enterprises manage, store, and analyze their data. This represented a significant leap forward in data lakehouse technology.
Summary Designing a data platform is a complex and iterative undertaking which requires accounting for many conflicting needs. Designing a platform that relies on a data lake as its central architectural tenet adds additional layers of difficulty. Visit dataengineeringpodcast.com/montecarlo to learn more.
Summary The Presto project has become the de facto option for building scalable open source analytics in SQL for the data lake. Can you give an overview of the options that are available for someone wanting to use its SQL engine for querying their data? (Hudi, Delta Lake, Iceberg, Nessie, LakeFS, etc.)
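For readers who have not used Presto's SQL engine against a data lake, a hedged sketch of a typical client interaction follows, using the presto-python-client package; the host, catalog, schema, and table names are all placeholders for your own deployment.

```python
# A hedged sketch of querying a data lake through Presto's SQL interface.
# Host, catalog, schema, and table names are hypothetical placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host='presto.example.internal',  # hypothetical coordinator address
    port=8080,
    user='analyst',
    catalog='hive',      # e.g. a Hive metastore over files in object storage
    schema='analytics',
)
cur = conn.cursor()
cur.execute("SELECT event_type, COUNT(*) FROM events GROUP BY event_type")
for row in cur.fetchall():
    print(row)
```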
In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex.
In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform. Can you describe what role Trino and Iceberg play in Stripe's data architecture?
Nicola Askham found her way into data governance by accident, and stayed because of the benefit that she was able to provide by serving as a bridge between the technology and business. In this episode she shares the practical steps to implementing a data governance practice in your organization, and the pitfalls to avoid.
In this episode Pete Hunt, CEO of Dagster Labs, outlines these new capabilities, how they reduce the burden on data teams, and the increased collaboration that they enable across teams and business units. Data lakes are notoriously complex. Go to dataengineeringpodcast.com/dagster today to get started.
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. This episode is supported by Code Comments, an original podcast from Red Hat. Data lakes are notoriously complex. My thanks to the team at Code Comments for their support.
Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. Data lakes are notoriously complex.
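As a toy illustration of what a semantic layer provides, consider metric definitions maintained in one place and compiled to SQL on demand, so every consumer computes "revenue" the same way. All names below are hypothetical.

```python
# A toy illustration of the semantic-layer idea: business metrics are defined
# once, centrally, and compiled to SQL on demand, so a dashboard, a notebook,
# and a chatbot all compute "revenue" identically. All names are hypothetical.
METRICS = {
    "revenue": "SUM(amount)",
    "order_count": "COUNT(DISTINCT order_id)",
}

def metric_query(metric: str, dimension: str, table: str = "orders") -> str:
    """Compile a metric + dimension request into a SQL string."""
    expr = METRICS[metric]
    return (
        f"SELECT {dimension}, {expr} AS {metric} "
        f"FROM {table} GROUP BY {dimension}"
    )

print(metric_query("revenue", "region"))
# SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region
```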
He highlights the role of data teams in modern organizations and how Synq is empowering them to achieve this. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. Can you describe what Synq is and the story behind it?
Summary Artificial intelligence technologies promise to revolutionize business and produce new sources of value. Colleen Tartow has worked across all stages of the data lifecycle, and in this episode she shares her hard-earned wisdom about how to conduct an AI program for your organization. Data lakes are notoriously complex.
Summary Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. What do you have planned for the future of your academic research?
In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process. What is the current state of the ecosystem for data sharing protocols/practices/platforms?
In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern customer data platform (CDP).
What if you could streamline your efforts while still building an architecture that best fits your business and technology needs? Snowflake is committed to doing just that by continually adding features to help our customers simplify how they architect their data infrastructure.
Summary Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. Data lakes are notoriously complex. What is involved in integrating Nessie into a given data stack?
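To see why the catalog is that bridge, consider a toy model of what a catalog tracks: a mapping from logical table names to data files, plus a history of committed snapshots. This is a conceptual sketch only, not how Nessie itself is implemented.

```python
# A toy model of what a lakehouse catalog provides: logical table names mapped
# to file locations, plus a history of snapshots. This versioned indirection is
# what lets engines offer warehouse-style atomic commits over plain lake files.
# Conceptual illustration only; not Nessie's actual implementation.
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    snapshot_id: int
    data_files: list[str]

@dataclass
class CatalogTable:
    name: str
    snapshots: list[Snapshot] = field(default_factory=list)

    def commit(self, data_files: list[str]) -> None:
        """Atomically publish a new table version pointing at a set of files."""
        self.snapshots.append(Snapshot(len(self.snapshots) + 1, data_files))

    def current_files(self) -> list[str]:
        return self.snapshots[-1].data_files if self.snapshots else []

orders = CatalogTable("orders")
orders.commit(["s3://lake/orders/part-0.parquet"])
orders.commit(["s3://lake/orders/part-0.parquet", "s3://lake/orders/part-1.parquet"])
print(orders.current_files())  # readers always see the latest committed snapshot
```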
Ghalib Suleiman has been on both sides of this equation and joins the show to share his hard-won wisdom about how to start and grow a data team in the early days of company growth. Can you start by sharing your conception of the responsibilities of a data team? When is it more practical to outsource the data work?
Summary All of the advancements in our technology are based on the principle of abstraction. In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked, and some observations on how to deal with that situation in a data platform architecture.
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment.
In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data. Data lakes are notoriously complex. Can you start by outlining some of the situations where reconciling data between databases is needed?
In this episode Darren Haken is joined by Tejas Manohar to discuss how Autotrader UK is addressing their customer data needs by building on top of their existing data stack. What do you see as the risks/tradeoffs of moving CDP functionality into the same data stack as the rest of the organization?
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. This episode is supported by Code Comments, an original podcast from Red Hat. Data lakes are notoriously complex. Putting new technology to use is an exciting prospect.
In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team. Data lakes are notoriously complex.
Summary The most complicated part of data engineering is the effort involved in making the raw data fit into the narrative of the business. Master Data Management (MDM) is the process of building consensus around what the information actually means in the context of the business and then shaping the data to match those semantics.
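A deliberately naive sketch of one MDM step, survivorship, may make this concrete: duplicate records from several source systems are collapsed into a single golden record per entity. The match key and the most-recently-updated-wins rule below are purely illustrative.

```python
# A simplified sketch of one step in Master Data Management: collapsing
# duplicate customer records from several systems into a single "golden"
# record per entity. The match key and survivorship rule are deliberately
# naive and purely illustrative.
from datetime import date

records = [
    {"email": "ada@example.com",  "name": "Ada L.",       "updated": date(2023, 5, 1)},
    {"email": "ada@example.com",  "name": "Ada Lovelace", "updated": date(2024, 2, 9)},
    {"email": "alan@example.com", "name": "Alan Turing",  "updated": date(2024, 1, 3)},
]

golden: dict[str, dict] = {}
for rec in records:
    key = rec["email"].lower()  # naive match key; real MDM uses fuzzy matching
    # Survivorship rule: the most recently updated record wins.
    if key not in golden or rec["updated"] > golden[key]["updated"]:
        golden[key] = rec

for rec in golden.values():
    print(rec)
```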
In this episode he explains the data collection and preparation process, the collection of model types and sizes that work together to power the experience, and how to incorporate it into your workflow to act as a second brain. Data lakes are notoriously complex. Go to dataengineeringpodcast.com/dagster today to get started.
In this episode Paul Blankley and Ryan Janssen explore the power of natural language driven data exploration combined with semantic modeling that enables an intuitive way for everyone in the business to access the data that they need to succeed in their work. Can you describe what Zenlytic is and the story behind it?
Summary This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?