Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.
The batch world has been the default for years because of the complexities of running a reliable streaming system at scale. In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology, and how you can start using it today to build a real-time data lake without all of the headaches.
Summary: Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. Cliff Crosland co-founded Scanner to provide fast querying of high-scale log data for security auditing. Can you describe what Scanner is and the story behind it?
Responding to data overload with a security data lake: Security professionals have to continually up their game to make sure that, from all the data at their disposal, they're using the correct inputs to identify vulnerabilities and incidents. We also discuss three layers of AI that can become an attack surface.
The simple idea was: how can we get more value from the transactional data in our operational systems spanning finance, sales, customer relationship management, and other siloed functions? There was no easy way to consolidate and analyze this data to more effectively manage our business. The answer: a data lake!
The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms.
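To make the OTF idea concrete, here is a minimal sketch of writing to an Apache Iceberg table from Python. It assumes the pyiceberg (0.6+) and pyarrow packages and an already-configured catalog named "default"; the namespace, table name, and columns are made up for illustration.

```python
# Minimal sketch: write to an Iceberg (open table format) table with pyiceberg.
# Assumes a catalog named "default" is configured (e.g. in ~/.pyiceberg.yaml);
# the namespace, table, and columns below are hypothetical.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")

rows = pa.table({
    "event_id": pa.array([1, 2, 3], type=pa.int64()),
    "payload": pa.array(["a", "b", "c"], type=pa.string()),
})

# Create the table from the Arrow schema if it does not already exist.
table = catalog.create_table_if_not_exists("analytics.events", schema=rows.schema)

# Append the Arrow batch; Iceberg records the new data files in a snapshot,
# so any engine reading the same catalog sees a consistent view of the table.
table.append(rows)
```

Because the table metadata lives in the catalog rather than in any one engine, Spark, Trino, or Snowflake could in principle read the same files without copying them.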
Summary: Any software system that survives long enough will require some form of migration or evolution. When that system is responsible for the data layer, the process becomes more challenging. Sriram Panyam has been involved in several projects that required migrating large volumes of data in high-traffic environments.
Summary: The first step of any data pipeline is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor.
Summary: A data lakehouse is intended to combine the benefits of data lakes (cost-effective, scalable storage and compute) and data warehouses (a user-friendly SQL interface).
Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. These patterns include both centralized storage patterns, such as the data warehouse, data lake, and data lakehouse, and distributed patterns such as data mesh.
Summary: Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging.
Summary: One of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. In this episode Ori Rafael shares his experiences from Upsolver building scalable stream processing for integrating and analyzing data, and what the tradeoffs are when coming from a batch-oriented mindset.
Recognizing this shortcoming, and the capabilities that could be unlocked by a robust solution, Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies.
Today we’re focusing on customers who migrated from a cloud data warehouse to Snowflake and some of the benefits they saw. A consolidated data system to accommodate a big(ger) WHOOP: when a company experiences exponential growth over a short period, it’s easy for its data foundation to feel a bit like it was built on the fly.
In a recent customer workshop with a large retail data science media company, one of the attendees, an engineering leader, made the following observation: “Every time I go to a competitor's website, they only care about their system. How to onboard data into their system? I don’t care about their system.”
Summary: The Presto project has become the de facto option for building scalable, open source SQL analytics on the data lake. Another area that has been seeing a lot of activity is data lakes and the projects that make them more manageable and feature complete (e.g. Hudi, Delta Lake, Iceberg, Nessie, LakeFS).
Summary: Designing a data platform is a complex and iterative undertaking which requires accounting for many conflicting needs. Designing a platform that relies on a data lake as its central architectural tenet adds additional layers of difficulty. When is a data lake architecture the wrong choice?
Summary: Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. In this episode, Yingjun Wu explains how his engine is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
Beyond working with well-structured data in a data warehouse, modern AI systems can use deep learning and natural language processing to work effectively with unstructured and semi-structured data in data lakes and lakehouses.
This reduces the overall complexity of getting streaming data ready to use: simply create an external access integration with your existing Kafka solution. SnowConvert is an easy-to-use code conversion tool that accelerates legacy relational database management system (RDBMS) migrations to Snowflake.
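As a rough illustration of that step, the sketch below issues the relevant DDL through the Snowflake Python connector. It assumes snowflake-connector-python and a role with the privilege to create network rules and integrations; the object names and broker host are hypothetical placeholders, not part of the product copy above.

```python
# Minimal sketch: allow Snowflake to reach an external Kafka broker by
# creating a network rule and an external access integration.
# Account, user, rule, integration, and broker names are all placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # placeholder
    user="my_user",              # placeholder
    password="...",              # use a secure credential store in practice
    role="ACCOUNTADMIN",
)

statements = [
    """
    CREATE OR REPLACE NETWORK RULE kafka_egress_rule
      MODE = EGRESS
      TYPE = HOST_PORT
      VALUE_LIST = ('broker.example.com:9092')
    """,
    """
    CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION kafka_access_integration
      ALLOWED_NETWORK_RULES = (kafka_egress_rule)
      ENABLED = TRUE
    """,
]

cur = conn.cursor()
try:
    for stmt in statements:
        cur.execute(stmt)  # permit outbound access to the Kafka endpoint
finally:
    cur.close()
    conn.close()
```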
Petr shares his journey from being an engineer to founding Synq, emphasizing the importance of treating data systems with the same rigor as engineering systems. He discusses the challenges and solutions in data reliability, including the need for transparency and ownership in data systems.
To address this shortcoming, Datorios created an observability platform for Flink that brings visibility to the internals of this popular stream processing system.
Snowflake is now making it even easier for customers to bring the platform’s usability, performance, governance and many workloads to more data with Iceberg tables (now generally available), unlocking full storage interoperability. Iceberg tables provide compute engine interoperability over a single copy of data.
In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform. What are the other systems that feed into and rely on the Trino/Iceberg service?
Data warehouses and data lakes play a crucial role for many businesses. They give businesses access to the data from all of their various systems, and often integrate that data so that end users can answer business-critical questions.
Anyone who’s been roaming around the forest of Data Engineering has probably run into many of the newish tools that have been growing rapidly around the concepts of Data Warehouses, Data Lakes, and Lakehouses … the merging of old relational database functionality with TB- and PB-level cloud-based file storage systems.
Summary: Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems, a number of common components have emerged that are shared across implementations.
Summary: Modern businesses aspire to be data-driven, and technologists enjoy working through the challenge of building data systems to support that goal. Data governance is the binding force between these two parts of the organization. What are some of the misconceptions that you encounter about data governance?
In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.
A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization.
In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Start trusting your data with Monte Carlo today! Supercharge your business teams with customer data using Hightouch for Reverse ETL today.
WhyLogs is a powerful library for flexibly instrumenting all of your data systems to understand the entire lifecycle of your data, from source to productionized model. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies.
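For a sense of what that instrumentation looks like, here is a minimal whylogs sketch that profiles a pandas DataFrame. It assumes whylogs v1 and pandas are installed; the columns and values are made up.

```python
# Minimal sketch: profile a DataFrame with whylogs and inspect the summary.
# The data below is hypothetical; in practice you would log each batch that
# flows through a pipeline and compare profiles over time.
import pandas as pd
import whylogs as why

df = pd.DataFrame({
    "amount": [10.5, 99.0, 3.2],
    "country": ["US", "DE", "US"],
})

results = why.log(df)                 # build a statistical profile of the batch
profile_view = results.view()         # per-column counts, types, distributions
print(profile_view.to_pandas())       # summary table, one row per column
```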
The Machine Learning Platform (MLP) team at Netflix provides an entire ecosystem of tools around Metaflow, an open source machine learning infrastructure framework we started, to empower data scientists and machine learning practitioners to build and manage a variety of ML systems.
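As a rough sketch of what a Metaflow workflow looks like, the example below defines a two-step flow: steps are methods decorated with @step and wired into a DAG with self.next(). The flow name and artifact are made up for illustration.

```python
# Minimal, self-contained Metaflow flow. Run it with: python hello_flow.py run
from metaflow import FlowSpec, step


class HelloFlow(FlowSpec):

    @step
    def start(self):
        # Anything assigned to self becomes a versioned artifact of the run.
        self.message = "hello from Metaflow"
        self.next(self.end)

    @step
    def end(self):
        print(self.message)


if __name__ == "__main__":
    HelloFlow()
```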
Summary: Building a data platform is a substantial engineering endeavor. The services and systems need to be kept up to date, but so does the code that controls their behavior.
They opted for Snowflake, a cloud-native data platform ideal for SQL-based analysis. The team landed the data in a data lake implemented with cloud storage buckets and then loaded it into Snowflake, enabling fast access and smooth integrations with analytical tools. Here is another example.
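A minimal sketch of that "land in cloud storage, then load into Snowflake" pattern, again using the Python connector: the stage, table, and bucket names are hypothetical, and credentials/storage-integration details are deliberately elided.

```python
# Minimal sketch: point an external stage at the data lake bucket, then bulk
# load the staged Parquet files into a Snowflake table with COPY INTO.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",  # placeholders
    warehouse="LOAD_WH", database="ANALYTICS", schema="RAW",
)
cur = conn.cursor()
try:
    # External stage over the data lake location (auth details elided).
    cur.execute("""
        CREATE STAGE IF NOT EXISTS raw_events_stage
        URL = 's3://example-data-lake/events/'
        FILE_FORMAT = (TYPE = PARQUET)
    """)
    # Load the staged files into a table, matching columns by name.
    cur.execute("""
        COPY INTO raw_events
        FROM @raw_events_stage
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)
finally:
    cur.close()
    conn.close()
```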
Summary: Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. What is involved in integrating Nessie into a given data stack?
While data warehouses are still in use, they are limited in use cases as they only support structured data. Data lakes add support for semi-structured and unstructured data, and data lakehouses add further flexibility with better governance in a true hybrid solution built from the ground up.
ERP and CRM systems are designed and built to fulfil a broad range of business processes and functions. This generalisation makes their data models complex and cryptic, and working with them requires domain expertise. As you do not want to start your development with uncertainty, you decide to go for the operational raw data directly.