Summary Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. Data lakes are notoriously complex.
In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology and how you can start using it today to build a real-time data lake without all of the headache. What is the impact of continuous data flows on DAGs and the orchestration of transforms? RudderStack also supports real-time use cases.
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.
It addresses many of Kafka's challenges in analytical infrastructure. The combination of Kafka and Flink is not a perfect fit for real-time analytics, and the integration between Kafka and the lakehouse is very shallow. How do you compare Fluss with Apache Kafka? Fluss and Kafka differ fundamentally in design principles.
Summary Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data, there has been a struggle to merge fast, incremental updates with large, historical analysis.
Ingest data more efficiently and manage costs. For data managed by Snowflake, we are introducing features that help you access data easily and cost-effectively. This reduces the overall complexity of getting streaming data ready to use: simply create an external access integration with your existing Kafka solution.
Summary The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access.
Summary Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. In order to bring the DBA into the new era of data management, the team at Upsolver added a SQL interface to their data lake platform.
Summary A data lake can be a highly valuable resource, as long as it is well built and well managed. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.
Summary Designing a data platform is a complex and iterative undertaking which requires accounting for many conflicting needs. Designing a platform that relies on a data lake as its central architectural tenet adds additional layers of difficulty. When is a data lake architecture the wrong choice?
The ksqlDB project was created to address this state of affairs by building a unified layer on top of the Kafka ecosystem for stream processing. Developers can work with the SQL constructs that they are familiar with while automatically getting the durability and reliability that Kafka offers. How is ksqlDB architected?
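As an illustration of that developer experience, here is a minimal sketch using ksqlDB's Java client to register a stream over an existing topic. The host, port, stream name, and topic name are hypothetical, not taken from the episode.

```java
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;

public class KsqlDbExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ksqlDB server (host/port are placeholders).
        ClientOptions options = ClientOptions.create()
                .setHost("localhost")
                .setPort(8088);
        Client client = Client.create(options);

        // Declare a stream over an existing Kafka topic using familiar SQL;
        // ksqlDB manages the underlying Kafka consumers and state for us.
        client.executeStatement(
                "CREATE STREAM pageviews (user_id VARCHAR, url VARCHAR) "
                + "WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');"
        ).get();

        client.close();
    }
}
```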
In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex.
In the early days, many companies simply used Apache Kafka® for data ingestion into Hadoop or another data lake. However, Apache Kafka is more than just messaging. Some Kafka and Rockset users have also built real-time e-commerce applications, for example, using Rockset's Java and Node.js client libraries.
We also discuss the various systems using Kafka’s protocol. Confluent has never shied away from saying Kafka is “easy,” and I disagree. During the Kafka Summit London Keynote, the speakers said “easy” 17 times; in the Kafka Summit Bangalore Keynote, they said it 18 times. Want to use the Kafka protocol with Pulsar?
It incorporates elements from several Microsoft products working together, like Power BI, Azure Synapse Analytics, Data Factory, and OneLake, into a single SaaS experience. No matter the workload, Fabric stores all data on OneLake, a single, unified data lake built on the Delta Lake model.
[link] Alireza Sadeghi: Open Source Data Engineering Landscape 2025. This article provides a comprehensive overview of the 2025 open-source data engineering landscape, highlighting key trends, active projects, and emerging technologies. The proposal discusses how Kafka will implement queue functionality similar to SQS and RabbitMQ.
Trains are an excellent source of streaming data—their movements around the network are an unbounded series of events. Using this data, Apache Kafka® and Confluent Platform can provide the foundations for both event-driven applications and an analytical platform. As with any real system, the data has “character.”
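As a hedged sketch of what feeding such events into Kafka can look like, the following producer publishes a single hypothetical train movement event; the broker address, topic name, and JSON fields are assumptions, not details from the article.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TrainMovementProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each movement is one event; keying by train ID keeps a given
            // train's events ordered within a partition.
            String trainId = "1A99"; // hypothetical train identifier
            String event = "{\"train_id\":\"1A99\",\"station\":\"PADTON\",\"event\":\"ARRIVAL\"}";
            producer.send(new ProducerRecord<>("train-movements", trainId, event));
        }
    }
}
```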
The approach bridges the data and software engineering gap, offering a practical blueprint for scaling trustworthy data systems. [link] ManoMano: Handle errors in Spring Kafka consumers like a bliss - retry and DLT reporting for duty. The tiered-topic approach to handling backoff and DLQ made me think deeply about the pattern.
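The article's own code isn't reproduced here, but a minimal sketch of the pattern using Spring Kafka's built-in retry-topic support might look like this; the topic name and backoff settings are assumptions.

```java
import org.springframework.kafka.annotation.DltHandler;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.annotation.RetryableTopic;
import org.springframework.retry.annotation.Backoff;
import org.springframework.stereotype.Component;

@Component
public class OrderConsumer {

    // Spring Kafka provisions the tiered retry topics and a dead-letter
    // topic automatically; failed records are re-consumed from the retry
    // topics with the configured exponential backoff.
    @RetryableTopic(attempts = "4", backoff = @Backoff(delay = 1000, multiplier = 2.0))
    @KafkaListener(topics = "order-events")
    public void consume(String message) {
        process(message); // may throw, which triggers a retry
    }

    // Once all attempts are exhausted the record lands here, where it can
    // be logged or reported.
    @DltHandler
    public void handleDlt(String message) {
        System.err.println("Exhausted retries for: " + message);
    }

    private void process(String message) { /* business logic placeholder */ }
}
```

The appeal of the tiered-topic pattern is exactly this low ceremony: the retry plumbing lives in topics rather than in hand-rolled consumer loops.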
Tableflow represents Kafka topics as Apache Iceberg (GA) and Delta Lake (EA) tables in a few clicks to feed any data warehouse, data lake, or analytics engine of your choice.
Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake.
Data lakes are notoriously complex. For data engineers who battle to build and scale high-quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics.
The people behind Apache Kafka asked themselves the same question, so they invented the Kappa Architecture, where instead of having both batch and streaming layers, everything is real-time, with the whole stream of data stored in a central log like Kafka.
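A minimal sketch of the Kappa reprocessing idea: because the full history lives in the log, a new consumer group reading from the earliest offset can rebuild downstream state from scratch. The broker address, topic name, and apply() logic below are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KappaReplay {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");               // placeholder broker
        props.put("group.id", "replay-" + System.currentTimeMillis()); // fresh group = no saved offsets
        props.put("auto.offset.reset", "earliest");                    // so we start from the log's beginning
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Rebuild derived state by re-applying every event in order.
                    apply(record.value());
                }
            }
        }
    }

    private static void apply(String event) { /* recompute views/state */ }
}
```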
Key features include workplan auctioning for resource allocation, in-progress remediation for handling data validation failures, and integration with external Kafka topics, achieving a throughput of 1.2 million entities per second in production.
Data lakes are useful, flexible data storage repositories that enable many types of data to be stored in their rawest state. Traditionally, after being stored in a data lake, raw data was then often moved to various destinations like a data warehouse for further processing, analysis, and consumption.
A robust data infrastructure is a must-have to compete in the F1 business. We’ll build a data architecture to support our racing team, starting from the three canonical layers: Data Lake, Data Warehouse, and Data Mart. Data layers can often be combined, sometimes in a single platform.
“Data Lake vs Data Warehouse = Load First, Think Later vs Think First, Load Later.” The terms data lake and data warehouse come up frequently when it comes to storing large volumes of data. What is a data lake?
In 2010, a transformative concept took root in the realm of data storage and analytics: the data lake. The term was coined by James Dixon, a back-end Java, data, and business intelligence engineer, and it started a new era in how organizations could store, manage, and analyze their data. What is a data lake?
They no longer need to ask a small subset of the organization to provide them with information; rather, they have the tooling, systems, and capabilities to get the data they need. Data democratization has been a topic of conversation for the last few years, but mostly centered around data warehousing and data lakes.
In 2015, Cloudera became one of the first vendors to provide enterprise support for Apache Kafka, which marked the genesis of the Cloudera Stream Processing (CSP) offering. Today, CSP is powered by Apache Flink and Kafka and provides a complete, enterprise-grade stream management and stateful processing solution. Who is affected?
The company quickly realized maintaining 10 years’ worth of production data while enabling real-time data ingestion led to an unscalable situation that would have necessitated a data lake. Snowflake's separate clusters for ETL, reporting and data science eliminated resource contention.
Summary Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. In this interview with Daniel Whitenack: How did you get started in the data engineering space?
We offer both historical and low-latency data streams of on-chain data across multiple blockchains. How we use Apache Kafka and the Confluent Platform: Apache Kafka® is the central data hub of our company. Our pipeline includes a cluster of Ethereum nodes, an Ethereum-to-Kafka bridge, and a block confirmer based on Kafka Streams.
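Their actual implementation isn't shown, but a hypothetical Kafka Streams block confirmer could be sketched roughly as follows; the topic names and the confirmation rule are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class BlockConfirmer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "block-confirmer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read raw block events and forward only those deep enough in the
        // chain to be considered final (the depth rule is an assumption).
        KStream<String, String> blocks = builder.stream("eth-blocks");
        blocks.filter((blockHash, blockJson) -> isConfirmed(blockJson))
              .to("eth-blocks-confirmed");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    // Placeholder: in practice this would compare the block's number against
    // the current chain head minus N confirmations.
    private static boolean isConfirmed(String blockJson) { return true; }
}
```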
Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. RudderStack helps you build a customer data platform on your warehouse or data lake.
Data Engineering Tools: Data engineers need to be comfortable using essential tools for data pipeline management and workflow orchestration, including Apache Kafka, Apache Spark, Airflow, Dagster, dbt, and many more. Data Storage Solutions: As we all know, data can be stored in a variety of ways.
As capable as it is, there are still instances where MongoDB alone can't satisfy all of the requirements for an application, so getting a copy of the data into another platform via a change data capture (CDC) solution is required. Debezium: MongoDB CDC events can also be captured using Debezium.
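As a hedged illustration, Debezium's embedded engine can run the MongoDB connector in-process and hand each change event to a callback; the connection string, topic prefix, and offset file below are placeholders, and exact property names vary across Debezium versions.

```java
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MongoCdcExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "mongo-cdc");
        props.setProperty("connector.class", "io.debezium.connector.mongodb.MongoDbConnector");
        // Connection and offset settings are placeholders; property names
        // differ between Debezium versions.
        props.setProperty("mongodb.connection.string", "mongodb://localhost:27017/?replicaSet=rs0");
        props.setProperty("topic.prefix", "app");
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/mongo-offsets.dat");

        // The embedded engine runs the connector in-process and delivers each
        // change event to the callback instead of writing to Kafka Connect.
        DebeziumEngine<ChangeEvent<String, String>> engine =
                DebeziumEngine.create(Json.class)
                        .using(props)
                        .notifying(event -> System.out.println(event.value()))
                        .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine);
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try { engine.close(); } catch (Exception ignored) { }
        }));
    }
}
```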
Unbound by the limitations of a legacy on-premises solution, its multi-cluster shared data architecture separates compute from storage, allowing data teams to easily scale up and down based on their needs. With Snowflake’s Kafka connector, the technology team can ingest tokenized data as JSON into tables as VARIANT.
In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Start trusting your data with Monte Carlo today!
In this category I also recommend having a look at data ingestion tools (Airbyte, Fivetran, etc.). The main technologies around streaming are message buses like Kafka and processing frameworks like Flink or Spark on top of the bus. Understand Change Data Capture (CDC). You'll also be asked to put a data infrastructure in place.
Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl. RudderStack helps you build a customer data platform on your warehouse or data lake. Email hosts@dataengineeringpodcast.com with your story.
Data ingestion pipeline with Operation Management: At Netflix they annotate video, which can lead to thousands of annotations, and they need to manage the annotation lifecycle each time the annotation algorithm runs. Some companies also call it a lakehouse or a data lake, but the shift in wording is interesting to note.
Summary Analytical workloads require a well engineered and well maintained data integration process to ensure that your information is reliable and up to date. Building a real-time pipeline for your data lakes and data warehouses is a non-trivial effort, requiring a substantial investment of time and energy.
Those decentralization efforts appeared under different monikers through time, e.g., data marts versus data warehousing implementations (a popular architectural debate in the era of structured data), then enterprise-wide data lakes versus smaller, typically BU-specific, “data ponds”.
From origin through all points of consumption, both on-prem and in the cloud, all data flows need to be controlled in a simple, secure, universal, scalable, and cost-effective way. Controlling distribution while also allowing the freedom and flexibility to deliver the data to different services is more critical than ever.