Data Management and Kafka - Data Engineering Digest

Troubleshooting Kafka In Production

Data Engineering Podcast

DECEMBER 24, 2023

Summary: Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. Can you describe your experiences with Kafka?

Kafka

Kafka Data Lake High Quality Data SQL

Beyond Kafka: Conversation with Jark Wu on Fluss - Streaming Storage for Real-Time Analytics

Data Engineering Weekly

FEBRUARY 18, 2025

It addresses many of Kafka's challenges in analytical infrastructure. The combination of Kafka and Flink is not a perfect fit for real-time analytics; the integration of Kafka and Lakehouse is very shallow. How do you compare Fluss with Apache Kafka? Fluss and Kafka differ fundamentally in design principles.

Kafka

Kafka Lambda Architecture SQL Data Lake

Realtime Data Applications Made Easier With Meroxa

Data Engineering Podcast

APRIL 23, 2023

In this episode DeVaris Brown discusses the types of applications that are possible when teams don't have to manage the complex infrastructure necessary to support continuous data flows. Can you describe what Meroxa is and the story behind it? How have the focus and goals of the platform and company evolved over the past 2 years?

Data Lake

Data Lake Kafka Machine Learning Data Warehouse

How to Get Started with Kafka Topics: A Beginner's Guide

ProjectPro

JUNE 6, 2025

Taming the torrent of data pouring into your systems can be daunting. Kafka Topics are your trusty companions. Learn how Kafka Topics simplify the complex world of big data processing in this comprehensive blog. More than 80% of all Fortune 100 companies trust and use Kafka. Table of Contents: What is Kafka Topic?

Kafka

Kafka Big Data Python Java

Easier Stream Processing On Kafka With ksqlDB

Data Engineering Podcast

MARCH 2, 2020

The ksqlDB project was created to address this state of affairs by building a unified layer on top of the Kafka ecosystem for stream processing. Developers can work with the SQL constructs that they are familiar with while automatically getting the durability and reliability that Kafka offers. How is ksqlDB architected?

Kafka

Kafka Process PostgreSQL MySQL

Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection

Data Engineering Podcast

JULY 1, 2019

Scaling the volume of events that can be processed in real-time can be challenging, so Paul Brebner from Instaclustr set out to see how far he could push Kafka and Cassandra for this use case. Can you start by describing the problem that you were trying to solve and the requirements that you were aiming for?

Kafka

Kafka Finance Media Architecture

Data Management Trends From An Investor Perspective

Data Engineering Podcast

JUNE 8, 2020

Summary: The landscape of data management and processing is changing rapidly. This is a useful conversation to gain a macro perspective on where businesses are looking to improve their capabilities to work with data. If you hand a book to a new data engineer, what wisdom would you add to it?

Data Management

Data Management Management Machine Learning Kafka

Simplifying Data Architecture and Security to Accelerate Value

Snowflake

NOVEMBER 11, 2024

Ingest data more efficiently and manage costs: For data managed by Snowflake, we are introducing features that help you access data easily and cost-effectively. This reduces the overall complexity of getting streaming data ready to use: Simply create external access integration with your existing Kafka solution.

Data Architecture

Data Architecture Architecture Data Lake Kafka

Top 10 Data Engineering Tools You Must Learn in 2025

ProjectPro

JUNE 6, 2025

Data engineering tools are specialized applications that make building data pipelines and designing algorithms easier and more efficient. These tools are responsible for making the day-to-day tasks of a data engineer easier in various ways. It's one of the fastest platforms for data management and stream processing.

Data Engineer

Data Engineer Data Engineering Engineering Kafka

IBM Technology Chooses Cloudera as its Preferred Partner for Addressing Real Time Data Movement Using Kafka

Cloudera

SEPTEMBER 26, 2023

IBM and Cloudera’s common goal is to accelerate data-driven decision making for enterprise customers, working on defining and executing the best solution for each customer. You can now elevate your data potential and activate AI’s capabilities through the synergic integration between IBM watsonx and Cloudera.

Kafka

Kafka Technology IT Government

30+ Data Engineering Projects for Beginners in 2025

ProjectPro

JUNE 6, 2025

Project Idea: Build a data pipeline to ingest data from APIs like CoinGecko or Kaggle’s crypto datasets. Fetch live data using the CoinMarketCap API to monitor cryptocurrency prices. Use Kafka for real-time data ingestion, preprocess with Apache Spark, and store data in Snowflake.

Data Engineer

Data Engineer Data Engineering Project Engineering

Data News — Week 24.11

Christophe Blefari

MARCH 15, 2024

Understand how BigQuery inserts, deletes and updates: once again Vu took time to deep dive into BigQuery internals, this time to explain how data management is done. Pandera, a data validation library for dataframes, now supports Polars. Unlocking Kafka's potential: tackling tail latency with eBPF.

Metadata

Metadata Software Engineering Software Engineer Data Warehouse

Keep Your Data Lake Fresh With Real Time Streams Using Estuary

Data Engineering Podcast

MAY 21, 2023

In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology and how you can start using it today to build a real-time data lake without all of the headache. Can you describe what Estuary is and the story behind it? Stream processing technologies have existed for around a decade.

Data Lake

Data Lake Kafka Machine Learning Data Warehouse

Tackling Real Time Streaming Data With SQL Using RisingWave

Data Engineering Podcast

FEBRUARY 4, 2024

In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex.

SQL

SQL Data Lake High Quality Data Kafka

Cloudera acquires Eventador to accelerate Stream Processing in Public & Hybrid Clouds

Cloudera

OCTOBER 12, 2020

The DataFlow platform has established a leading position in the data streaming market by unlocking the combined value and synergies of Apache NiFi, Apache Kafka and Apache Flink. We recently delivered all three of these streaming capabilities as cloud services through Cloudera Data Platform (CDP) Data Hub on AWS and Azure.

Cloud

Cloud Process Scala Kafka

Reducing The Barrier To Entry For Building Stream Processing Applications With Decodable

Data Engineering Podcast

OCTOBER 15, 2023

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team.

Process

Process Building SQL BI

Data Pipeline- Definition, Architecture, Examples, and Use Cases

ProjectPro

JUNE 6, 2025

In other words, you will write code to carry out one step at a time and then feed the desired data into machine learning models, either to train sentiment analysis models or to evaluate the sentiment of reviews, depending on the use case. You can use big-data processing tools like Apache Spark, Kafka, and more to create such pipelines.

Data Pipeline

Data Pipeline Architecture Kafka Data Lake

Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17

Data Engineering Podcast

FEBRUARY 3, 2018

Summary: One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream-oriented systems such as Kafka have been rising in prominence.

Kafka

Kafka Data Pipeline Data Engineer Data Engineering

Top 21 Big Data Tools That Empower Data Wizards

ProjectPro

JUNE 6, 2025

Data scientists and engineers typically use the ETL (Extract, Transform, and Load) tools for data ingestion and pipeline creation. For implementing ETL, managing relational and non-relational databases, and creating data warehouses, big data professionals rely on a broad range of programming and data management tools.

Big Data Tools

Big Data Tools Big Data Hadoop Kafka

Speed Up And Simplify Your Streaming Data Workloads With Red Panda

Data Engineering Podcast

SEPTEMBER 28, 2020

Summary: Kafka has become a de facto standard interface for building decoupled systems and working with streaming data. To make the benefits of the Kafka ecosystem more accessible and reduce the operational burden, Alexander Gallego and his team at Vectorized created the Red Panda engine.

Kafka

Kafka BI Big Data Data Engineer

Scylla and Confluent Integration for IoT Deployments

Confluent

MAY 22, 2019

In light of this, we’ll share an emerging machine-to-machine (M2M) architecture pattern in which MQTT, Apache Kafka®, and Scylla all work together to provide an end-to-end IoT solution. The explosive number of devices generating, tracking and sharing data across a variety of networks is overwhelming to most data management solutions.

Kafka

Kafka Google Cloud NoSQL Entertainment

Data Engineering Weekly #219

Data Engineering Weekly

MAY 4, 2025

Meta: How Meta understands data at scale. Meta describes its data management practices as adopting a “shift-left” approach, integrating data schematization and annotations early in product development. TIL about the idle stream problem.

Data Engineer

Data Engineer Data Engineering Engineering Java

Databricks Delta Lake: A Scalable Data Lake Solution

ProjectPro

JUNE 6, 2025

This helps data scientists and business analysts access and analyze all the data at their disposal. To gain a deeper understanding of Databricks Delta Lake and how it can revolutionize the way we approach data management, read on. Table of Contents: Why use Databricks Delta lake?

Data Lake

Data Lake Data Warehouse Metadata Unstructured Data

Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10

Data Engineering Podcast

DECEMBER 10, 2017

When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. Conversely, what would be involved in using a storage backend other than Kafka?

Kafka

Kafka Data Pipeline Data Engineer Data Engineering

How Cloudera Data Flow Enables Successful Data Mesh Architectures

Cloudera

OCTOBER 7, 2021

The concept of the data mesh architecture is not entirely new; Its conceptual origins are rooted in the microservices architecture, its design principles (i.e., need to integrate multiple “point solutions” used in a data ecosystem) and organization reasons (e.g., The Value Proposition of CDF in Data Mesh Implementations.

Architecture

Architecture Metadata Kafka Government

StreamNative Brings Streaming Data To The Cloud Native Landscape With Pulsar

Data Engineering Podcast

MAY 11, 2020

Pulsar is a well engineered and robust platform for building the core of any system that relies on durable access to easily scalable streams of data. What is Pulsar’s role in the lifecycle of data and where does it fit in the overall ecosystem of data tools? Why is streaming data such an important capability?

Cloud

Cloud Lambda Architecture Kafka Hadoop

DataOps For Streaming Systems With Lenses.io

Data Engineering Podcast

JULY 6, 2020

Observability and governance of streaming data requires a different approach than batch oriented workflows, and this episode does an excellent job of outlining the complexities involved and how to address them. If you hand a book to a new data engineer, what wisdom would you add to it? Redis and Pulsar)?

Systems

Systems Kafka SQL Government

Top 10 Data Engineering Trends in 2025

Edureka

APRIL 22, 2025

It lets you describe data more complexly and make predictions. AI-powered data engineering solutions make it easier to streamline the data management process, which helps businesses find useful insights with little to no manual work. This will help make better analytics predictions and improve data management.

Data Engineer

Data Engineer Data Engineering Engineering Consulting

Digital Transformation is a Data Journey From Edge to Insight

Cloudera

JANUARY 20, 2021

In order to enable connected manufacturing and emerging IoT use cases, ECC needs a solution that can handle all types of diverse data structures and schemas from the edge, normalize the data, and then share it with any type of data consumer including Big Data applications. STEP 4: Capture data from Apache Kafka streams.

Manufacturing

Manufacturing Data Warehouse Kafka Retail

Change Data Capture For All Of Your Databases With Debezium

Data Engineering Podcast

JANUARY 5, 2020

If you have ever struggled with implementing your own change data capture pipeline, or understanding when it would be useful then this episode is for you. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.

Database

Database Kafka PostgreSQL MySQL

Druid Deprecation and ClickHouse Adoption at Lyft

Lyft Engineering

NOVEMBER 29, 2023

Druid Data Ingestion: Our pipeline for the two methods of ingesting data into Druid. The upper process is for batch ingestion, the lower process is for real-time ingestion. Then, they needed to define an ingestion specification which tells Druid how to process the data being ingested. This was our main form of ingestion.

Kafka

Kafka Data Ingestion Architecture Datasets

ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25

Data Engineering Podcast

APRIL 1, 2018

In this episode ThreatStack’s director of operations, Pete Cheslock, and senior infrastructure security engineer, Patrick Cable, discuss the data infrastructure that supports their platform, how they capture and process the data from client systems, and how that information can be used to keep your systems safe from attackers.

Amazon Web Services

Amazon Web Services Cloud PostgreSQL Kafka

The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33

Data Engineering Podcast

MAY 27, 2018

Preamble: Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. What is Alooma and what is the origin story? How is the Alooma platform architected?

Data Pipeline

Data Pipeline MongoDB Scala Kafka

Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

Data Engineering Podcast

NOVEMBER 18, 2018

Preamble: Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. What are some of the primary ways that Flink is used?

Process

Process Scala Kafka Google Cloud

Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59

Data Engineering Podcast

DECEMBER 2, 2018

Preamble: Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode.

Systems

Systems Building Kafka Java

20 Best Open Source Big Data Projects to Contribute on GitHub

ProjectPro

JUNE 6, 2025

CMAK (Source: GitHub). CMAK, short for Cluster Manager for Apache Kafka and previously known as Kafka Manager, is a tool for managing Apache Kafka clusters. CMAK is developed to help the Kafka community. It's an open-source database and data management framework.

Big Data

Big Data Project Metadata Programming Language

Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39

Data Engineering Podcast

JULY 8, 2018

They also explained how it fits in the broad landscape of data tools, the interesting and challenging aspects of the project, and how to build new extensions. Where does it sit in the broader landscape of data tools? How do you manage versioning and backup of data flows, as well as promoting them between environments?

Building

Building Transportation Kafka Java

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

Data Engineering Podcast

DECEMBER 9, 2018

He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem. How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm? Can you start by explaining what Spark is? Who uses Spark?

Kafka

Kafka Scala MySQL Hadoop

Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63

Data Engineering Podcast

DECEMBER 31, 2018

Preamble: Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. How do you represent a stream on-disk?

Lambda Architecture

Lambda Architecture Process Data Process Kafka

Building a Multi-Tenant Managed Platform For Streaming Data With Pulsar at Datastax

Data Engineering Podcast

JULY 27, 2021

This is a fascinating conversation with a lot of useful lessons for anyone who wants to understand the operational aspects of Pulsar and the benefits that it can provide to data workloads. How do you handle migrating existing projects, particularly if they are using Kafka currently?

Management

Management Building Kafka Data Warehouse

Charting A Path For Streaming Data To Fill Your Data Lake With Hudi

Data Engineering Podcast

AUGUST 3, 2021

In this episode Vinoth shares the history of the project, how its architecture allows for building more frequently updated analytical queries, and the work being done to add a more polished experience to the data lake paradigm. Interview: Introduction. How did you get involved in the area of data management?

Data Lake

Data Lake Data Warehouse Hadoop Kafka

Adding Support For Distributed Transactions To The Redpanda Streaming Engine

Data Engineering Podcast

OCTOBER 5, 2021

Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode.

Engineering

Engineering MongoDB Kafka Data Lake

Astronomer with Ry Walker - Episode 6

Data Engineering Podcast

AUGUST 6, 2017

Interview: Introduction. How did you first get involved in the area of data management? Regulatory challenges of processing other people’s data. What does your data pipelining architecture look like? What are the most challenging aspects of building a general purpose data management environment?

MongoDB

MongoDB PostgreSQL Kafka Data Pipeline
