Summary: Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems.
Fluss is a compelling new project in the realm of real-time data processing. I spoke with Jark Wu, who leads the Fluss and Flink SQL team at Alibaba Cloud, to understand its origins and potential. It addresses many of Kafka's challenges in analytical infrastructure. How do you compare Fluss with Apache Kafka?
Summary: Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data, there has been a struggle to merge fast, incremental updates with large, historical analysis.
Ingest data more efficiently and manage costs: For data managed by Snowflake, we are introducing features that help you access data easily and cost-effectively. This reduces the overall complexity of getting streaming data ready to use: simply create an external access integration with your existing Kafka solution.
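As a rough sketch of what that setup can look like from code (the object names, broker host, and connection parameters below are illustrative placeholders, not values from the announcement), the Snowflake Python connector can issue the network rule and external access integration statements:

```python
# Hypothetical sketch: allow Snowflake to reach an existing Kafka broker via an
# external access integration. Names, host, and credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",          # placeholder account identifier
    user="my_user",                # placeholder user
    password="***",                # placeholder credential
    role="ACCOUNTADMIN",
)

statements = [
    # Network rule describing the outbound Kafka endpoint to allow.
    """
    CREATE OR REPLACE NETWORK RULE kafka_egress_rule
      MODE = EGRESS
      TYPE = HOST_PORT
      VALUE_LIST = ('kafka-broker.example.com:9092')
    """,
    # External access integration bundling that rule for later use.
    """
    CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION kafka_access_integration
      ALLOWED_NETWORK_RULES = (kafka_egress_rule)
      ENABLED = TRUE
    """,
]

cur = conn.cursor()
try:
    for stmt in statements:
        cur.execute(stmt)
finally:
    cur.close()
    conn.close()
```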
Summary: A data lake can be a highly valuable resource, as long as it is well built and well managed. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.
Summary: The current trend in data management is to centralize the responsibilities of storing and curating the organization's information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access.
In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
We also discuss the various systems using Kafka’s protocol. Confluent has never shied away from saying Kafka is “easy,” and I disagree. During the Kafka Summit London Keynote, the speakers said “easy” 17 times; in the Kafka Summit Bangalore Keynote, they said it 18 times. Using Confluent Cloud?
Summary: Designing a data platform is a complex and iterative undertaking which requires accounting for many conflicting needs. Designing a platform that relies on a data lake as its central architectural tenet adds additional layers of difficulty. When is a data lake architecture the wrong choice?
The ksqlDB project was created to address this state of affairs by building a unified layer on top of the Kafka ecosystem for stream processing. Developers can work with the SQL constructs that they are familiar with while automatically getting the durability and reliability that Kafka offers. How is ksqlDB architected?
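To make that concrete, here is a minimal sketch of defining a stream over an existing Kafka topic through ksqlDB's REST API; the server URL, topic name, and column schema are assumptions for illustration, not details from the article:

```python
# Minimal sketch: register a ksqlDB stream over a Kafka topic via the REST API.
# The endpoint, topic, and schema below are assumed, illustrative values.
import requests

KSQLDB_URL = "http://localhost:8088/ksql"  # assumed local ksqlDB server

statement = """
    CREATE STREAM pageviews (
        user_id VARCHAR,
        page    VARCHAR,
        ts      BIGINT
    ) WITH (
        KAFKA_TOPIC = 'pageviews',
        VALUE_FORMAT = 'JSON'
    );
"""

resp = requests.post(
    KSQLDB_URL,
    headers={"Accept": "application/vnd.ksql.v1+json"},
    json={"ksql": statement, "streamsProperties": {}},
)
resp.raise_for_status()
print(resp.json())
```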
In the early days, many companies simply used Apache Kafka® for data ingestion into Hadoop or another data lake. However, Apache Kafka is more than just messaging. Some Kafka and Rockset users have also built real-time e-commerce applications, for example, using Rockset's Java, Node.js
The alternative, however, provides more multi-cloud flexibility and strong performance on structured data. It incorporates elements from several Microsoft products working together, like Power BI, Azure Synapse Analytics, Data Factory, and OneLake, into a single SaaS experience.
CDP Public Cloud is now available on Google Cloud. The addition of support for Google Cloud enables Cloudera to deliver on its promise to offer its enterprise data platform at a global scale. CDP Public Cloud is already available on Amazon Web Services and Microsoft Azure.
On September 24, 2019, Cloudera launched CDP Public Cloud (CDP-PC) as the first step in delivering the industry's first Enterprise Data Cloud. Over the past year, we've not only added Azure as a supported cloud platform, but we have improved the original services while growing the CDP-PC family significantly: Improved Services.
Trains are an excellent source of streaming data—their movements around the network are an unbounded series of events. Using this data, Apache Kafka® and Confluent Platform can provide the foundations for both event-driven applications and an analytical platform. As with any real system, the data has “character.”
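For a flavor of how such movement events might be published, here is an illustrative sketch using the confluent-kafka Python client; the broker address, topic name, and event fields are assumptions rather than details from the article:

```python
# Illustrative sketch: publish a train-movement event to a Kafka topic.
# Broker, topic, and event fields are hypothetical.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker

def delivery_report(err, msg):
    # Report per-message delivery success or failure.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}]")

event = {
    "train_id": "1A23",                 # hypothetical identifiers
    "event_type": "DEPARTURE",
    "location": "LONDON_KINGS_CROSS",
    "timestamp": "2024-06-01T08:15:00Z",
}

producer.produce(
    "train-movements",
    key=event["train_id"],
    value=json.dumps(event),
    callback=delivery_report,
)
producer.flush()
```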
Cloudera also released CDP 7.2.9 on all three major cloud platforms, and it also brings Flow Management on DataHub with Apache NiFi 1.13.2. QueryNiFiReportingTask: this new reporting task allows you to run SQL queries against the internal monitoring data stored by NiFi (metrics, status, bulletins, provenance, etc.) to improve NiFi's monitoring.
Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. Materialize ([link]): You shouldn't have to throw away the database to build with fast-changing data.
Section 2: Types of Migrations for Infrastructure Focus. Storage migration: moving data between systems (HDD to SSD, SAN to NAS, etc.).
Gartner® recognized Cloudera in three recent reports – Magic Quadrant for Cloud Database Management Systems (DBMS), Critical Capabilities for Cloud Database Management Systems for Analytical Use Cases, and Critical Capabilities for Cloud Database Management Systems for Operational Use Cases.
In addition, AI data engineers should be familiar with programming languages such as Python, Java, Scala, and more for data pipeline, data lineage, and AI model development. Data Storage Solutions: As we all know, data can be stored in a variety of ways.
“Data Lake vs Data Warehouse = Load First, Think Later vs Think First, Load Later.” The terms data lake and data warehouse come up frequently when it comes to storing large volumes of data. What is a data lake?
Key features include workplan auctioning for resource allocation, in-progress remediation for handling data validation failures, and integration with external Kafka topics, achieving a throughput of 1.2 million entities per second in production.
Tableflow represents Kafka topics as Apache Iceberg (GA) and Delta Lake (EA) tables in a few clicks to feed any data warehouse, data lake, or analytics engine of your choice.
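As a hedged sketch of what consuming such a table might look like, the snippet below uses PyIceberg against a REST catalog; the catalog URI, credentials, and table name are placeholders, not Tableflow-documented values:

```python
# Hedged sketch: read an Iceberg table (e.g., one materialized from a Kafka
# topic) through a REST catalog with PyIceberg. All endpoints and names are
# placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakehouse",
    **{
        "type": "rest",
        "uri": "https://<catalog-endpoint>/iceberg",   # placeholder
        "credential": "<client-id>:<client-secret>",   # placeholder
    },
)

# Load the table and pull its rows into an Arrow table for inspection.
table = catalog.load_table("demo.orders")
arrow_table = table.scan().to_arrow()
print(arrow_table)
```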
Many of our customers — from Marriott to AT&T — start their journey with the Snowflake AI Data Cloud by migrating their data warehousing workloads to the platform. The company migrated from its outdated Teradata appliance to the Snowflake AI Data Cloud to resolve performance issues and meet growing data demands.
Data lakes are useful, flexible data storage repositories that enable many types of data to be stored in their rawest state. Traditionally, after being stored in a data lake, raw data was then often moved to various destinations like a data warehouse for further processing, analysis, and consumption.
The Snowflake Data Cloud gives you the flexibility to build a modern architecture of choice to unlock value from your data. Snowflake was built from the ground up in the cloud. Snowflake's platform provides industry-leading features that ensure the highest standards of governance for your account, users, and data.
Data engineering inherits from years of data practices at big US companies. Hadoop initially led the way with Big Data and distributed computing on-premises, to finally land on the Modern Data Stack — in the cloud — with a data warehouse at the center. Understand Change Data Capture — CDC.
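To ground the CDC pointer, here is a minimal, assumption-laden sketch of consuming Debezium-style change events from a Kafka topic and routing them by operation type; the broker, topic, and payload layout (an unwrapped, flattened Debezium record) are illustrative guesses:

```python
# Minimal CDC sketch: read Debezium-style change records from Kafka and route
# them by operation. Broker, topic, and payload shape are assumptions
# (a flattened record with "op", "before", and "after" fields).
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumed broker
    "group.id": "cdc-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["dbserver1.public.users"])  # hypothetical Debezium topic

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        change = json.loads(msg.value())
        op = change.get("op")   # Debezium ops: c=create, u=update, d=delete
        if op in ("c", "u"):
            print("upsert:", change.get("after"))
        elif op == "d":
            print("delete:", change.get("before"))
finally:
    consumer.close()
```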
In 2015, Cloudera became one of the first vendors to provide enterprise support for Apache Kafka, which marked the genesis of the Cloudera Stream Processing (CSP) offering. Today, CSP is powered by Apache Flink and Kafka and provides a complete, enterprise-grade stream management and stateful processing solution. Who is affected?
Cloudera Data Platform 7.2.1 introduces fine-grained authorization for access to Azure Data Lake Storage using Apache Ranger policies. Cloudera and Microsoft have been working together closely on this integration, which greatly simplifies the security administration of access to ADLS-Gen2 cloud storage.
In 2010, a transformative concept took root in the realm of data storage and analytics — a data lake. The term was coined by James Dixon, Back-End Java, Data, and Business Intelligence Engineer, and it started a new era in how organizations could store, manage, and analyze their data. What is a data lake?
In this blog, I will demonstrate the value of Cloudera DataFlow (CDF), the edge-to-cloud streaming data platform available on the Cloudera Data Platform (CDP), as a data integration and democratization fabric.
How to run dbt with BigQuery in GitHub Actions — When you're starting with dbt you don't need an orchestrator or dbt Cloud; a CI/CD pipeline can handle it just fine. Ensuring Data Consistency Across Replicas — Mixpanel details how they ensure that Kafka consumers in different zones write the data in the same manner.
From origin through all points of consumption, both on-prem and in the cloud, all data flows need to be controlled in a simple, secure, universal, scalable, and cost-effective way. Controlling distribution while also allowing the freedom and flexibility to deliver the data to different services is more critical than ever.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt, and seamlessly plugs into CI workflows. RudderStack helps you build a customer data platform on your warehouse or data lake.
Summary: Analytical workloads require a well-engineered and well-maintained data integration process to ensure that your information is reliable and up to date. Building a real-time pipeline for your data lakes and data warehouses is a non-trivial effort, requiring a substantial investment of time and energy.
In fact, each of the 29 finalists represented organizations running cutting-edge use cases that showcase a winning enterprise data cloud strategy. It serves the needs of consumers and businesses across an entire suite of products and services including mobile, fixed, broadband, data connectivity, Internet, and managed services.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. In fact, while only 3.5%
Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl. RudderStack helps you build a customer data platform on your warehouse or data lake. Struggling with broken pipelines?
Mention the podcast to get a free "In Data We Trust World Tour" t-shirt. RudderStack helps you build a customer data platform on your warehouse or data lake. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
In this blog post, we are going to take a look at some of the OpDB-related security features of a CDP Private Cloud Base deployment. We are going to talk about auditing, different security levels, security features of Data Catalog, and Client Considerations. It can aggregate and summarize access patterns from multiple data lakes.
Once we have identified those capabilities, the second article explores how the Cloudera Data Platform delivers those prerequisite capabilities and has enabled organizations such as IQVIA to innovate in healthcare with the Human Data Science Cloud. Business and Technology Forces Shaping Data Product Development.
2: The majority of Flink shops are in earlier phases of maturity. We talked to numerous developer teams who had migrated workloads from legacy ETL tools, Kafka Streams, Spark Streaming, or other tools for the efficiency and speed of Flink. Organizations are moving beyond a Kafka-is-everything mentality when it comes to streaming.