Sun. Dec 24, 2023

Troubleshooting Kafka In Production

Data Engineering Podcast

Summary: Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book "Kafka: Troubleshooting in Production". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitigate

SparkSQL is Destroying your Pipelines

Confessions of a Data Guy

It’s true, even if you don’t want it to be. SparkSQL is destroying your data pipelines and possibly wreaking havoc on your entire data team, infrastructure, and life. In your heart of hearts, you’ve probably known it for years. With great power comes great responsibility. We all know that even we Data Engineers are human […]

1.5 Years of Spark Knowledge in 8 Tips

Towards Data Science

My learnings from Databricks customer engagements. Figure 1: A technical diagram of how to write Apache Spark. Image by author. After working with ~15 of the largest retail organizations for the past 18 months, here are the Spark tips I commonly repeat. Throughout this post, we assume a general working knowledge of Spark and its structure, but this post should be accessible to all levels of Spark experience.

Data Engineering Weekly #154

Data Engineering Weekly

RudderStack is the Warehouse Native CDP, built to help data teams deliver value across the entire data activation lifecycle, from collection to unification and activation. Visit rudderstack.com to learn more. Sanjeev Mohan: "Unveiling the Crystal Ball: 2024 Data and AI Trends". Sanjeev & Rajesh, as usual, share their excellent observations about data & AI industry trends.

Apache Airflow® Best Practices for ETL and ELT Pipelines

Whether you’re creating complex dashboards or fine-tuning large language models, your data must be extracted, transformed, and loaded. ETL and ELT pipelines form the foundation of any data product, and Airflow is the open-source data orchestrator specifically designed for moving and transforming data in ETL and ELT pipelines. This eBook covers: An overview of ETL vs.