Blog - Data Engineering Digest

Are LLMs making StackOverflow irrelevant?

The Pragmatic Engineer

JANUARY 21, 2025

Before the rise of this technology, StackOverflow was the superior option to Googling in the hope of finding a blog post which answered a question. And if you couldn’t find an answer to a problem, you could post a question on StackOverflow and someone would probably answer it.

Software Engineer

Software Engineer Software Engineering Engineering Coding

An educational side project

The Pragmatic Engineer

JUNE 1, 2023

[…] for the simulation engine; Go on the backend; PostgreSQL for the data layer; React and TypeScript on the frontend; Prometheus and Grafana for monitoring and observability. And if you were wondering how all of this was built, Juraj documented his process in an incredible, 34-part blog series. You can read this here. Serving a web page.

Education

Education Project PostgreSQL Software Engineer

Event time skew and global watermark in Apache Spark Structured Streaming

Waitingforcode

JANUARY 15, 2025

A few months ago I wrote a blog post about event skew and how dangerous it is for a stateful streaming job. Since it was a high-level explanation, I didn't cover Apache Spark Structured Streaming deeply at that moment.

IT

Azure Databricks: A Comprehensive Guide

Analytics Vidhya

FEBRUARY 28, 2023

In this blog post, we will take a closer look at Azure Databricks, its key features, […] Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform that is built on top of the Microsoft Azure cloud.

Big Data

Big Data Machine Learning Cloud Data Process

Enhancing Neural Network Training at Yelp: Achieving 1,400x Speedup with WideAndDeep

Yelp Engineering

JANUARY 21, 2025

This blog post delves into our journey of optimizing training time using TensorFlow and Horovod, along with the development of ArrowStreamServer, our in-house library for low-latency data streaming and serving. These models handle large tabular datasets with small parameter spaces, requiring innovative data solutions.

Datasets

Datasets Architecture Data Solutions Data

Explore the World of Data-Tech with DataHour

Analytics Vidhya

MARCH 10, 2023

In this blog post, we […] Current professionals seeking to transition into the data-tech domain or data science professionals seeking to enhance their career growth and development can also benefit from these sessions.

Data Science

Data Science Data MySQL Machine Learning

How Meta discovers data flows via lineage at scale

Engineering at Meta

JANUARY 22, 2025

In this blog, we will delve into an early stage in PAI implementation: data lineage. This took Meta multiple years to complete across our millions of disparate data assets, and we'll cover each of these more deeply in future blog posts: Inventorying involves collecting various code and data assets (e.g.,

Data Warehouse

Data Warehouse SQL Programming Language Data

Netflix’s Distributed Counter Abstraction

Netflix Tech

NOVEMBER 12, 2024

By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. Introduction: In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. For more information regarding this, refer to our previous blog.

Datasets

Datasets Computer Science Systems Kafka

Apache XTable. Delta vs Iceberg vs Hudi.

Confessions of a Data Guy

MARCH 4, 2025

The blog post reviews an Apache Incubating project called Apache XTable, which aims to provide cross-format interoperability among Delta Lake, Apache Hudi, and Apache Iceberg. Below is a concise breakdown from some time I spent playing around with this new tool and some technical observations: 1. What is Apache XTable?

Project

Project Data IT Big Data

How to Implement a Data Pipeline Using Amazon Web Services?

Analytics Vidhya

FEBRUARY 6, 2023

To make these processes efficient, data pipelines are necessary. Data engineers specialize in building and maintaining these data pipelines that underpin the analytics ecosystem. In this blog, we will […]

Amazon Web Services

Amazon Web Services Data Pipeline Machine Learning Data Science

Data Engineering Weekly #198

Data Engineering Weekly

NOVEMBER 24, 2024

The blog highlights how moving from 6-character base-64 to 20-digit base-2 file distribution brings more distribution in S3 and reduces request failures. The blog is a good summary of how to use Snowflake QUERY_TAG to measure and monitor query performance. The blog post made me curious to understand DataFusion's internals.

Data Engineering

Data Engineering Data Engineer Engineering Insurance

Data Engineering Weekly #195

Data Engineering Weekly

OCTOBER 27, 2024

The blog is an excellent summary of the existing unstructured data landscape. It is exciting to read probably the first blog on building a vector search infrastructure at scale. The blog from Meta discusses how it designed a privacy-preserving storage. [link] Alibaba: Evolution of Flink 2.0

Data Engineering

Data Engineering Data Engineer Engineering Unstructured Data

Cloud Data Warehouse Migrations: Success Stories from WHOOP and Nexon

Snowflake

NOVEMBER 26, 2024

This blog post is the second in a three-part series on migrations. That’s why we’ve collected these migration success stories to help you get started on your migration to Snowflake. Today we’re focusing on customers who migrated from a cloud data warehouse to Snowflake and some of the benefits they saw.

Data Warehouse

Data Warehouse Cloud PostgreSQL Hadoop

Platform as a Service (PaaS)

WeCloudData

APRIL 22, 2025

This blog provides detailed information on data Platform as a Service (PaaS), how it differs from other cloud computing models, its working principles, and its benefits. PaaS is a fundamental cloud computing model that offers developers and organizations a robust environment for building, deploying, and managing applications efficiently.

Cloud Computing

Cloud Computing Cloud Building Management

PagerDuty alternatives

The Pragmatic Engineer

MAY 12, 2023

An explanation on why such a short blog post: I wanted to reply in a tweet, but, apparently, Twitter does not allow posting more than a few links in a reply. OpsGenie is clearly more critical of a system than the ones like JIRA or Confluence, but it is not treated with priority within the Atlassian stack, at least now it seems like it.

Systems

Systems Management Engineering IT

The Struggle Between Data Dark Ages and LLM Accuracy

Cloudera

DECEMBER 6, 2024

And specifically, I was reading one of your blog posts recently that talked about the dark ages of data. Here are some key takeaways from Ray in that conversation. 85% accuracy for customer experience means that number isn't bad.

Manufacturing

Manufacturing Retail Finance Metadata

Data Engineering Weekly #217

Data Engineering Weekly

APRIL 20, 2025

The blog took out the last edition’s recommendation on AI and summarized the current state of AI adoption in enterprises. The simplistic model expressed in the blog made it easy for me to reason about the transactional system design. Kafka is probably the most reliable data infrastructure in the modern data era.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Data News — Week 25.02

Christophe Blefari

JANUARY 11, 2025

Over the past four weeks, I took a break from blogging and LinkedIn to focus on building nao. We published videos about the Forward Data Conference; you can watch the keynote by Hannes, DuckDB co-creator, about Changing Large Tables.

Data

Data Data Warehouse Coding Programming Language

Databricks on Databricks: Kicking off the Journey to Governance with Unity Catalog

databricks

JULY 23, 2024

In this blog, we are excited to share Databricks's journey in migrating to Unity Catalog for enhanced data governance. We'll discuss our high-level strategy and the tools we developed to facilitate the migration. Our goal is to highlight the benefits of Unity Catalog and make you feel confident about transitioning to it.

Government

Government Data Governance IT Data

Data Engineering Weekly #210

Data Engineering Weekly

MARCH 2, 2025

I found the blog to be a fresh take on the skill in demand by layoff datasets. The blog provides an excellent analysis of smallpond compared to Spark and Daft. Understanding which skills are in growing demand and the need for upskilling as the software abstraction changes is critical. [link] Mehdio: DuckDB goes distributed?

Data Engineering

Data Engineering Data Engineer Engineering Datasets

Adding Write Functionality to Pages with Self-Service APIs

Picnic Engineering

APRIL 14, 2025

(Written by Kirill Voloshin & Abdullah Abusamrah) In our previous blog posts, we have covered our server-driven UI framework called Picnic Page Platform. This blog post explores how we've further evolved our framework to support more complex flows that interact with our back-end systems, persist data and more.

Java

Java Retail SQL Database

2024 retrospective on waitingforcode.com

Waitingforcode

JANUARY 5, 2025

Even though I was blogging less in the second half of the previous year, the retrospective is still the blog post I'm waiting for each year. Every year I summarize what happened in the past 12 months and share with you my future plans. It's time for the 2024 Edition!

IT

Fueling the Future of GenAI with NiFi: Cloudera DataFlow 2.9 Delivers Enhanced Efficiency and Adaptability

Cloudera

DECEMBER 4, 2024

Learn More: To explore the new capabilities of Cloudera DataFlow 2.9 and discover how it can transform your data pipelines, watch this video.

Data Pipeline

Data Pipeline Data Ingestion Data Preparation Architecture

Octopai Acquisition Enhances Metadata Management to Trust Data Across Entire Data Estate

Cloudera

NOVEMBER 13, 2024

The post Octopai Acquisition Enhances Metadata Management to Trust Data Across Entire Data Estate appeared first on Cloudera Blog.

Metadata

Metadata Management Data Governance Government

Improving Pinterest Search Relevance Using Large Language Models

Pinterest Engineering

APRIL 4, 2025

In this blog, we will go through the technical design and share some offline and online results for our LLM-based search relevance pipeline. Pin Text Representations Pins on Pinterest are rich multimedia entities that feature images, videos, and other contents, often linked to external webpages or blogs.

Machine Learning

Machine Learning Metadata Architecture Datasets

10 GitHub Repositories to Master Python

KDnuggets

APRIL 10, 2024

Learn Python through tutorials, blogs, books, project work, and exercises. Access all of it on GitHub for free and join a supportive open-source community.

Python

Python Accessible Accessibility Project

SwiftKV Cuts LLM Inference Costs by 75% with Snowflake Cortex AI

Snowflake

JANUARY 16, 2025

You can learn more in our SwiftKV research blog post. Because SwiftKV is fully open source, you can also deploy it on your own with model checkpoints on Hugging Face and optimized inference on vLLM.

Algorithm

Algorithm Data Analysis Building Process

Private Cloud

WeCloudData

APRIL 16, 2025

This blog explores the fundamentals of the private cloud framework. A private or enterprise cloud is the type of cloud computing in which all the resources are dedicated to a single tenant. Private cloud allows organizations a high level of cloud computing benefits such as scalability, flexibility, access control, and faster service delivery.

Cloud

Cloud Cloud Computing Accessible Accessibility

How Netflix Accurately Attributes eBPF Flow Logs

Netflix Tech

APRIL 8, 2025

By Cheng Xie, Bryan Shultz, and Christine Xu. In a previous blog post, we described how Netflix uses eBPF to capture TCP flow logs at scale for enhanced network insights. In this post, we delve deeper into how Netflix solved a core problem: accurately attributing flow IP addresses to workload identities.

AWS

AWS Kafka Cloud Programming

Data Engineering Weekly #207

Data Engineering Weekly

FEBRUARY 9, 2025

I found the product blog from QuantumBlack gives a view of data quality in unstructured data. The blog post highlights the industry trend of search engines transitioning towards embedding-based systems, moving beyond traditional IDF models.

Data Engineering

Data Engineering Data Engineer Engineering Unstructured Data

Webinar: Data Quality in a Medallion Architecture – 2024

DataKitchen

DECEMBER 6, 2024

Read the popular blog article. Get the DataOps Advantage: Learn how to apply DataOps to monitor, iterate, and automate quality checks, keeping data quality high without slowing down. Practical Tools to Sprint Ahead: Dive into hands-on tips with open-source tools that supercharge data validation and observability. Want More Detail?

Architecture

Architecture Raw Data High Quality Data Data Validation

Introducing Accelerator for Machine Learning (ML) Projects: Summarization with Gemini from Vertex AI

Cloudera

DECEMBER 9, 2024

The post Introducing Accelerator for Machine Learning (ML) Projects: Summarization with Gemini from Vertex AI appeared first on Cloudera Blog. Stay tuned for future AMPs we'll build using Cloudera AI and Vertex AI.

Machine Learning

Machine Learning Project Banking Accessible

Hybrid Cloud

WeCloudData

APRIL 18, 2025

This blog explores the current landscape of […] It has emerged as a pivotal strategy for organizations aiming to balance scalability, agility, and control. The hybrid cloud empowers businesses to optimize performance, enhance security, and drive innovation.

Cloud

Cloud Cloud Computing IT AWS

Data Engineering Weekly #209

Data Engineering Weekly

FEBRUARY 23, 2025

I found the blog to be a comprehensive roadmap for data engineering in 2025. The blog narrates the gateway's purpose: simplifying access, enabling experimentation, achieving cost-efficiency, and providing auditing and platformization benefits.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Data Engineering Weekly #196

Data Engineering Weekly

NOVEMBER 3, 2024

The blog emphasizes the importance of starting with a clear client focus to avoid over-engineering and ensure user-centric development. [link] Gunnar Morling: Revisiting the Outbox Pattern. The blog is an excellent summary of the path we crossed with the outbox pattern and the challenges ahead.

Data Engineering

Data Engineering Data Engineer Engineering Pipeline-centric

Delta Lake and restore - traveling in time differently

Waitingforcode

JANUARY 9, 2025

Time travel is quite a popular Delta Lake feature. But did you know it's not the only one you can use to interact with past versions? An alternative is the RESTORE command, and it'll be the topic of this blog post.

IT

File trigger in Databricks

Waitingforcode

MARCH 5, 2025

For over two years now, you have been able to leverage file triggers in Databricks Jobs to start processing as soon as a new file gets written to your storage. The feature looks amazing but hides some implementation challenges that we're going to see in this blog post.

Process

Overwriting partitioned tables in Apache Spark SQL

Waitingforcode

FEBRUARY 12, 2025

After publishing a release of my blog post about the insertInto trap, I got an intriguing question in the comments. The alternative to the insertInto, the saveAsTable method, doesn't work well on partitioned data in overwrite mode while the insertInto does.

SQL

SQL IT Data

Introducing Easier Change Data Capture in Apache Spark™ Structured Streaming

databricks

JANUARY 27, 2025

This blog describes the new change feed and snapshot capabilities in Apache Spark Structured Streaming's State Reader API. The State Reader API enables.

Data

Handling Network Throttling with AWS EC2 at Pinterest

Pinterest Engineering

APRIL 7, 2025

In this blog post, we'll discuss our experiences in identifying the challenges associated with EC2 network throttling. In the remainder of this blog post, we'll share how we root cause and mitigate the above issues. This prompted us to engage with AWS and dive deep into the network performance of our clusters. 4xl with up to 12.5

AWS

AWS Bytes Database Data Ingestion

NLP Project Life Cycle: A Case Study on Automated Resume Screening

WeCloudData

MARCH 20, 2025

In this blog, we will use a case study, Automated Resume Screening, to understand […]

Project

Project Media Technology Engineering

Data Engineering Weekly #212

Data Engineering Weekly

MARCH 16, 2025

The blog narrates how Apache Arrow offers better data serialization efficiency and avoids design pitfalls from the past. The blog stresses the need for granular, structured feedback, especially from experts, and outlines key considerations for evaluation design. years of manual effort!!!

Data Engineering

Data Engineering Data Engineer Engineering Amazon Web Services

Data Scientist Vs Data Analyst: Key Differences, Career Paths, and How to Choose the Right Role

WeCloudData

FEBRUARY 13, 2025

This blog focuses […] The world is becoming increasingly reliant on data, about 2.5

Bytes

Bytes BI Data Engineering

Top 10 Data Engineering Trends in 2025

Edureka

APRIL 22, 2025

This blog will explore the significant advancements, challenges, and opportunities impacting data engineering in 2025, highlighting the increasing importance for companies to stay updated. In 2025, this blog will discuss the most important data engineering trends, problems, and opportunities that companies should be aware of.

Data Engineering

Data Engineering Data Engineer Engineering Consulting
