Top Data Engineering Digest Data Engineer Data Engineering Content for January, 2025

January, 2025

The Data Engineering Toolkit: Essential Tools for Your Machine

Simon Späti

JANUARY 22, 2025

To be proficient as a data engineer, you need to know various toolkits, from fundamental Linux commands to different virtual environments and optimizing efficiency as a data engineer. This article focuses on the building blocks of data engineering work, such as operating systems, development environments, and essential tools. We’ll start from the ground up, exploring crucial Linux commands, containerization with Docker, and the development environments that make modern data engineering possible.

Data Engineering

Data Engineering Data Engineer Engineering Programming Language

The Emerging Role of AI Data Engineers - The New Strategic Role for AI-Driven Success

Data Engineering Weekly

JANUARY 15, 2025

The Critical Role of AI Data Engineers in a Data-Driven World How does a chatbot seamlessly interpret your questions? How does a self-driving car understand a chaotic street scene? The answer lies in unstructured data processing—a field that powers modern artificial intelligence (AI) systems. Unlike neatly organized rows and columns in spreadsheets, unstructured data—such as text, images, videos, and audio—requires advanced processing techniques to derive meaningful insights.

Data Engineering

Data Engineering Data Engineer Unstructured Data Engineering

Webinars

What’s New in Apache Airflow® 3.0—And How Will It Reshape Your Data Workflows?

Trending Sources

How to ensure consistent metrics in your warehouse

Start Data Engineering

JANUARY 28, 2025

1. Introduction
2. Centralize Metric Definitions in Code
   Option A: Semantic Layer for On-the-Fly Queries
   Option B: Pre-Aggregated Tables for Consumers
3. Conclusion & Recap
4. Required Reading

1. Introduction

If you've worked on a data team, you've likely encountered situations where multiple teams define metrics in slightly different ways, leaving you to untangle why discrepancies exist.

Utilities

Utilities Coding Data

How Meta discovers data flows via lineage at scale

Engineering at Meta

JANUARY 22, 2025

Data lineage is an instrumental part of Meta's Privacy Aware Infrastructure (PAI) initiative, a suite of technologies that efficiently protect user privacy. It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta's systems. This allows us to verify that our users' everyday interactions are protected across our family of apps, such as their religious views in the Facebook Dating app, the example we'll walk through in this

Data Warehouse

Data Warehouse SQL Programming Language Data

A Guide to Debugging Apache Airflow® DAGs

In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide with best practices and examples for debugging Airflow DAGs. You’ll learn how to: Create a standardized process for debugging to quickly diagnose errors in your DAGs; Identify common issues with DAGs, tasks, and connections; Distinguish between Airflow-relate

Data Pipeline

Data Integrity for AI: What’s Old is New Again

Precisely

JANUARY 9, 2025

Artificial Intelligence (AI) is all the rage, and rightly so. By now most of us have experienced how Gen AI and the LLMs (large language models) that fuel it are primed to transform the way we create, research, collaborate, engage, and much more. Yet along with the AI hype and excitement come very appropriate sanity checks asking whether AI is ready for prime time.

Data Integration

Data Integration Data Warehouse Data Lake Hadoop

Are LLMs making StackOverflow irrelevant?

The Pragmatic Engineer

JANUARY 21, 2025

Hi, this is Gergely with a bonus issue of the Pragmatic Engineer Newsletter. In every issue, I cover topics related to Big Tech and startups through the lens of engineering managers and senior engineers. This article is one out of five sections from The Pulse #119. Full subscribers received this issue a week and a half ago. To get articles like this in your inbox, subscribe here.

Software Engineer

Software Engineer Software Engineering Engineering Coding

Data News — Week 25.02

Christophe Blefari

JANUARY 11, 2025

Happy new year ✨ I wish you the best for 2025. There are multiple ways to start a new year: with new projects, new ideas, new resolutions, or by just carrying on with the same music. I hope you will enjoy 2025. The Data News are here to stay; the format might vary during the year, but here we are for another year. Thank you so much for your support through the years.

Data

Data Data Warehouse Programming Language Java

More Trending

Event time skew and global watermark in Apache Spark Structured Streaming

Waitingforcode

JANUARY 15, 2025

A few months ago I wrote a blog post about event skew and how dangerous it is for a stateful streaming job. Since it was a high-level explanation, I didn't cover Apache Spark Structured Streaming deeply at that moment. Now the watermark topic is back to my learning backlog and it's a good opportunity to return to the event skew topic and see the dangers it brings for Structured Streaming stateful jobs.

Predictions 2025: AI As Cybersecurity Tool and Target

Snowflake

JANUARY 8, 2025

Though AI is (still) the hottest technology topic, it's not the overriding issue for enterprise security in 2025. Advanced AI will open up new attack vectors and also deliver new tools for protecting an organization's data. But the underlying challenge is the sheer quantity of data that overworked cybersecurity teams face as they try to answer basic questions such as, "Are we under attack?"

Data Lake

Data Lake Data Security Machine Learning Technology

Testing and Development for Databricks Environment and Code.

Confessions of a Data Guy

JANUARY 10, 2025

Every once in a great while, the question comes up: “How do I test my Databricks codebase?” It’s a fair question, and if you’re new to testing your code, it can seem a little overwhelming on the surface. However, I assure you the opposite is the case. Testing your Databricks codebase is no different than […]

Coding

Coding IT Data

Strobelight: A profiling service built on open source technology

Engineering at Meta

JANUARY 21, 2025

We're sharing details about Strobelight, Meta's profiling orchestrator. Strobelight combines several technologies, many open source, into a single service that helps engineers at Meta improve efficiency and utilization across our fleet. Using Strobelight, we've seen significant efficiency wins, including one that has resulted in an estimated 15,000 servers' worth of annual capacity savings.

Technology

Technology Metadata Utilities Engineering

What’s New in Apache Airflow® 3.0—And How Will It Reshape Your Data Workflows?

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

Data Workflow

What Is PDFMiner And Should You Use It – How To Extract Data From PDFs

Seattle Data Guy

JANUARY 18, 2025

PDF files are one of the most popular file formats today. Because they can preserve the visual layout of documents and are compatible with a wide range of devices and operating systems, PDFs are used for everything from business forms and educational material to creative designs. However, PDF files also present multiple challenges when it…

IT Education Data Designing

Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data

Towards Data Science

JANUARY 30, 2025

Building more efficient AI TL;DR: Data-centric AI can create more efficient and accurate models. I experimented with data pruning on MNIST to classify handwritten digits. Best runs for furthest-from-centroid selection compared to full dataset. Image by author. What if I told you that using just 50% of your training data could achieve better results than using the full dataset?

Database-centric

Database-centric Datasets Data Architecture

Enhancing Neural Network Training at Yelp: Achieving 1,400x Speedup with WideAndDeep

Yelp Engineering

JANUARY 21, 2025

At Yelp, we encountered challenges that prompted us to enhance the training time of our ad-revenue generating models, which use a Wide and Deep Neural Network architecture for predicting ad click-through rates (pCTR). These models handle large tabular datasets with small parameter spaces, requiring innovative data solutions. This blog post delves into our journey of optimizing training time using TensorFlow and Horovod, along with the development of ArrowStreamServer, our in-house library for lo

Datasets

Datasets Architecture Data Solutions Data

Unlocking the Power of Geospatial Data for Insights

Snowflake

JANUARY 15, 2025

Over the last three geospatial-centric blog posts, we've covered the basics of what geospatial data is, how it works in the broader world of data and how it specifically works in Snowflake based on our native support for GEOGRAPHY, GEOMETRY and H3. Those articles are great for dipping your toe in, getting a feel for the water and maybe even wading into the shallow end of the pool.

Transportation

Transportation BI Database-centric Metadata

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

Systems

Building a Fast, Light, and CHEAP Lake House with DuckDB, Delta Lake, and AWS Lambda

Confessions of a Data Guy

JANUARY 7, 2025

Building fun things is a real part of Data Engineering. Using your creative side when building a Lake House is possible, and using tools that are outside the normal box can sometimes be preferable. Check out this video where I dive into how I build just such a Lake House using Modern Data Stack tools like […]

AWS

AWS Building Data Engineering Data Engineer

ILA Evo: Meta’s journey to reimagine fiber optic in-line amplifier sites

Engineering at Meta

JANUARY 10, 2025

Today’s rapidly evolving landscape of use cases that demand highly performant and efficient network infrastructure is placing new emphasis on how in-line amplifiers (ILAs) are designed and deployed. Meta’s ILA Evo effort seeks to reimagine how an ILA site could be deployed to improve speed and cost while making a step function improvement in power efficiency.

Transportation

Transportation Manufacturing Consulting Designing

Modern Data Governance: Trends for 2025

Precisely

JANUARY 30, 2025

Key Takeaways: Prioritize metadata maturity as the foundation for scalable, impactful data governance. Recognize that artificial intelligence is a data governance accelerator and a process that must be governed to monitor ethical considerations and risk. Integrate data governance and data quality practices to create a seamless user experience and build trust in your data.

Data Governance

Data Governance Government Metadata Data

Title Launch Observability at Netflix Scale

Netflix Tech

JANUARY 6, 2025

Part 2: Navigating Ambiguity By: Varun Khaitan With special thanks to my stunning colleagues: Mallika Rao, Esmir Mesic, Hugo Marques Building on the foundation laid in Part 1, where we explored the "what" behind the challenges of title launch observability at Netflix, this post shifts focus to the "how." How do we ensure every title launches seamlessly and remains discoverable by the right audience?

Metadata

Metadata Algorithm Systems Engineering

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

Manufacturing

Understanding Change Data Capture (CDC) in MySQL and PostgreSQL: BinLog vs. WAL + Logical Decoding

Towards Data Science

JANUARY 7, 2025

How CDC tools use MySQL Binlog and PostgreSQL WAL with logical decoding for real-time data streaming. CDC (Change Data Capture) is a term that has been gaining significant attention over the past few years. You might already be familiar with it (if not, don't worry, there's a quick introduction below). One question that puzzled me, though, was how tools like the Debezium CDC connectors can read changes from MySQL and PostgreSQL databases.

PostgreSQL

PostgreSQL MySQL Bytes Data Lake

Anthropic’s Claude 3.5 Sonnet now available in Snowflake Cortex AI

Snowflake

JANUARY 9, 2025

Today, we are excited to announce the general availability of Claude 3.5 Sonnet as the first Anthropic foundation model available in Snowflake Cortex AI. Customers can now access the most intelligent model in the Claude model family from Anthropic using familiar SQL, Python and REST API (coming soon) interfaces, within the Snowflake security perimeter.

Unstructured Data

Unstructured Data Government SQL Python

AWS Lambda + DuckDB + Polars + Daft + Rust

Confessions of a Data Guy

JANUARY 30, 2025

When it comes to building modern Lake House architecture, we often get stuck in the past, doing the same old things time after time. We are human; we are lemmings; it’s just the trap we fall into. Usually, that pit we fall into is called Spark. Now, don’t get me wrong; I love Spark. We […]

AWS

AWS Architecture Building IT

Measuring productivity impact with Diff Authoring Time

Engineering at Meta

JANUARY 16, 2025

Do types actually make developers more productive? Or is it just more typing on the keyboard? To answer that question, we're revisiting Diff Authoring Time (DAT), how Meta measures how long it takes to submit changes to a codebase. DAT is just one of the ways we measure developer productivity, and this latest episode of the Meta Tech Podcast takes a look at two concrete use cases for DAT, including a type-safe mocking framework in Hack.

Engineering

Engineering IT

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Cloud

Mastering Multi-Cloud with Cloudera: Strategic Data & AI Deployments Across Clouds

Cloudera

JANUARY 7, 2025

In today's dynamic digital landscape, multi-cloud strategies have become vital for organizations aiming to leverage the best of both cloud and on-premises environments. As enterprises navigate complex data-driven transformations, hybrid and multi-cloud models offer unmatched flexibility and resilience. Here's a deep dive into why and how enterprises master multi-cloud deployments to enhance their data and AI initiatives.

Cloud

Cloud Government AWS Unstructured Data

Part 2: A Survey of Analytics Engineering Work at Netflix

Netflix Tech

JANUARY 2, 2025

This article is the second in a multi-part series sharing a breadth of Analytics Engineering work at Netflix, recently presented as part of our annual internal Analytics Engineering conference. Need to catch up? Check out Part 1. In this article, we highlight a few exciting analytic business applications, and in our final article we'll go into aspects of the technical craft.

Engineering

Engineering Entertainment Designing Technology

The Three Levels of SQL Comprehension: What they are and why you need to know about them

dbt Developer Hub

JANUARY 22, 2025

Ever since dbt Labs acquired SDF Labs last week, I've been head-down diving into their technology and making sense of it all. The main thing I knew going in was "SDF understands SQL". It's a nice pithy quote, but the specifics are fascinating. For the next era of Analytics Engineering to be as transformative as the last, dbt needs to move beyond being a string preprocessor and into fully comprehending SQL.

SQL

SQL Database Coding Technology

Digital Twin Tech for ADAS and Autonomous Vehicle Development

Snowflake

JANUARY 6, 2025

The incredible promise of the fully autonomous vehicle (AV) and more advanced driver assistance systems (ADAS) has been driving the automotive industry for the better part of the last decade. It has inspired original equipment manufacturers (OEMs) to innovate their systems, designs and development processes, using data to achieve unprecedented levels of automation.

Manufacturing

Manufacturing Cloud Computing Data Storage Technology

15 Modern Use Cases for Enterprise Business Intelligence

Large enterprises face unique challenges in optimizing their Business Intelligence (BI) output due to the sheer scale and complexity of their operations. Unlike smaller organizations, where basic BI features and simple dashboards might suffice, enterprises must manage vast amounts of data from diverse sources. What are the top modern BI use cases for enterprise businesses to help you get a leg up on the competition?

Business Intelligence

PySpark Data Quality on Databricks with DQX.

Confessions of a Data Guy

JANUARY 17, 2025

A Deep Dive into Databricks Labs’ DQX: The Data Quality Game Changer for PySpark DataFrames Recently, a LinkedIn announcement caught my eye, and honestly, it had me on the edge of my seat. Databricks Labs has unveiled DQX, a Python-based Data Quality framework explicitly designed for PySpark DataFrames. Finally, a Dedicated Data Quality Tool for PySpark […]

Data

Data Python Designing IT

Establishing a Large Scale Learned Retrieval System at Pinterest

Pinterest Engineering

JANUARY 31, 2025

Bowen Deng | Machine Learning Engineer, Homefeed Candidate Generation; Zhibo Fan | Machine Learning Engineer, Homefeed Candidate Generation; Dafang He | Machine Learning Engineer, Homefeed Relevance; Ying Huang | Machine Learning Engineer, Curation; Raymond Hsu | Engineering Manager, Homefeed CG Product Enablement; James Li | Engineering Manager, Homefeed Candidate Generation; Dylan Wang | Director, Homefeed Relevance; Jay Adams | Principal Engineer, Pinner Curation & Growth Introduction At P

Systems

Systems Metadata Machine Learning Architecture

Predictive Models Are Nothing Without Trust

Cloudera

JANUARY 7, 2025

Airports are an interconnected system where one unforeseen event can tip the scale into chaos. For a smaller airport in Canada, data has grown to be its North Star in an industry full of surprises. In order for data to bring true value to operations, and ultimately customer experiences, those data insights must be grounded in trust. Ryan Garnett, Senior Manager Business Solutions of Halifax International Airport Authority, joined The AI Forecast to share how the airport revamped its approach to data

Finance

Finance Building Cloud Engineering

The Future of Data Lakehouses: A Fireside Chat with Vinoth Chandar - Founder CEO Onehouse & PMC Chair of Apache Hudi

Data Engineering Weekly

JANUARY 8, 2025

What if your data lake could do more than just store information—what if it could think like a database? As data lakehouses evolve, they transform how enterprises manage, store, and analyze their data. To explore this future, I recently sat down with Vinoth Chandar, founder of Onehouse and creator of Apache Hudi, for a fireside chat about the trends shaping the data landscape.

Data Lake

Data Lake Retail Data Ingestion Datasets

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: Understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to; Write DAGs that adapt to your data at runtime and set up alerts and notifications; Scale you

Data Engineering
