Data Engineering Digest

Trending Articles

Universal Data Orchestrator in Action: Enterprise Best Practices

Simon Späti

JUNE 17, 2025

Moving from orchestration theory to the enterprise level is a real challenge. How do you handle secrets across environments? Where does your business logic actually live? How do you make pipelines that work for both your senior engineers and the analysts who need to modify them? In Part 1, The Heartbeat of Data Engineering, we discussed the convergent orchestrator combining orchestration as code and no-code.

Builder.ai did not “fake AI with 700 engineers”

The Pragmatic Engineer

JUNE 12, 2025

Originally published in The Pragmatic Engineer Newsletter. An eye-catching detail widely reported by media and on social media about the bankrupt business Builder.ai last week, was that the company faked AI with 700 engineers in India: “Microsoft-backed AI startup chatbots revealed to be human employees” – Mashable “Builder.ai used 700 engineers in India for coding work it marketed as AI-powered” – MSN “Builder.ai faked AI with 700 engineers, now

Engineering

Engineering Media Coding Systems

Webinars

What’s New in Apache Airflow® 3.0—And How Will It Reshape Your Data Workflows?

Trending Sources

Introducing Agent Bricks: Auto-Optimized Agents Using Your Data

databricks

JUNE 11, 2025

Entertainment

Entertainment Manufacturing Consulting Professional Services

Using Joins and Group Bys the right way for data warehousing

Start Data Engineering

JUNE 10, 2025

1. Introduction
2. Joins & Group bys are two of the most commonly used operations in data warehousing
2.1. Joins are used to create denormalized dimension tables & to enrich fact tables with dimensions for reporting
2.1.1. When to use joins
2.1.2. How to use joins
2.1.3. Things to watch out for when joining
2.2. Group bys are the cornerstone of reporting
2.

Data

A Guide to Debugging Apache Airflow® DAGs

In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide to debugging Airflow DAGs, with best practices and examples. You’ll learn how to: create a standardized process for debugging to quickly diagnose errors in your DAGs; identify common issues with DAGs, tasks, and connections; distinguish between Airflow-relate

Data Pipeline

AI Agents in Analytics Workflows: Too Early or Already Behind?

KDnuggets

JUNE 13, 2025

A look at how AI agents are reshaping the data analytics workflow and whether you’re ahead or behind the curve.

Data Science

Data Science Datasets SQL Python

The Open Lakehouse Stack: DuckDB and the Rise of Table Formats

Simon Späti

JUNE 16, 2025

Wouldn’t it be great to build a data warehouse on top of affordable storage and scattered files? SSDs and fast storage are expensive, but storing data in a data lake on S3 or R2 is significantly cheaper, allowing you to save a greater amount of essential data. However, the downside is that it quickly becomes messy or unorganized, lacking clear governance and rules.

Lateral column aliases in Apache Spark SQL

Waitingforcode

JUNE 13, 2025

It's the second blog post about laterals in Apache Spark SQL. Previously you discovered how to combine queries with lateral subquery and lateral views. Now it's time to see a more local feature, lateral column aliases.

SQL

SQL IT

More Trending

Announcing Lakeflow Designer: No-Code ETL, Powered by the Databricks Intelligence Platform

databricks

JUNE 12, 2025

We’re excited to announce Lakeflow Designer, an AI-powered, no-code pipeline builder that is fully integrated with the Databricks Data Intelligence Platform.

Designing

Designing Coding Data

Apache Iceberg v3 Table Spec: Celebrating the OSS Community’s Shared Success

Snowflake

JUNE 10, 2025

The Apache Iceberg™ project exemplifies the spirit of open source and shows what’s possible when a community comes together with a common goal: to drive a technology forward. With a mission to bring reliability, performance and openness to large-scale analytics, the Iceberg project continues to evolve and offer many benefits thanks to the diverse voices and efforts of its contributors.

Metadata

Metadata Software Engineering Software Engineer Technology

Bridging the Gap: New Datasets Push Recommender Research Toward Real-World Scale

KDnuggets

JUNE 11, 2025

Publicly available datasets in recommender research currently shaping the field.

Datasets

Datasets Metadata Machine Learning Data Science

From AI Chaos to Control: A Flexible Data Integrity Ecosystem

Precisely

JUNE 10, 2025

If you’re leading any kind of AI initiative right now, you already know the opportunities are vast – but so is the complexity. Between widespread generative AI adoption, a wide variety of LLM options, and compelling visions of agentic AI-fueled automation, the pace of innovation is extraordinary. But the fact is this: we won’t get the most from our AI initiatives unless we have full control: control over the technologies we use, how we use them, where, and – most importantly – the data tha

Data Integration

Data Integration Government Google Cloud Data

What’s New in Apache Airflow® 3.0—And How Will It Reshape Your Data Workflows?

Speaker: Tamara Fingerlin, Developer Advocate

Apache Airflow® 3.0, the most anticipated Airflow release yet, officially launched this April. As the de facto standard for data orchestration, Airflow is trusted by over 77,000 organizations to power everything from advanced analytics to production AI and MLOps. With the 3.0 release, the top-requested features from the community were delivered, including a revamped UI for easier navigation, stronger security, and greater flexibility to run tasks anywhere at any time.

Data Workflow

Manage geodatabase upgrades in a service-based architecture

ArcGIS

JUNE 13, 2025

Learn how to manage enterprise geodatabase upgrades in ArcGIS service-based architectures. Understand when upgrades are needed, which client to use, and how to apply them using ArcGIS Pro or ArcGIS Enterprise.

Architecture

Architecture Management Data Management Data

Announcing Lakebase Public Preview

databricks

JUNE 11, 2025

At the Data and AI Summit, we introduced a new category of operational databases called lakebases for building intelligent applications.

Database

Database Building Data

Build Better Data Pipelines with SQL and Python in Snowflake

Snowflake

JUNE 10, 2025

Data transformations are the engine room of modern data operations — powering innovations in AI, analytics and applications. As the core building blocks of any effective data strategy, these transformations are crucial for constructing robust and scalable data pipelines. Today, we're excited to announce the latest product advancements in Snowflake to build and orchestrate data pipelines.

Data Pipeline

Data Pipeline SQL Python Building

Integrating DuckDB & Python: An Analytics Guide

KDnuggets

JUNE 10, 2025

Learn how to run lightning-fast SQL queries on local files with ease.

Python

Python SQL Data Science Machine Learning

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API. As an engineering leader, it can be challenging to make sense of this evolving landscape, but agent tooling provides such high value that it’s critical we figure out how to move forward.

Systems

Accelerate AI and Analytics with these 4 New Enhancements in the Precisely Data Integrity Suite

Precisely

JUNE 10, 2025

Key takeaways: New Data Integrity Suite innovations include AI-powered data quality, and new data observability, lineage, location intelligence, and enrichment capabilities. These enhancements help you scale data quality for AI, boost visibility across hybrid data environments, and embed trusted location data into critical workflows. The Suite ensures you’re able to reduce risk, drive innovation, and maintain a competitive edge.

Data Integration

Data Integration Metadata Data Data Management

Unlocking Efficient Ad Retrieval: Offline Approximate Nearest Neighbors in Pinterest Ads

Pinterest Engineering

JUNE 12, 2025

Authors (non-ordered): Qishan (Shanna) Zhu, Chen Hu. Acknowledgements: Longyu Zhao, Jacob Gao, Quannan Li, Dinesh Govindaraj. Introduction: In the evolving landscape of advertising, the demand for real-time personalization and dynamic ad delivery has made Online Approximate Nearest Neighbors (ANN) a mainstream method for ad retrieval. Pinterest primarily employs online ANN to swiftly adapt to users’ behavior changes (depending on their age, location and privacy settings), thereby enhancing ad respon

Architecture

Architecture Algorithm Utilities Data Storage

Introducing Databricks One

databricks

JUNE 12, 2025

BI Entertainment Manufacturing Consulting

Snowflake Achieves Prestigious ISO/IEC 42001 Certification, Demonstrating Commitment to Responsible AI Practices

Snowflake

JUNE 12, 2025

As a leader in AI and data, Snowflake is dedicated to ensuring that our artificial intelligence practices are not only effective but also ethical, responsible and transparent. That's why we're proud to announce that we've been awarded the ISO/IEC 42001 certification. This prestigious international standard recognizes our commitment to establishing, implementing, maintaining and continually improving a structured framework that helps organizations responsibly and effectively manage the devel

Certification

Certification Unstructured Data Government SQL

How to Modernize Manufacturing Without Losing Control

Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives

Manufacturing is evolving, and the right technology can empower—not replace—your workforce. Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. But how do you implement these tools with confidence and ensure they complement human expertise rather than override it? Join industry expert Andrew Skoog as he explores how manufacturers can leverage automation to enhance operations, streamline workflows, and make smarter, data-dri

Manufacturing

Building a Custom PDF Parser with PyPDF and LangChain

KDnuggets

JUNE 12, 2025

PDFs look simple, until you try to parse one.

Building

Building Metadata Raw Data Data Science

Data Engineering Weekly #224

Data Engineering Weekly

JUNE 15, 2025

The Data Platform Fundamentals Guide: Learn the fundamental concepts to build a data platform in your organization.
- Tips and tricks for data modeling and data ingestion patterns
- Explore the benefits of an observation layer across your data pipelines
- Learn the key strategies for ensuring data quality for your organization

Jorge García Herrero: “Localhost tracking” explained.

Data Engineering

Data Engineering Data Engineer Pipeline-centric Engineering

Data Observability vs. Monitoring: What’s the Difference, Really?

Monte Carlo

JUNE 10, 2025

Data engineering is full of buzzwords—data mesh, reverse ETL, lakehouse, you name it. It’s easy to tune them out. So when someone drops “data observability,” it’s fair to ask: what’s data observability vs. monitoring? If you’ve ever wrestled with broken dashboards, missing data, or a pipeline that quietly failed overnight, you know how frustrating it is to figure out what went wrong.

Data

Data Engineering Data Engineering Data Engineer

Introducing Databricks Free Edition

databricks

JUNE 11, 2025

Today, we are excited to announce the availability of Databricks Free Edition, a product for learning and exploring the latest data and AI technologies for free.

Technology

Technology Data

Optimizing The Modern Developer Experience with Coder

Many software teams have migrated their testing and production workloads to the cloud, yet development environments often remain tied to outdated local setups, limiting efficiency and growth. This is where Coder comes in. In our 101 Coder webinar, you’ll explore how cloud-based development environments can unlock new levels of productivity. Discover how to transition from local setups to a secure, cloud-powered ecosystem with ease.

Cloud

Snowflake Postgres: Built for Developers, Ready for the Enterprise

Snowflake

JUNE 10, 2025

PostgreSQL has become the undisputed choice for developers worldwide, celebrated for its open source flexibility, vibrant ecosystem and growing AI capabilities like vector support. But as companies race to build the next generation of AI agents and scale their critical operational systems, a fundamental question emerges: Is your Postgres truly ready for the enterprise, or does it come with hidden compromises?

PostgreSQL

PostgreSQL Government Unstructured Data Database Design

Automating GitHub Workflows with Claude 4

KDnuggets

JUNE 13, 2025

Learn how to set up the Claude App in your GitHub repository and invoke it directly through comments.

Telecommunication

Telecommunication Data Science Machine Learning Python

Meet Muze: ThoughtSpot's native visualization engine

ThoughtSpot

JUNE 16, 2025

Business intelligence platforms analyze vast amounts of data, requiring visualization engines that balance performance, flexibility, and ease of use. Traditional charting libraries treat each chart type as a distinct entity, requiring separate logic and code for each. This approach leads to code duplication, limited reusability, and reduced maintainability.

Engineering

Engineering Datasets Architecture Data Ingestion

Monte Carlo Expands Databricks Partnership with Support for AI/BI and Unity Catalog

Monte Carlo

JUNE 10, 2025

Monte Carlo, the leader in data + AI observability, today announced extended support for the Databricks Data Intelligence Platform through new integrations with Databricks AI/BI and Unity Catalog Metrics. These enhancements, unveiled ahead of the Databricks Data + AI Summit 2025, represent a major milestone in enabling AI-ready data at scale for joint customers of Databricks and Monte Carlo.

BI Raw Data Business Intelligence Data Pipeline

The Ultimate Guide to Apache Airflow DAGS

With Airflow being the open-source standard for workflow orchestration, knowing how to write Airflow DAGs has become an essential skill for every data engineer. This eBook provides a comprehensive overview of DAG writing features with plenty of example code. You’ll learn how to: understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to; write DAGs that adapt to your data at runtime and set up alerts and notifications; scale you

Data Engineering

What’s new with Databricks Unity Catalog at Data + AI Summit 2025

databricks

JUNE 12, 2025

Four years ago, Databricks saw tremendous complexity in the data landscape: separate catalogs for each platform, siloed governance tools across clouds, and no unified way

Government

Government Cloud Data

Diskover, Backed by Snowflake Ventures, Empowers Enterprises with Full Visibility into Their Legacy Data Estates

Snowflake

JUNE 17, 2025

A successful AI strategy requires a solid data foundation, yet a striking number of data and AI leaders are feeling unprepared. According to a survey of executives, a quarter described their data foundations as “somewhat unready” to “very unready” to support generative AI applications, and more than half admit they are only “somewhat ready.” Compounding this challenge, enterprises are grappling with petabytes of data trapped on legacy storage devices.

How to Learn Math for Data Science: A Roadmap for Beginners

KDnuggets

JUNE 12, 2025

Confused about where to start with data science math?

Data Science

Data Science Machine Learning Algorithm Datasets

Adding Eyes to Picnic’s Automated Warehouses

Picnic Engineering

JUNE 11, 2025

How computer vision can spot problems long before a customer notices In Picnic’s fully-automated fulfilment centre in Utrecht thousands of totes move over more than 50 kilometres of conveyor belts every single day. Our in-house control software decides where every tote should go and when. What that software cannot do today is look inside the moving boxes.

Cloud

Cloud Metadata AWS Algorithm

Apache Airflow® Best Practices: DAG Writing

Speaker: Tamara Fingerlin, Developer Advocate

In this new webinar, Tamara Fingerlin, Developer Advocate, will walk you through many Airflow best practices and advanced features that can help you make your pipelines more manageable, adaptive, and robust. She'll focus on how to write best-in-class Airflow DAGs using the latest Airflow features like dynamic task mapping and data-driven scheduling!

Data
