Data and Systems - Data Engineering Digest

10 Essential Linux File System Commands for Data Management

KDnuggets

MAY 14, 2025

In this article, you'll master 10 essential Linux file system commands. This guide provides helpful examples to make working with files easier.

Systems

Systems Data Management Management Data

Introducing Agent Bricks: Auto-Optimized Agents Using Your Data

databricks

JUNE 11, 2025

Events Data + AI Summit Data + AI World Tour Data Intelligence Days Event Calendar Blog and Podcasts Databricks Blog Explore news, product announcements, and more Databricks Mosaic Research Blog Discover the latest in our Gen AI research Data Brew Podcast Let’s talk data! REGISTER Ready to get started?

Entertainment

Entertainment Manufacturing Retail Consulting

Data News — Week 25.02

Christophe Blefari

JANUARY 11, 2025

The Data News are here to stay, the format might vary during the year, but here we are for another year. We published videos about the Forward Data Conference, you can watch Hannes, DuckDB co-creator, keynote about Changing Large Tables. HNY 2025 ( credits ) Happy new year ✨ I wish you the best for 2025. Not really digest.

Data

Data Data Warehouse Programming Language Coding

Webinars

What’s New in Apache Airflow® 3.0—And How Will It Reshape Your Data Workflows?

MORE WEBINARS

How Meta discovers data flows via lineage at scale

Engineering at Meta

JANUARY 22, 2025

Data lineage is an instrumental part of Metas Privacy Aware Infrastructure (PAI) initiative, a suite of technologies that efficiently protect user privacy. It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Metas systems.

Data Warehouse

Data Warehouse SQL Programming Language Data

A Guide to Debugging Apache Airflow® DAGs

In Airflow, DAGs (your data pipelines) support nearly every use case. As these workflows grow in complexity and scale, efficiently identifying and resolving issues becomes a critical skill for every data engineer. This is a comprehensive guide with best practices and examples to debugging Airflow DAGs.

Data Pipeline

Agents of Change: Navigating 2025 with AI and Data Innovation

Data Engineering Weekly

DECEMBER 28, 2024

In this post, we delve into predictions for 2025, focusing on the transformative role of AI agents, workforce dynamics, and data platforms. Investment in an Agent Management System (AMS) is crucial, as it offers a framework for scaling, monitoring, and refining AI agents.

Unstructured Data

Unstructured Data Metadata Government Data

Data Integrity for AI: What’s Old is New Again

Precisely

JANUARY 9, 2025

Does the LLM capture all the relevant data and context required for it to deliver useful insights? Not to mention the crazy stories about Gen AI making up answers without the data to back it up!) Are we allowed to use all the data, or are there copyright or privacy concerns? But simply moving the data wasnt enough.

Data Integration

Data Integration Hadoop Data Warehouse Data Lake

Data Engineering Roadmap, Learning Path,& Career Track 2025

ProjectPro

JUNE 6, 2025

Data Engineering is gradually becoming a popular career option for young enthusiasts. That's why we've created a comprehensive data engineering roadmap for 2023 to guide you through the essential skills and tools needed to become a successful data engineer. Let's dive into ProjectPro's Data Engineer Roadmap!

Data Engineering

Data Engineering Data Engineer Engineering Amazon Web Services

What Is PDFMiner And Should You Use It – How To Extract Data From PDFs

Seattle Data Guy

JANUARY 18, 2025

Because they can preserve the visual layout of documents and are compatible with a wide range of devices and operating systems, PDFs are used for everything from business forms and educational material to creative designs.

IT

IT Education Data Designing

Improving the Accuracy of Generative AI Systems: A Structured Approach

Speaker: Anindo Banerjea, CTO at Civio & Tony Karrer, CTO at Aggregage

This can be especially difficult when working with a large data corpus, and as the complexity of the task increases. The number of use cases/corner cases that the system is expected to handle essentially explodes. When developing a Gen AI application, one of the most significant challenges is improving accuracy.

Systems

Vector Technologies for AI: Extending Your Existing Data Stack

Simon Späti

MARCH 28, 2025

The database landscape has reached 394 ranked systems across multiple categoriesrelational, document, key-value, graph, search engine, time series, and the rapidly emerging vector databases. As AI applications multiply quickly, vector technologies have become a frontier that data engineers must explore.

Technology

Technology PostgreSQL MySQL Database

How Meta understands data at scale

Engineering at Meta

APRIL 28, 2025

Managing and understanding large-scale data ecosystems is a significant challenge for many organizations, requiring innovative solutions to efficiently safeguard user data. Meta’s vast and diverse systems make it particularly challenging to comprehend its structure, meaning, and context at scale.

Metadata

Metadata Data Utilities Data Warehouse

Data Quality Testing: A Shared Resource for Modern Data Teams

DataKitchen

JUNE 6, 2025

Data Quality Testing: A Shared Resource for Modern Data Teams In today’s AI-driven landscape, where data is king, every role in the modern data and analytics ecosystem shares one fundamental responsibility: ensuring that incorrect data never reaches business customers. Each role touches data differently.

Data Ingestion

Data Ingestion Data Governance Data Government

Cloud Data Warehouse Migrations: Success Stories from WHOOP and Nexon

Snowflake

NOVEMBER 26, 2024

Many of our customers — from Marriott to AT&T — start their journey with the Snowflake AI Data Cloud by migrating their data warehousing workloads to the platform. Today we’re focusing on customers who migrated from a cloud data warehouse to Snowflake and some of the benefits they saw. million in cost savings annually.

Data Warehouse

Data Warehouse Cloud PostgreSQL Data Lake

LLMs in Production: Tooling, Process, and Team Structure

Speaker: Dr. Greg Loughnane and Chris Alexiuk

However, during development – and even more so once deployed to production – best practices for operating and improving generative AI applications are less understood.

Process

Mosaic AI Announcements at Data + AI Summit 2025

databricks

JUNE 11, 2025

Events Data + AI Summit Data + AI World Tour Data Intelligence Days Event Calendar Blog and Podcasts Databricks Blog Explore news, product announcements, and more Databricks Mosaic Research Blog Discover the latest in our Gen AI research Data Brew Podcast Let’s talk data! REGISTER Ready to get started?

Entertainment

Entertainment Manufacturing Retail Consulting

Modern Data Architecture: Data Mesh and Data Fabric 101

Precisely

OCTOBER 31, 2024

Key Takeaways: Data mesh is a decentralized approach to data management, designed to shift creation and ownership of data products to domain-specific teams. Data fabric is a unified approach to data management, creating a consistent way to manage, access, and share data across distributed environments.

Data Architecture

Data Architecture Architecture Metadata Government

Octopai Acquisition Enhances Metadata Management to Trust Data Across Entire Data Estate

Cloudera

NOVEMBER 13, 2024

We are excited to announce the acquisition of Octopai , a leading data lineage and catalog platform that provides data discovery and governance for enterprises to enhance their data-driven decision making. This dampens confidence in the data and hampers access, in turn impacting the speed to launch new AI and analytic projects.

Metadata

Metadata Management Data Governance Government

They Handle 500B Events Daily. Here’s Their Data Engineering Architecture.

Monte Carlo

NOVEMBER 12, 2024

A data engineering architecture is the structural framework that determines how data flows through an organization – from collection and storage to processing and analysis. It’s the big blueprint we data engineers follow in order to transform raw data into valuable insights. How Does Uber Know Where to Go?

Architecture

Architecture Data Engineering Data Engineer Engineering

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Speaker: Alex Salazar, CEO & Co-Founder @ Arcade | Nate Barbettini, Founding Engineer @ Arcade | Tony Karrer, Founder & CTO @ Aggregage

There’s a lot of noise surrounding the ability of AI agents to connect to your tools, systems and data. But building an AI application into a reliable, secure workflow agent isn’t as simple as plugging in an API.

Systems

Stop Overcomplicating Data Quality

Towards Data Science

DECEMBER 10, 2024

Three Zero-Cost Solutions That Take Hours, NotMonths A data quality certified pipeline. Source: unsplash.com In my career, data quality initiatives have usually meant big changes. Whats more, fixing the data quality issues this way often leads to new problems. Create a custom dashboard for your specific data qualityproblem.

PostgreSQL

PostgreSQL Data SQL Python

Introducing the dbt MCP Server – Bringing Structured Data to AI Workflows and Agents

dbt Developer Hub

APRIL 20, 2025

dbt is the standard for creating governed, trustworthy datasets on top of your structured data. We expect that over the coming years, structured data is going to become heavily integrated into AI workflows and that dbt will play a key role in building and provisioning this data. What is MCP? Why does this matter?

Structured Data

Structured Data SQL BI Metadata

Data logs: The latest evolution in Meta’s access tools

Engineering at Meta

FEBRUARY 4, 2025

Were sharing how Meta built support for data logs, which provide people with additional data about how they use our products. Here we explore initial system designs we considered, an overview of the current architecture, and some important principles Meta takes into account in making data accessible and easy to understand.

Accessibility

Accessibility Accessible Raw Data Data Warehouse

The Future of Data Management Is Agentic AI

Snowflake

APRIL 13, 2025

Managing and utilizing data effectively is crucial for organizational success in today's fast-paced technological landscape. The vast amounts of data generated daily require advanced tools for efficient management and analysis. Enter agentic AI, a type of artificial intelligence set to transform enterprise data management.

Data Management

Data Management Management Consulting Unstructured Data

Monetizing Analytics Features: Why Data Visualizations Will Never Be Enough

Think your customers will pay more for data visualizations in your application? Five years ago they may have. But today, dashboards and visualizations have become table stakes. Discover which features will differentiate your application and maximize the ROI of your embedded analytics. Brought to you by Logi Analytics.

Data

Make Your Company Data Driven with Redash

KDnuggets

MAY 27, 2025

Develop a data system that every business user wants to use.

Data

Data Systems

Understanding Change Data Capture (CDC) in MySQL and PostgreSQL: BinLog vs. WAL + Logical Decoding

Towards Data Science

JANUARY 7, 2025

How CDC tools use MySQL Binlog and PostgreSQL WAL with logical decoding for real-time data streaming Photo by Matoo.Studio on Unsplash CDC (Change Data Capture) is a term that has been gaining significant attention over the past few years. You might already be familiar with it (if not, dont worrytheres a quick introduction below ).

PostgreSQL

PostgreSQL MySQL Bytes Data Lake

The Race For Data Quality in a Medallion Architecture

DataKitchen

NOVEMBER 5, 2024

The Race For Data Quality In A Medallion Architecture The Medallion architecture pattern is gaining traction among data teams. It is a layered approach to managing and transforming data. By systematically moving data through these layers, the Medallion architecture enhances the data structure in a data lakehouse environment.

Architecture

Architecture Raw Data Pipeline-centric Data Ingestion

A Guide to the Six Types of Data Quality Dashboards

DataKitchen

NOVEMBER 27, 2024

A Guide to the Six Types of Data Quality Dashboards Poor-quality data can derail operations, misguide strategies, and erode the trust of both customers and stakeholders. Data quality dashboards have emerged as indispensable tools, offering a clear window into the health of their data and enabling targeted actionable improvements.

Banking

Banking Data Pharmaceutical Retail

Entity Resolution: Your Guide to Deciding Whether to Build It or Buy It

Adding high-quality entity resolution capabilities to enterprise applications, services, data fabrics or data pipelines can be daunting and expensive. This will help you decide whether to build an in-house entity resolution system or utilize an existing solution like the Senzing® API for entity resolution.

IT

AI and Data Predictions 2025: Strategies to Realize the Promise of AI

Snowflake

DECEMBER 4, 2024

Together with a dozen experts and leaders at Snowflake, I have done exactly that, and today we debut the result: the “ Snowflake Data + AI Predictions 2024 ” report. When you’re running a large language model, you need observability into how the model may change as it ingests new data. The next evolution in data is making it AI ready.

Unstructured Data

Unstructured Data Data Lake Deep Learning Metadata

Unlocking the Power of Geospatial Data for Insights

Snowflake

JANUARY 15, 2025

Over the last three geospatial-centric blog posts, weve covered the basics of what geospatial data is, how it works in the broader world of data and how it specifically works in Snowflake based on our native support for GEOGRAPHY , GEOMETRY and H3. But there is so much more you can do with geospatial data in your Snowflake account!

Transportation

Transportation BI Database-centric Metadata

Why You Need RAG to Stay Relevant as a Data Scientist

KDnuggets

JUNE 11, 2025

By Nate Rosidi , KDnuggets Market Trends & SQL Content Specialist on June 11, 2025 in Language Models Image by Author | Canva If you work in a data-related field, you should update yourself regularly. Data scientists use different tools for tasks like data visualization, data modeling, and even warehouse systems.

Data Science

Data Science Machine Learning Python SQL

Your Enterprise Data Needs an Agent

Snowflake

FEBRUARY 12, 2025

AI agents, autonomous systems that perform tasks using AI, can enhance business productivity by handling complex, multi-step operations in minutes. Agents need to access an organization's ever-growing structured and unstructured data to be effective and reliable.

Unstructured Data

Unstructured Data Government SQL Structured Data

Bridging the Gap: New Datasets Push Recommender Research Toward Real-World Scale

KDnuggets

JUNE 11, 2025

By KDnuggets on June 11, 2025 in Partners Sponsored Content Recommender systems rely on data, but access to truly representative data has long been a challenge for researchers. It joins a growing list of resources helping to close the research-to-production gap in recommender systems.

Datasets

Datasets Metadata Machine Learning Data Science

7 Best Data Warehousing Tools for Efficient Data Storage Needs

ProjectPro

JUNE 6, 2025

Data is often referred to as the new oil, and just like oil requires refining to become useful fuel, data also needs a similar transformation to unlock its true value. This transformation is where data warehousing tools come into play, acting as the refining process for your data. Why Choose a Data Warehousing Tool?

Data Storage

Data Storage PostgreSQL Data Warehouse AWS

Build Better Data Pipelines with SQL and Python in Snowflake

Snowflake

JUNE 10, 2025

Data transformations are the engine room of modern data operations — powering innovations in AI, analytics and applications. As the core building blocks of any effective data strategy, these transformations are crucial for constructing robust and scalable data pipelines. This puts data engineers in a critical position.

Data Pipeline

Data Pipeline SQL Python Building

Data Appending vs. Data Enrichment: How to Maximize Data Quality and Insights

Precisely

APRIL 7, 2025

We often use different terms when were talking about the same thing in this case, data appending vs. data enrichment. Ive noticed that “data appending” is more commonly used in industries like marketing and telecommunications, while data enrichment seems to be the preferred term in financial services and retail.

Retail

Retail Datasets Data Portfolio

How Apache Iceberg Is Changing the Face of Data Lakes

Snowflake

APRIL 2, 2025

Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.

Data Lake

Data Lake Cloud Storage Metadata Data Warehouse

Data Engineering Weekly #222

Data Engineering Weekly

JUNE 1, 2025

Dagster for MLOps: Deep Dive into AI Orchestration Learn what it really takes to run production-grade ML systems—without breaking your architecture or compliance efforts. Data Engineering Weekly recently published a reference architecture for a composable data architecture.

Data Engineering

Data Engineering Data Engineer Engineering Relational Database

15+ Exciting Python Flask Projects for Data Science Enthusiasts

ProjectPro

JUNE 6, 2025

Are you a data science enthusiast looking to enhance your Python Flask skills? Check out these exciting python flask projects that will help you apply your Flask knowledge to solve real-world data science challenges. Here is the list of the best Python Flask projects ideal for data experts. This is where Python Flask comes in.

Data Science

Data Science Python Project Google Cloud

Snowflake Ventures Invests in Anomalo for Advanced Data Quality

Snowflake

MARCH 12, 2025

In todays data-driven world, organizations depend on high-quality data to drive accurate analytics and machine learning models. But poor data quality gaps, inconsistencies and errors can undermine even the most sophisticated data and AI initiatives.

Unstructured Data

Unstructured Data High Quality Data Banking Machine Learning

10 Real World Data Science Case Studies Projects with Example

ProjectPro

JUNE 6, 2025

With wide applications in various sectors like healthcare , education, retail, transportation, media, and banking -data science applications are at the core of pretty much every industry out there. How do you prepare a data science case study? petabytes of data every hour! petabytes of data every hour!

Data Science

Data Science Project Food Pharmaceutical

How to Transition from ETL Developer to Data Engineer?

ProjectPro

JUNE 6, 2025

In the thought process of making a career transition from ETL developer to data engineer job roles? Read this blog to know how various data-specific roles, such as data engineer, data scientist, etc., differ from ETL developer and the additional skills you need to transition from ETL developer to data engineer job roles.

Data Engineering

Data Engineering Data Engineer Engineering ETL Tools

10 Essential Linux File System Commands for Data Management

Introducing Agent Bricks: Auto-Optimized Agents Using Your Data

Webinars

Trending Sources

Data News — Week 25.02

Webinars

How Meta discovers data flows via lineage at scale

A Guide to Debugging Apache Airflow® DAGs

Agents of Change: Navigating 2025 with AI and Data Innovation

Data Integrity for AI: What’s Old is New Again

Data Engineering Roadmap, Learning Path,& Career Track 2025

What Is PDFMiner And Should You Use It – How To Extract Data From PDFs

Improving the Accuracy of Generative AI Systems: A Structured Approach

Vector Technologies for AI: Extending Your Existing Data Stack

How Meta understands data at scale

Data Quality Testing: A Shared Resource for Modern Data Teams

Cloud Data Warehouse Migrations: Success Stories from WHOOP and Nexon

LLMs in Production: Tooling, Process, and Team Structure

Mosaic AI Announcements at Data + AI Summit 2025

Modern Data Architecture: Data Mesh and Data Fabric 101

Octopai Acquisition Enhances Metadata Management to Trust Data Across Entire Data Estate

They Handle 500B Events Daily. Here’s Their Data Engineering Architecture.

Agent Tooling: Connecting AI to Your Tools, Systems & Data

Stop Overcomplicating Data Quality

Introducing the dbt MCP Server – Bringing Structured Data to AI Workflows and Agents

Data logs: The latest evolution in Meta’s access tools

The Future of Data Management Is Agentic AI

Monetizing Analytics Features: Why Data Visualizations Will Never Be Enough

Make Your Company Data Driven with Redash

Understanding Change Data Capture (CDC) in MySQL and PostgreSQL: BinLog vs. WAL + Logical Decoding

The Race For Data Quality in a Medallion Architecture

A Guide to the Six Types of Data Quality Dashboards

Entity Resolution: Your Guide to Deciding Whether to Build It or Buy It

AI and Data Predictions 2025: Strategies to Realize the Promise of AI

Unlocking the Power of Geospatial Data for Insights

Why You Need RAG to Stay Relevant as a Data Scientist

Your Enterprise Data Needs an Agent

Bridging the Gap: New Datasets Push Recommender Research Toward Real-World Scale

7 Best Data Warehousing Tools for Efficient Data Storage Needs

Build Better Data Pipelines with SQL and Python in Snowflake

Data Appending vs. Data Enrichment: How to Maximize Data Quality and Insights

How Apache Iceberg Is Changing the Face of Data Lakes

Data Engineering Weekly #222

15+ Exciting Python Flask Projects for Data Science Enthusiasts

Snowflake Ventures Invests in Anomalo for Advanced Data Quality

10 Real World Data Science Case Studies Projects with Example

How to Transition from ETL Developer to Data Engineer?

Stay Connected