We are excited to announce the acquisition of Octopai, a leading data lineage and catalog platform that provides data discovery and governance for enterprises to enhance their data-driven decision making. This dampens confidence in the data and hampers access, in turn impacting the speed to launch new AI and analytic projects.
Metadata is the data that provides context about your data: more than what you see in the rows and columns. By managing your metadata, you're effectively creating an encyclopedia of your data assets.
Summary: Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. To level up its value, a new trend of active metadata is being adopted, enabling use cases like keeping BI reports up to date, auto-scaling your warehouses, and automating data governance.
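As a concrete illustration of the "active metadata" idea, here is a minimal sketch in which a freshness rule over metadata triggers a downstream action; the store, table names, and SLA are all hypothetical, not taken from the article:

```python
import datetime

# Hypothetical freshness records, as an active-metadata store might expose them.
table_metadata = {
    "sales_daily": {"last_updated": datetime.datetime(2024, 1, 1, 6, 0)},
    "bi_dashboard_source": {"last_updated": datetime.datetime(2024, 1, 3, 6, 0)},
}

FRESHNESS_SLA = datetime.timedelta(hours=24)

def check_freshness(now: datetime.datetime) -> list[str]:
    """Return tables whose metadata says they are stale, so a downstream
    action (refresh a BI extract, page an owner) can be triggered."""
    stale = []
    for table, meta in table_metadata.items():
        if now - meta["last_updated"] > FRESHNESS_SLA:
            stale.append(table)
    return stale

if __name__ == "__main__":
    for table in check_freshness(datetime.datetime(2024, 1, 4, 6, 0)):
        # In an active-metadata setup this print would be a real API call.
        print(f"stale table, triggering refresh: {table}")
```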
Data Pipeline Logging Best Practices
1. Introduction
2. Setup & logging architecture
3. Data pipeline logging best practices
3.1. Metadata: information about pipeline runs & data flowing through your pipeline
3.2. Obtain visibility into the code's execution sequence using text logs
3.3. Understand resource usage by tracking metrics
3.4.
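To make the outline concrete, here is a minimal sketch of two of the ideas the headings name, run metadata and metrics, using only the Python standard library; the field names and toy transform are illustrative, not from the article:

```python
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("pipeline")

def run_pipeline(rows: list[dict]) -> None:
    run_id = uuid.uuid4().hex  # metadata: uniquely identifies this pipeline run
    logger.info("run started run_id=%s input_rows=%d", run_id, len(rows))

    start = time.monotonic()
    transformed = [r for r in rows if r.get("amount", 0) > 0]  # toy transform
    elapsed = time.monotonic() - start

    # Metrics: throughput and timing numbers logged alongside the text logs.
    logger.info(
        "run finished run_id=%s output_rows=%d duration_s=%.4f",
        run_id, len(transformed), elapsed,
    )

run_pipeline([{"amount": 5}, {"amount": -1}, {"amount": 2}])
```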
Key Takeaways: Data mesh is a decentralized approach to data management, designed to shift creation and ownership of data products to domain-specific teams. Data fabric is a unified approach to data management, creating a consistent way to manage, access, and share data across distributed environments.
Data storage has been evolving, from databases to data warehouses and expansive data lakes, with each architecture responding to different business and data needs. Traditional databases excelled at structured data and transactional workloads but struggled with performance at scale as data volumes grew.
In this post, we delve into predictions for 2025, focusing on the transformative role of AI agents, workforce dynamics, and data platforms. For professionals across domains—data engineers, AI engineers, and data scientists—the message is clear: adapt or become obsolete.
Key Takeaways: Prioritize metadata maturity as the foundation for scalable, impactful data governance. Recognize that artificial intelligence is both a data governance accelerator and a process that must itself be governed to manage ethical considerations and risk.
Saying mainly that "Sora is a tool to extend creativity." Last point: Mira has been mocked and criticised online because, as a CTO, she wasn't able to say which public or licensed data Sora has been trained on. This is related to Paris testing automated video surveillance during the Olympics. This is Croissant.
Managing and understanding large-scale data ecosystems is a significant challenge for many organizations, requiring innovative solutions to efficiently safeguard user data. To address these challenges, we made substantial investments in advanced data understanding technologies, as part of our Privacy Aware Infrastructure (PAI).
Storing data: collected data is stored to allow for historical comparisons. Benchmarking: new server types identified – or ones whose benchmark needs refreshing to avoid stale data – have a benchmark started on them. Each benchmarking task is evaluated sequentially.
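A minimal sketch of that sequential flow, with a hypothetical staleness rule and made-up server types standing in for the real system:

```python
import time

BENCHMARK_TTL_S = 7 * 24 * 3600  # hypothetical: re-run after a week

# Stored results allow historical comparison; missing key means never benchmarked.
results: dict[str, dict] = {
    "m5.large": {"score": 1240.0, "ts": time.time() - 10 * 24 * 3600},  # stale
    "c6g.xlarge": {"score": 2100.0, "ts": time.time()},                 # fresh
}

def needs_benchmark(server_type: str) -> bool:
    entry = results.get(server_type)
    return entry is None or time.time() - entry["ts"] > BENCHMARK_TTL_S

def run_benchmark(server_type: str) -> float:
    # Placeholder for a real workload run on an instance of this type.
    return 1000.0

def refresh(server_types: list[str]) -> None:
    # Each benchmarking task is evaluated sequentially, as described above.
    for st in server_types:
        if needs_benchmark(st):
            results[st] = {"score": run_benchmark(st), "ts": time.time()}

refresh(["m5.large", "c6g.xlarge", "r7i.2xlarge"])
print(sorted(results))
```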
Data lineage is an instrumental part of Meta's Privacy Aware Infrastructure (PAI) initiative, a suite of technologies that efficiently protect user privacy. It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta's systems.
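One way to picture what lineage enables: given a graph of data flows, you can discover every downstream asset a given table feeds. A minimal sketch over a made-up graph (not Meta's actual implementation):

```python
from collections import deque

# Hypothetical lineage edges: asset -> assets it feeds.
flows = {
    "raw.users": ["clean.users"],
    "clean.users": ["metrics.dau", "ml.features"],
    "ml.features": ["ml.model_v1"],
}

def downstream(asset: str) -> set[str]:
    """Breadth-first discovery of everything derived from `asset`,
    the kind of query privacy controls need to answer at scale."""
    seen: set[str] = set()
    queue = deque([asset])
    while queue:
        for nxt in flows.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(downstream("raw.users"))
# {'clean.users', 'metrics.dau', 'ml.features', 'ml.model_v1'}
```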
Furthermore, it was difficult to transfer innovations from one model to another, given that most are independently trained despite using common data sources. Key insights from this shift include: A Data-Centric Approach: shifting focus from model-centric strategies, which heavily rely on feature engineering, to a data-centric one.
Over the past several years, data leaders have asked many questions about where they should keep their data and what architecture they should implement to serve an incredible breadth of analytic use cases. The future for most data teams will be multi-cloud and hybrid. It no longer matters where the data is.
dbt is the standard for creating governed, trustworthy datasets on top of your structured data. We expect that over the coming years, structured data is going to become heavily integrated into AI workflows and that dbt will play a key role in building and provisioning this data. What is MCP? Why does this matter?
Together with a dozen experts and leaders at Snowflake, I have done exactly that, and today we debut the result: the "Snowflake Data + AI Predictions 2024" report. When you're running a large language model, you need observability into how the model may change as it ingests new data. The next evolution in data is making it AI ready.
Before Hoptimator, Pinot ingestion often required data producers to create and manage separate, Pinot-specific preprocessing jobs to optimize data, such as re-keying, filtering, and pre-aggregating. Hoptimator reduces user friction, operator toil, and resource consumption on Pinot servers, while automating pipeline management.
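To illustrate the three preprocessing steps named above (re-keying, filtering, pre-aggregating), here is a toy sketch over plain Python records; the real jobs would run in a stream processor, and these field names are invented:

```python
from collections import defaultdict

events = [
    {"user": "a", "region": "eu", "clicks": 2},
    {"user": "b", "region": "us", "clicks": 0},
    {"user": "c", "region": "eu", "clicks": 3},
]

# Filtering: drop records the target table doesn't need.
kept = [e for e in events if e["clicks"] > 0]

# Re-keying: key by the column the target table partitions on.
by_region = [(e["region"], e) for e in kept]

# Pre-aggregating: collapse to one row per key before ingestion.
agg: dict[str, int] = defaultdict(int)
for region, e in by_region:
    agg[region] += e["clicks"]

print(dict(agg))  # {'eu': 5}
```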
Editor's Note: Launching Data & Gen-AI courses in 2025. I can't believe DEW will soon reach its 200th edition. What started as a fun hobby has become one of the top-rated newsletters in the data engineering industry. We are planning many exciting product lines to trial and launch in 2025.
Snowflake Cortex AI now features native multimodal AI capabilities, eliminating data silos and the need for separate, expensive tools. This major enhancement brings the power to analyze images and other unstructured data directly into Snowflake's query engine, using familiar SQL at scale.
Key Takeaways: Data integrity is required for AI initiatives, better decision-making, and more – but data trust is on the decline. Data quality and data governance are the top data integrity challenges and priorities. The panelists shared their thoughts: data ecosystem complexity is increasing.
Three Zero-Cost Solutions That Take Hours, Not Months. In my career, data quality initiatives have usually meant big changes. What's more, fixing data quality issues this way often leads to new problems. Create a custom dashboard for your specific data quality problem.
Different teams love using the same data in totally different ways. That's where data dictionary tools come in. A data dictionary tool helps define and organize your data so everyone's speaking the same language.
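A minimal sketch of the idea, with invented field definitions; the point is a single shared source of meaning that every team reads from:

```python
# Hypothetical entries; a real tool adds ownership, lineage, and search.
data_dictionary = {
    "customer_id": {
        "type": "string",
        "description": "Stable identifier for a customer account.",
        "owner": "crm-team",
    },
    "mrr": {
        "type": "decimal",
        "description": "Monthly recurring revenue in USD at month end.",
        "owner": "finance",
    },
}

def describe(field: str) -> str:
    entry = data_dictionary.get(field)
    if entry is None:
        return f"{field}: not defined, add it before shipping a report"
    return f"{field} ({entry['type']}, owner={entry['owner']}): {entry['description']}"

print(describe("mrr"))
print(describe("churn_rate"))
```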
In August, we wrote about how in a future where distributed data architectures are inevitable, unifying and managing operational and business metadata is critical to successfully maximizing the value of data, analytics, and AI.
dbt Core is an open-source framework that helps you organise data warehouse SQL transformations. dbt was born out of the observation that more and more companies were switching from on-premise Hadoop data infrastructure to cloud data warehouses. This switch has been led by the modern data stack vision. Enter ELT.
The modern data stack constantly evolves, with new technologies promising to solve age-old problems like scalability, cost, and data silos. It promised to address key pain points: Scaling: Handling ever-increasing data volumes. Speed: Accelerating data insights. Data Silos: Breaking down barriers between data sources.
Summary: Stripe is a company that relies on data to power their products and business. In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform.
Key Takeaways: New AI-powered innovations in the Precisely Data Integrity Suite help you boost efficiency, maximize the ROI of data investments, and make confident, data-driven decisions. These enhancements improve data accessibility, enable business-friendly governance, and automate manual processes.
The AI Forecast: Data and AI in the Cloud Era, sponsored by Cloudera, aims to take an objective look at the impact of AI on business, industry, and the world at large. AI is only as successful as the data behind it. LLM precision is good, not great, right now. Paul: I wanted to chat about this notion of precision data with you.
When you combine talented engineers with rich performance data, you can get efficiency wins both by creating tooling to identify issues before they reach production and by finding opportunities in already-running code. Meta's existing tools can identify the issue and query Strobelight data to estimate the impact on compute cost.
Announcing DataOps Data Quality TestGen 3.0: Open-Source, Generative Data Quality Software. It assesses your data, deploys production testing, monitors progress, and helps you build a constituency within your company for lasting change. Imagine an open-source tool that's free to download and requires minimal time and effort.
In an effort to better understand where data governance is heading, we spoke with top executives from IT, healthcare, and finance to hear their thoughts on the biggest trends, key challenges, and the insights they would share. With that, let's get into the governance trends for data leaders!
A Guest Post by Ole Olesen-Bagneux. In this blog post I would like to describe a new data team that I call 'the data discovery team'. Data discovery is thought of in different ways in data science and in information science, respectively. In an enterprise data reality, searching for data is a bit of a hassle.
Summary: Building data products is an undertaking that has historically required substantial investments of time and talent. With the rise in cloud platforms and self-serve data technologies the barrier of entry is dropping. Atlan is the metadata hub for your data ecosystem.
In that time there have been a number of generational shifts in how data engineering is done. Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today? Materialize: Looking for the simplest way to get the freshest data possible to your teams?
We're sharing how Meta built support for data logs, which provide people with additional data about how they use our products. Here we explore initial system designs we considered, an overview of the current architecture, and some important principles Meta takes into account in making data accessible and easy to understand.
Summary: A lot of the work that goes into data engineering is trying to make sense of the "data exhaust" from other applications and services. Atlan is the metadata hub for your data ecosystem. Data engineers don't enjoy writing, maintaining, and modifying ETL pipelines all day, every day.
As we approach 2025, data teams find themselves at a pivotal juncture. The rapid evolution of technology and the increasing demand for data-driven insights have placed immense pressure on these teams. The future of data teams depends on their ability to adapt to new challenges and seize emerging opportunities.
by Jasmine Omeke, Obi-Ike Nwoke, and Olek Gorajek. Intro: This post is for all data practitioners who are interested in learning about bootstrapping, standardization, and automation of batch data pipelines at Netflix. You may remember Dataflow from the post we wrote last year titled Data pipeline asset management with Dataflow.
Agents need to access an organization's ever-growing structured and unstructured data to be effective and reliable. As data connections expand, managing access controls and efficiently retrieving accurate information while maintaining strict privacy protocols becomes increasingly complex, across both unstructured (e.g., text, audio) and structured data.
The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms.
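As one concrete example of the OTF idea, Apache Iceberg tables can be read through pyiceberg; a minimal sketch, assuming a configured catalog named "default" and an existing table "analytics.events" (both hypothetical):

```python
# pip install "pyiceberg[pyarrow]"
from pyiceberg.catalog import load_catalog

# Assumes a catalog named "default" is configured (e.g., in ~/.pyiceberg.yaml).
catalog = load_catalog("default")

# Hypothetical table; any engine that speaks Iceberg sees the same data.
table = catalog.load_table("analytics.events")

# Scan to Arrow: the table format, not the engine, defines the layout,
# which is what gives OTFs their cross-platform compatibility.
arrow_table = table.scan(limit=10).to_arrow()
print(arrow_table.num_rows)
```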
In the realm of modern analytics platforms, where rapid and efficient processing of large datasets is essential, swift metadata access and management are critical for optimal system performance. Any delays in metadata retrieval can negatively impact user experience, resulting in decreased productivity and satisfaction. What is Atlas?
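The performance concern above is usually addressed with caching; here is a minimal TTL-cache sketch for metadata lookups, with a stand-in fetch function (this is not Atlas's actual design):

```python
import time

TTL_S = 60.0  # hypothetical freshness window for cached metadata
_cache: dict[str, tuple[float, dict]] = {}

def fetch_metadata_from_store(table: str) -> dict:
    # Stand-in for a slow round trip to the metadata service.
    return {"table": table, "columns": ["id", "ts", "value"]}

def get_metadata(table: str) -> dict:
    """Serve metadata from memory when fresh, hitting the store only on
    a miss or after the TTL expires, to keep query planning fast."""
    hit = _cache.get(table)
    if hit and time.monotonic() - hit[0] < TTL_S:
        return hit[1]
    meta = fetch_metadata_from_store(table)
    _cache[table] = (time.monotonic(), meta)
    return meta

print(get_metadata("sales"))   # miss: fetches from the store
print(get_metadata("sales"))   # hit: served from cache
```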
3.1. Use standard patterns that progressively transform your data
3.2. Ensure data is valid before exposing it to its consumers (aka data quality checks)
3.3. Avoid data duplicates with idempotent pipelines
3.4. Write DRY code & keep I/O separate from data transformation
3.5.
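A minimal sketch of items 3.2 and 3.3 from the list above: validate before exposing, and make re-runs safe by overwriting a deterministic partition instead of appending (all names illustrative):

```python
def validate(rows: list[dict]) -> list[dict]:
    """Data quality check (3.2): fail fast before consumers see bad data."""
    for r in rows:
        if r.get("id") is None or r.get("amount", 0) < 0:
            raise ValueError(f"bad row: {r}")
    return rows

# A dict keyed by run date stands in for partitioned storage.
warehouse: dict[str, list[dict]] = {}

def load(run_date: str, rows: list[dict]) -> None:
    """Idempotent write (3.3): the same run_date overwrites, never appends,
    so retrying a failed run cannot double-count."""
    warehouse[run_date] = validate(rows)

load("2024-01-01", [{"id": 1, "amount": 10}])
load("2024-01-01", [{"id": 1, "amount": 10}])  # retry: still one copy
print(len(warehouse["2024-01-01"]))  # 1
```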