This acquisition delivers access to trusted data so organizations can build reliable AI models and applications by combining data from anywhere in their environment. It leverages knowledge graphs to keep track of all the data sources and data flows, using AI to fill the gaps so you have the most comprehensive metadata management solution.
Summary: Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. To level up its value, a new trend of active metadata is emerging, enabling use cases like keeping BI reports up to date, auto-scaling your warehouses, and automating data governance.
A €150K ($165K) grant, three people, and 10 months to build it. Benchmarking results for each instance type are stored in the sc-inspector-data repo in git and in their database, together with the benchmarking task hash and other metadata.
The data warehouse solved for performance and scale but, much like the databases that preceded it, relied on proprietary formats to build vertically integrated systems. Iceberg tables become interoperable while maintaining ACID compliance by adding a layer of metadata to the data files in a user's object storage.
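To make that metadata layer concrete, here is a minimal sketch using PyIceberg; the catalog name and table identifier are placeholder assumptions, not anything from the article.

```python
# Minimal sketch with PyIceberg; "my_catalog" and "analytics.orders"
# are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

# Load a catalog configured elsewhere (e.g. in ~/.pyiceberg.yaml).
catalog = load_catalog("my_catalog")
table = catalog.load_table("analytics.orders")

# The table's state is just metadata pointing at immutable data files
# in object storage; snapshots provide ACID-style isolation and time travel.
print(table.metadata_location)  # e.g. s3://bucket/warehouse/.../v3.metadata.json
for snapshot in table.metadata.snapshots:
    print(snapshot.snapshot_id, snapshot.timestamp_ms)
```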
Summary: A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
Enterprises are encouraged to experiment with AI, build numerous small-scale agents, learn from each, and expand their agent infrastructure over time. Moreover, we anticipate a growing emphasis on intelligent data platforms that unify data and metadata, further supported by efforts to enhance data cataloging and lineage tracking.
Over the multiple decades I’ve spent in the data industry, one observation has remained nearly constant: the majority of the work in building a data analytics platform revolves around data transformations (what we used to call “the T in ETL or ELT”). For the future, our automation tools must collect and manage metadata at the column level.
Key Takeaways: Prioritize metadata maturity as the foundation for scalable, impactful data governance. Integrate data governance and data quality practices to create a seamless user experience and build trust in your data. The panel agreed that metadata maturity is essential for scalability and driving business outcomes.
Jean-Georges Perrin was tasked with designing a new data platform implementation at PayPal and wound up building a data mesh. It's supposed to make building smarter, faster, and more flexible data infrastructure a breeze, but it ends up being anything but that. We feel your pain, and we do this for one simple reason: because time matters.
Summary: Building data products is an undertaking that has historically required substantial investments of time and talent. In this episode he explains the design of the platform and how it builds on agile development principles to help you focus on delivering value. Atlan is the metadata hub for your data ecosystem.
These insights have shaped the design of our foundation model, enabling a transition from maintaining numerous small, specialized models to building a scalable, efficient system. Therefore, it's also important to let foundation models use metadata about entities and inputs, not just member interaction data.
We discovered that a flexible and incremental approach was necessary to onboard the wide variety of systems and languages used in building Meta's products. We're upholding that by investing our vast engineering capabilities into building cutting-edge privacy technology. We believe that privacy drives product innovation.
While data products may have different definitions in different organizations, in general a data product is seen as a data entity that contains data and metadata curated for a specific business purpose. A data fabric weaves together different data management tools, metadata, and automation to create a seamless architecture.
Building Meta’s GenAI infrastructure — two 24k-GPU clusters, and growing. Attributing Snowflake cost to whom it belongs — Fernando shares ideas about using metadata management to attribute Snowflake costs more accurately. I'm speechless. This is Croissant.
In this episode Nick King discusses how you can be intentional about data creation in your applications and services to reduce the friction and errors involved in building data products and ML applications. Atlan is the metadata hub for your data ecosystem. And don’t forget to thank them for their continued support of this show!
It's supposed to make building smarter, faster, and more flexible data infrastructure a breeze, but it ends up being anything but that. We feel your pain. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs.
Part 2: Navigating Ambiguity. By Varun Khaitan, with special thanks to my stunning colleagues Mallika Rao, Esmir Mesic, and Hugo Marques. Building on the foundation laid in Part 1, where we explored the what behind the challenges of title launch observability at Netflix, this post shifts focus to the how.
Snowflake’s support for Iceberg Tables is now in public preview, helping customers build and integrate Snowflake into their lake architecture. A benefit of the GLUE catalog integration in comparison to OBJECT_STORE is easier table refresh since GLUE doesn’t require a specific metadata file path, while OBJECT_STORE does.
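As a hedged illustration of that refresh difference, the sketch below issues the two styles of refresh through the Snowflake Python connector; connection parameters, table names, and the metadata path are placeholders, and the exact statement forms should be checked against Snowflake's current documentation.

```python
# Hedged sketch; account details, table names, and the metadata path
# below are invented placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# With a Glue catalog integration, Snowflake can resolve the latest
# metadata on its own:
cur.execute("ALTER ICEBERG TABLE glue_orders REFRESH")

# With an OBJECT_STORE catalog integration, a refresh must name the
# specific metadata file to advance to:
cur.execute(
    "ALTER ICEBERG TABLE obj_store_orders REFRESH 'metadata/v5.metadata.json'"
)
```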
In this episode Balaji Ganesan shares how his experiences building and maintaining Ranger in previous roles helped him understand the needs of organizations and engineers as they define and evolve their data governance policies and practices. Acryl [link]: The modern data stack needs a reimagined metadata management platform.
The idea is to transpose these 7 principles to data pipelines, knowing that data pipelines are 100% flexible: if you have the skills, you can build whatever pipeline you want. We have two teams: one builds the pipelines and the other maintains them. This is where you have all the main tools to improve manufacturing processes.
We expect that over the coming years, structured data is going to become heavily integrated into AI workflows and that dbt will play a key role in building and provisioning this data. We are committed to building the data control plane that enables AI to reliably access structured data from across your entire data lineage.
In the realm of modern analytics platforms, where rapid and efficient processing of large datasets is essential, swift metadata access and management are critical for optimal system performance. Any delays in metadata retrieval can negatively impact user experience, resulting in decreased productivity and satisfaction. What is Atlas?
On the flip side, there was a substantial appetite to build real-time ML systems from developers at Lyft. Shortly after we built it, it was utilized by another pod within our team to build a Real-time Anomaly Detection product. To meet the needs of our customers, we kicked off the Real-time Machine Learning with Streaming initiative.
You can also add metadata on models (in YAML). docs — in dbt you can add metadata on everything; some of the metadata is already expected by the framework, and thanks to it you can generate a small web page with your lightweight catalog inside: you only need to run dbt docs generate and dbt docs serve.
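As a small illustration, dbt-core 1.5+ also exposes a programmatic runner, so the same command can be invoked from Python; this is a sketch, assuming it runs from the root of a dbt project.

```python
# Minimal sketch, assuming dbt-core >= 1.5 and a working dbt project
# in the current directory.
from dbt.cli.main import dbtRunner

dbt = dbtRunner()

# Equivalent to `dbt docs generate`: compiles the project and writes
# manifest.json / catalog.json, including YAML-declared metadata
# (descriptions, meta blocks) on models and columns.
result = dbt.invoke(["docs", "generate"])
print("success:", result.success)

# `dbt docs serve` would then host the generated site locally; it is
# usually run from the CLI since it blocks on a web server.
```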
Summary: Five years of hosting the Data Engineering Podcast has provided Tobias Macey with a wealth of insight into the work of building and operating data systems at a variety of scales and for myriad purposes. Atlan is the metadata hub for your data ecosystem. Struggling with broken pipelines? Stale dashboards? Missing data?
In order to build high-quality data lineage, we developed different techniques to collect data flow signals across different technology stacks: static code analysis for different languages, runtime instrumentation, and input and output data matching, etc. Lineage can also be extended to other use cases such as security and integrity.
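A toy sketch of the input/output matching signal (not Meta's implementation): fingerprint what each job reads and writes, then connect jobs whose outputs feed other jobs' inputs. All job and dataset names are hypothetical.

```python
# Toy lineage via input/output data matching: identical bytes get the
# same fingerprint, so a writer and its downstream readers can be linked.
import hashlib
from collections import defaultdict

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Stand-in datasets (in reality: files in object storage).
datasets = {
    "raw_orders": b"order_id,amount\n1,10\n2,25\n",
    "fct_orders": b"order_id,amount,discounted\n1,9\n2,22\n",
}

# Runtime signals: which datasets each job read and wrote.
observed = {
    "ingest_orders": {"reads": [], "writes": ["raw_orders"]},
    "build_facts":   {"reads": ["raw_orders"], "writes": ["fct_orders"]},
}

# Map each output fingerprint to its producing job.
producers = defaultdict(list)
for job, io in observed.items():
    for name in io["writes"]:
        producers[fingerprint(datasets[name])].append(job)

# An edge exists where one job reads what another job wrote.
edges = {
    (producer, job)
    for job, io in observed.items()
    for name in io["reads"]
    for producer in producers[fingerprint(datasets[name])]
}
print(edges)  # {('ingest_orders', 'build_facts')}
```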
In August, we wrote about how in a future where distributed data architectures are inevitable, unifying and managing operational and business metadata is critical to successfully maximizing the value of data, analytics, and AI. It is a critical feature for delivering unified access to data in distributed, multi-engine architectures.
Did someone say Metadata? There are even folks who create dashboards from this metadata to help other engineers identify expensive copying, use of inefficient or inappropriate C++ containers, overuse of smart pointers, and much more. Looking at function call stacks with flame graphs is great, nothing against it.
This ecosystem includes: Catalogs, the services that manage metadata about Iceberg tables; Maintenance processes, the operations that optimize Iceberg tables, such as compacting small files and managing metadata. Metadata overhead: Iceberg relies heavily on metadata to track table changes and enable features like time travel.
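As a hedged sketch of those maintenance processes, Iceberg ships Spark procedures for compaction and snapshot expiry; the snippet below assumes a SparkSession already configured with an Iceberg catalog named my_catalog, and the table name is a placeholder.

```python
# Hedged sketch of routine Iceberg maintenance using the Spark
# procedures Iceberg provides; assumes Spark is configured with an
# Iceberg catalog called `my_catalog`. "db.events" is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact many small data files into fewer, larger ones.
spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.events')")

# Expire old snapshots so the metadata backing time travel stays bounded.
spark.sql(
    "CALL my_catalog.system.expire_snapshots("
    "table => 'db.events', retain_last => 10)"
)
```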
How to build a modern, scalable data platform to power your analytics and data science projects (updated). Table of Contents: What's changed? Orchestration. I mentioned modularity as a core concept of building a modern data platform in my 2021 article, but I failed to emphasize the importance of data orchestration.
To help customers overcome these challenges, RudderStack and Snowflake recently launched Profiles, a new product that allows every data team to build a customer 360 directly in their Snowflake Data Cloud environment. Now teams can leverage their existing data engineering tools and workflows to build their customer 360.
The Snowpark Model Registry now builds on a native Snowflake model entity with built-in versioning support, role-based access control and a SQL API for more streamlined management catering to both SQL and Python users. What’s Next?
The vast majority of Rust projects use Cargo as their build tool. Cargo is great when you are developing and packaging a single Rust library or application, but when it comes to a fast-growing, complex workspace, one can be attracted to the idea of using a more flexible and scalable build system.
Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode.
When AI is only as trustworthy as the data it’s trained on, you must prioritize data governance, quality, and overall integrity – whether building new AI solutions or refining existing ones. According to Anandarajan, building a culture of data literacy is what will help to bridge this gap. Focus on metadata management.
The author walks through three broad categories of evaluation-driven development: functional correctness, AI-as-a-judge, and comparative evaluation. [link] OpenAI: A practical guide to building agents — OpenAI publishes a comprehensive guide on building AI agents. The guide walks through three core components of AI agents.
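As a rough sketch of the AI-as-a-judge / comparative style (not taken from OpenAI's guide), one can ask a judge model to pick between two candidate answers; the model name, prompt wording, and parsing below are all assumptions.

```python
# Hedged sketch of comparative AI-as-a-judge evaluation; model choice
# and prompt format are assumptions, not a documented recipe.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(task: str, answer_a: str, answer_b: str) -> str:
    """Ask a judge model which of two agent answers better solves the task."""
    prompt = (
        f"Task: {task}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}\n\n"
        "Which answer solves the task better? Reply with exactly 'A' or 'B'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

winner = judge(
    "Summarize the outage report in one sentence.",
    "The outage happened.",
    "A config push on 2024-05-01 overloaded the cache tier for 42 minutes.",
)
print("judge prefers:", winner)
```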
Meanwhile, operations teams use entity extraction on documents to automate workflows and enable metadata-driven analytical filtering. Skai deployed a categorization tool in just two days to help its customers get better insights about purchasing patterns by building categories that make sense across multiple ecommerce platforms.
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake.
Below is a diagram describing how I think data platforms can be schematized. Data storage — you need to store data in an efficient, interoperable manner, from the freshest data to the oldest, along with the metadata. It adds metadata, reads, writes, and transactions that allow you to treat a Parquet file as a table. But what does Tabular do?
In this blog, we'll address this challenge by building a metadata-driven solution using a JavaScript stored procedure that dynamically maps and loads only the required columns from multiple CSV files into their respective Snowflake tables. Step 4: Execute the stored procedure.
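The post implements this as a JavaScript stored procedure; as a hedged Python analogue of the same metadata-driven idea, the sketch below drives column-selective COPY INTO statements from a mapping table. The stage, file, table, and column names are invented placeholders.

```python
# Hedged Python analogue of the metadata-driven load; all names below
# are placeholders, and the real post uses a JavaScript stored procedure.
import snowflake.connector

# Flattened metadata: which CSV columns ($1, $2, ...) feed which
# target columns of which table.
mappings = [
    {"file": "orders.csv",    "table": "ORDERS",
     "columns": {"ORDER_ID": "$1", "AMOUNT": "$3"}},
    {"file": "customers.csv", "table": "CUSTOMERS",
     "columns": {"CUSTOMER_ID": "$1", "EMAIL": "$2"}},
]

conn = snowflake.connector.connect(account="...", user="...", password="...")
cur = conn.cursor()

for m in mappings:
    targets = ", ".join(m["columns"].keys())
    sources = ", ".join(m["columns"].values())
    # COPY with a transformation selects only the required CSV columns.
    cur.execute(
        f"COPY INTO {m['table']} ({targets}) "
        f"FROM (SELECT {sources} FROM @csv_stage/{m['file']}) "
        f"FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
    )
```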
Our team, the Developer Infrastructure team, aims to build the best tools to enable microservice owners (our “customers”) to reliably and quickly test changes in a local and/or end-to-end environment. Routing override metadata: embed metadata in API request headers defining which offloaded deployment the request will get routed to.
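A toy sketch of what such a routing override might look like from the client side; the header name, gateway URL, and deployment id are hypothetical, not Lyft's actual convention.

```python
# Toy client-side sketch of header-based routing overrides; the header
# name, URL, and deployment id are invented for illustration.
import requests

OVERRIDE_HEADER = "x-routing-override"      # assumed header name
deployment_id = "my-feature-branch-deploy"  # offloaded deployment to target

resp = requests.get(
    "https://gateway.internal.example.com/api/rides",
    headers={OVERRIDE_HEADER: deployment_id},
)
# Proxies along the call path read this metadata and route the request
# to the offloaded deployment instead of the shared baseline environment.
print(resp.status_code)
```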
This work illustrates our effort in successfully building an internal embedding-based retrieval system at Pinterest for organic content, learned purely from logged user engagement events, and serving it in production. The metadata is generated together with the index. We have deployed our system for homefeed as well as notifications.
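A toy sketch of the serving-side idea: score items by the inner product between a user embedding and item embeddings kept alongside their metadata. This stands in for the learned model and ANN index described in the post.

```python
# Toy embedding-based retrieval: brute-force dot-product scoring over
# random embeddings; production systems use a learned model and an ANN
# index, with item metadata generated alongside the index.
import numpy as np

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(10_000, 64)).astype(np.float32)
item_metadata = [f"pin_{i}" for i in range(10_000)]  # stored with the index

def retrieve(user_embedding: np.ndarray, k: int = 5) -> list[str]:
    scores = item_embeddings @ user_embedding
    top = np.argpartition(scores, -k)[-k:]           # top-k, unordered
    top = top[np.argsort(scores[top])[::-1]]         # order by score
    return [item_metadata[i] for i in top]

user = rng.normal(size=64).astype(np.float32)
print(retrieve(user))
```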