Additionally, multiple copies of the same data locked in proprietary systems contribute to version-control issues, redundancy, staleness, and management headaches. The platform leverages knowledge graphs to keep track of all the data sources and data flows, using AI to fill the gaps so you have the most comprehensive metadata management solution.
Modern large-scale recommendation systems usually include multiple stages: retrieval selects candidates from pools of billions of items, and ranking predicts which items a user is likely to engage with from the trimmed candidate set produced by the earlier stages [2]. (Figure: general multi-stage recommendation system design at Pinterest.)
Summary Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. A variety of platforms have been developed to capture and analyze that information to great effect, but they are inherently limited in their utility due to their nature as storage systems.
The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. In this blog, we will discuss: what is an Open Table Format (OTF)? These systems are built on open standards and offer immense analytical and transactional processing flexibility.
It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta's systems. In this blog, we will delve into an early stage in PAI implementation: data lineage. Data lineage enables us to efficiently navigate these assets and protect user data.
It can manage billions of small and large files that are difficult to handle by other distributed file systems. As an important part of achieving better scalability, Ozone separates metadata management among different services: the Ozone Manager (OM) service manages the metadata of the namespace, such as volumes, buckets, and keys.
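The volume/bucket/key hierarchy that the Ozone Manager tracks can be pictured with a toy namespace store. This is a purely illustrative sketch in plain Python, not Ozone's actual API; in real Ozone, block-location metadata lives in a separate service (SCM), which is the point of the separation described above.

```python
# Toy sketch of an OM-style namespace store: volume -> bucket -> key.
# Illustrative only; not Apache Ozone's real data structures or API.
class ToyNamespaceStore:
    def __init__(self):
        self.volumes = {}  # volume -> {bucket -> {key -> metadata}}

    def create_volume(self, volume):
        self.volumes.setdefault(volume, {})

    def create_bucket(self, volume, bucket):
        self.volumes[volume].setdefault(bucket, {})

    def put_key(self, volume, bucket, key, size):
        # Only namespace metadata is stored here; block locations would
        # be the responsibility of a separate service in real Ozone.
        self.volumes[volume][bucket][key] = {"size": size}

    def list_keys(self, volume, bucket):
        return sorted(self.volumes[volume][bucket])
```

Keeping the namespace map separate from block management is what lets each service scale independently.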
The key to those solutions is a robust and flexible metadata management system. LinkedIn has gone through several iterations on the most maintainable and scalable approach to metadata, leading them to their current work on DataHub. What were you using at LinkedIn for metadata management prior to the introduction of DataHub?
By Ko-Jen Hsiao , Yesu Feng and Sudarshan Lamkhede Motivation Netflixs personalized recommender system is a complex system, boasting a variety of specialized machine learned models each catering to distinct needs including Continue Watching and Todays Top Picks for You. Refer to our recent overview for more details).
Summary The majority of blog posts and presentations about data engineering and analytics assume that the consumers of those efforts are internal business users accessing an environment controlled by the business. Atlan is the metadata hub for your data ecosystem. How is everyone going to find the data they need, and understand it?
The blog highlights how moving from a 6-character base-64 prefix to a 20-digit base-2 prefix for file distribution spreads objects more evenly across S3 and reduces request failures. The blog is a good summary of how to use Snowflake QUERY_TAG to measure and monitor query performance.
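The prefix idea can be sketched in a few lines: derive a fixed-width binary string from a hash of the key and use it as the leading path component, so objects spread evenly across S3 partitions. The hash choice and widths below are assumptions for illustration, not the blog's exact scheme.

```python
# Hedged sketch: map a logical key to a 20-digit base-2 prefix so that
# writes fan out across S3 partitions. SHA-256 is an illustrative choice.
import hashlib

def binary_prefix(key: str, bits: int = 20) -> str:
    digest = hashlib.sha256(key.encode()).digest()
    value = int.from_bytes(digest[:4], "big") >> (32 - bits)
    return format(value, f"0{bits}b")  # zero-padded, e.g. '01101...'
```

Because the prefix is uniformly distributed, hot keys no longer concentrate requests on a single S3 partition.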
In August, we wrote about how in a future where distributed data architectures are inevitable, unifying and managing operational and business metadata is critical to successfully maximizing the value of data, analytics, and AI. It is a critical feature for delivering unified access to data in distributed, multi-engine architectures.
The Kafka Streams API boasts a number of capabilities that make it well suited for maintaining the global state of a distributed system. At Imperva, we took advantage of Kafka Streams to build shared state microservices that serve as fault-tolerant, highly available single sources of truth about the state of objects in our system.
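The "single source of truth" pattern above can be reduced to its essence: replaying a compacted changelog of (key, value) events, where the latest write wins and a null value acts as a tombstone. This is a plain-Python sketch of the idea, not the Kafka Streams API itself.

```python
# Minimal model of a compacted-changelog state store: replaying events
# in order materializes the current state of every object. A value of
# None is treated as a tombstone, as in Kafka log compaction.
def materialize(changelog):
    state = {}
    for key, value in changelog:
        if value is None:
            state.pop(key, None)  # tombstone removes the key
        else:
            state[key] = value    # latest write wins
    return state
```

Fault tolerance in the real system follows from the same property: any replica that replays the changelog converges to the same state.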
This blog post expands on that insightful conversation, offering a critical look at Iceberg's potential and the hurdles organizations face when adopting it. Hadoop achieved this through distributed processing and storage, using a framework called MapReduce and the Hadoop Distributed File System (HDFS). (Trino, Spark, Snowflake, DuckDB.)
This will allow a data office to implement access policies over metadata management assets like tags or classifications, business glossaries, and data catalog entities, laying the foundation for comprehensive data access control. First, a set of initial metadata objects are created by the data steward.
In this blog post, we will talk about a single Ozone cluster with the capabilities of both Hadoop Core File System (HCFS) and Object Store (like Amazon S3): a unified storage architecture that can store both files and objects and provide a flexible, scalable, and high-performance system.
In this blog post, we’ll discuss the methods we used to ensure a successful launch, including: How we tested the system Netflix technologies involved Best practices we developed Realistic Test Traffic Netflix traffic ebbs and flows throughout the day in a sinusoidal pattern. Basic with ads was launched worldwide on November 3rd.
Metadata is the information that provides context and meaning to data, ensuring it's easily discoverable, organized, and actionable. Imagine a library with millions of books but no catalog system to organize them. Chaos, right? This is what managing data without metadata feels like. What is metadata?
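The library-catalog analogy can be made concrete with a toy catalog that makes datasets discoverable by tag. All names here are invented for the example; real metadata platforms add lineage, ownership, and policy on top of this core lookup.

```python
# Toy metadata catalog illustrating the analogy above: without the
# catalog, finding a dataset means scanning everything; with it,
# discovery is a simple indexed lookup.
class ToyCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, name, owner, tags):
        self._entries[name] = {"owner": owner, "tags": set(tags)}

    def find_by_tag(self, tag):
        return sorted(n for n, e in self._entries.items()
                      if tag in e["tags"])
```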
Foundation Capital: A System of Agents brings Service-as-Software to life: software is no longer simply a tool for organizing work; software becomes the worker itself, capable of understanding, executing, and improving upon traditionally human-delivered services. It's good to know about Dapr and restate.dev.
I honestly don’t have a solid answer, but this blog is an excellent overview of upskilling. The author emphasizes the importance of mastering state management, understanding "local first" data processing (prioritizing single-node solutions before distributed systems), and leveraging an asset graph approach for data pipelines.
It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile's exposure. In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily.
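The core bookkeeping described above, tracking how often each profile has seen each title, can be sketched with a simple counter keyed by (profile, title). This is an illustrative toy, not Netflix's actual impression system, which operates at billions of events per day.

```python
# Sketch of per-profile impression history: downstream logic can use
# these counts to cap repeat impressions or diversify what is shown.
from collections import defaultdict

class ImpressionHistory:
    def __init__(self):
        self._counts = defaultdict(int)

    def record(self, profile_id, title_id):
        self._counts[(profile_id, title_id)] += 1

    def seen(self, profile_id, title_id):
        return self._counts[(profile_id, title_id)]
```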
We are pleased to announce that Cloudera has been named a Leader in the 2022 Gartner® Magic Quadrant for Cloud Database Management Systems. Many of our customers use multiple solutions—but want to consolidate data security, governance, lineage, and metadata management, so that they don't have to work with multiple vendors.
This is part of our series of blog posts on recent enhancements to Impala. The commonly accepted best practice in database system design for years has been to use an exhaustive search strategy to consider all the possible variations of specific database operations in a query plan. Metadata caching: more on this below.
This blog is a collection of those insights, but for the full trendbook, we recommend downloading the PDF. Strong data governance also lays the foundation for better model performance, cost efficiency, and improved data quality, which directly contributes to regulatory compliance and more secure AI systems. No problem!
In this blog post, we’ll explore what CDC is, why it’s important, and our journey of implementing Generic CDC solutions for all online databases at Pinterest. This is crucial for applications that require up-to-date information, such as fraud detection systems or recommendation engines. What is Change Data Capture?
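Consuming CDC boils down to applying a stream of row-level change events to a downstream replica so it stays in sync with the source. The event shape below is an assumption for illustration; real pipelines (e.g. Debezium-style) carry richer envelopes, but the apply logic is the same.

```python
# Hedged sketch of CDC consumption: each event names an operation and a
# primary key; inserts/updates upsert the row, deletes remove it.
def apply_cdc(replica, events):
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            replica[key] = event["row"]
        elif op == "delete":
            replica.pop(key, None)
    return replica
```

This is what keeps a fraud-detection feature store or a recommendation cache current without re-scanning the source database.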
In our previous blogs, we explored how Picnic's Page Platform transformed the way we build new features, enabling faster iteration, tighter collaboration, and less feature-specific complexity. In this blog, we'll dive into how we configure the pages within Picnic's store. Now, we're bringing this same principle to our app.
That’s a dangerous mistake: with the advent of IaC for the cloud, configuration has become an important aspect of modern software systems, and a critical point of failure. A REPL (nickel repl), a markdown documentation generator (nickel doc), and a nickel query command to retrieve metadata, types, and contracts from code.
It covers nine categories: storage systems, data lake platforms, processing, integration, orchestration, infrastructure, ML/AI, metadata management, and analytics. I found the blog to be a comprehensive roadmap for data engineering in 2025. Let me know in the comments.
In this blog post, we’ll explore key strategies that data teams should adopt to prepare for the year ahead. On average, engineers spend over half of their time maintaining existing systems rather than developing new solutions. Instead of driving innovation, data engineers often find themselves bogged down with maintenance tasks.
What are the technical systems that you are relying on to power the different data domains? What is your philosophy on enforcing uniformity in technical systems vs. relying on interface definitions as the unit of consistency?
Behind the scenes, a myriad of systems and services are involved in orchestrating the product experience. These backend systems are consistently being evolved and optimized to meet and exceed customer and product expectations. This blog series will examine the tools, techniques, and strategies we have utilized to achieve this goal.
Thanks to the Netflix internal lineage system (built by Girish Lingappa), Dataflow migration can then help you identify downstream usage of the table in question. All the above commands are very likely to be described in separate future blog posts, but right now let’s focus on the dataflow sample command.
Jack Vanlightly: A Cost Analysis of Replication vs. S3 Express One Zone in Transactional Data Systems. S3 Express One Zone, with low latency and predictable write operations, is promising. Replication-based systems remain more economical at low to medium throughputs, especially with significant cross-AZ data transfer discounts.
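The shape of that trade-off can be captured in a back-of-the-envelope model: replication pays per-GB for cross-AZ transfer, while an S3 Express-style design pays per-GB storage plus per-request fees. All rates below are invented placeholders, not actual AWS prices, and the model ignores many real factors (batching, storage duration, discounts).

```python
# Toy cost model for the comparison above. Rates are illustrative
# placeholders only; plug in real pricing before drawing conclusions.
def replication_cost(gb_transferred, cross_az_rate=0.01, copies=2):
    # each write crosses availability zones (copies - 1) times
    return gb_transferred * cross_az_rate * (copies - 1)

def s3_express_cost(gb_transferred, put_requests,
                    per_gb=0.0032, per_put=0.0000025):
    return gb_transferred * per_gb + put_requests * per_put
```

With per-request fees dominating at high operation counts, which side wins depends mostly on how well writes can be batched into fewer, larger PUTs.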
Yet, while retrieval is a fundamental component of any AI application stack, creating a high-quality, high-performance RAG system remains challenging for most enterprises. It supports “fuzzy” search — the service takes in natural language queries and returns the most relevant text results, along with associated metadata.
When dealing with failures in a microservice system, localized mitigation mechanisms like load shedding and circuit breakers have always been used, but they may not be as effective as a more globalized approach. Each section is accompanied by real outages drawn from our past blog posts that can be explored in greater detail.
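One of the localized mechanisms named above, the circuit breaker, is easy to sketch: after a threshold of consecutive failures the breaker opens and short-circuits further calls instead of letting them pile onto a failing dependency. The threshold and behavior here are illustrative assumptions (real breakers also add half-open probing and timeouts).

```python
# Minimal circuit breaker sketch: trips open after N consecutive
# failures; while open, calls fail fast without touching the dependency.
class CircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False

    def call(self, fn, *args):
        if self.open:
            raise RuntimeError("circuit open: request short-circuited")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True
            raise
        self.failures = 0  # any success resets the streak
        return result
```

The article's point is that such mechanisms act locally; they cannot see, for example, that shedding load here is overloading a sibling service elsewhere.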
In this three-part blog post series, we introduce you to Psyberg, our incremental data processing framework designed to tackle such challenges! Late-arriving data is essentially delayed data due to system retries, network delays, batch processing schedules, system outages, delayed upstream workflows, or reconciliation in source systems.
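The late-data problem described above can be illustrated by splitting an incoming batch against an event-time watermark: on-time records flow into the current partition, while late arrivals are routed for reprocessing. The record shape and watermark policy are assumptions for illustration, not Psyberg's actual design.

```python
# Sketch of watermark-based routing for late-arriving data: anything
# with an event time before the watermark is treated as late.
def split_by_watermark(records, watermark):
    on_time, late = [], []
    for rec in records:
        (late if rec["event_time"] < watermark else on_time).append(rec)
    return on_time, late
```

An incremental framework would then reprocess only the partitions the late records belong to, instead of recomputing everything.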
With a reasonably sizable footprint of servers in data centers, LinkedIn is responsible for ensuring that these hosts are always on an operating system (OS) version deemed the “latest and greatest” for all intents and purposes. An OS snapshot is a collection of boot files (initrd, vmlinuz), RPMs, and a few extra pieces of metadata.
Apache Ozone enhancements deliver full high availability, providing customers with enterprise-grade object storage and compatibility with the Hadoop Compatible File System and S3 API. We expand on this feature later in this blog. Figure 8: Data lineage based on Kafka Atlas Hook metadata.
This discipline also integrates specialization around the operation of so-called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and computation at scale. Those systems have been taught to normalize the data for storage on their own.
The blog highlights the advantages of GNNs over traditional machine learning models, which struggle to discern relationships between various entities, such as users and restaurants, and edges, such as orders. The author highlights Paimon’s consistency model by examining the metadata model.
Edgar helps Netflix teams troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata. The more complex a system, the more places to look for clues. In an earlier blog post, we discussed Telltale , our health monitoring system.
Companies such as Adobe, Expedia, LinkedIn, Tencent, and Netflix have published blogs about their Apache Iceberg adoption for processing their large-scale analytics datasets. In CDP we enable Iceberg tables side by side with the Hive table types, both of which are part of our SDX metadata and security framework.
Liang Ma | Software Engineer, Core Eng; Wei Zhu | Software Engineer, Observability. In early 2020, during a critical iOS out-of-memory incident (we have a blog post about that), we realized that we didn’t have much visibility into how the app was running, nor a good system for monitoring and troubleshooting. Nothing else is required.