Additionally, multiple copies of the same data locked in proprietary systems contribute to version control issues, redundancies, staleness, and management headaches. The platform leverages knowledge graphs to keep track of all the data sources and data flows, using AI to fill the gaps so you have the most comprehensive metadata management solution.
Summary: Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. A variety of platforms have been developed to capture and analyze that information to great effect, but they are inherently limited in their utility due to their nature as storage systems.
Modern large-scale recommendation systems usually include multiple stages, where retrieval aims at retrieving candidates from billions of candidate pools, and ranking predicts which item a user tends to engage with from the trimmed candidate set retrieved from earlier stages [2]. (Figure: general multi-stage recommendation system design at Pinterest.)
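The retrieval-then-ranking pattern described above can be sketched in a few lines; the embeddings, scoring functions, and tiny candidate pool below are illustrative stand-ins, not Pinterest's actual models.

```python
import math

def retrieve(user_vec, item_vecs, k):
    # Stage 1: cheap dot-product retrieval trims a huge pool to k candidates.
    scored = [(sum(u * v for u, v in zip(user_vec, vec)), item_id)
              for item_id, vec in item_vecs.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]

def rank(user_vec, candidates, item_vecs):
    # Stage 2: a (placeholder) heavier model re-orders the trimmed set.
    def score(item_id):
        dot = sum(u * v for u, v in zip(user_vec, item_vecs[item_id]))
        return 1 / (1 + math.exp(-dot))  # pretend engagement probability
    return sorted(candidates, key=score, reverse=True)

# Toy pool standing in for "billions" of candidates.
items = {"a": [0.9, 0.1], "b": [0.2, 0.8], "c": [0.7, 0.3], "d": [0.1, 0.1]}
user = [1.0, 0.2]
candidates = retrieve(user, items, k=3)
ranking = rank(user, candidates, items)
```

In production the two stages typically use different models entirely: an approximate nearest-neighbor index for retrieval and a learned ranker for the final ordering.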
We're explaining the end-to-end systems the Facebook app leverages to deliver relevant content to people. At Facebook's scale, the systems built to support and overcome these challenges require extensive trade-off analyses, focused optimizations, and architecture built to allow our engineers to push for the same user and business outcomes.
The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. These systems are built on open standards and offer immense analytical and transactional processing flexibility. These formats are transforming how organizations manage large datasets.
Results are stored in git and their database, together with benchmarking metadata. We recently covered how CockroachDB joins the trend of moving from open source to proprietary, and why Oxide decided to keep using it with self-support regardless. Web hosting: Netlify, chosen thanks to their super smooth preview system with SSR support.
Investment in an Agent Management System (AMS) is crucial, as it offers a framework for scaling, monitoring, and refining AI agents. AI engineers, in particular, will find their skills in high demand as they navigate managing and optimizing agents to ensure reliability within enterprise systems.
By Ko-Jen Hsiao, Yesu Feng and Sudarshan Lamkhede. Motivation: Netflix's personalized recommender system is a complex one, boasting a variety of specialized machine-learned models, each catering to distinct needs including Continue Watching and Today's Top Picks for You. (Refer to our recent overview for more details.)
In this episode Ian Schweer shares his experiences at Riot Games supporting player-focused features such as machine learning models and recommender systems that are deployed as part of the game binary. Atlan is the metadata hub for your data ecosystem. And don’t forget to thank them for their continued support of this show!
With data volumes skyrocketing, and complexities increasing in variety and platforms, traditional centralized data management systems often struggle to keep up. A data fabric weaves together different data management tools, metadata, and automation to create a seamless architecture.
The data warehouse solved for performance and scale but, much like the databases that preceded it, relied on proprietary formats to build vertically integrated systems. Iceberg tables become interoperable while maintaining ACID compliance by adding a layer of metadata to the data files in a user's object storage.
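Iceberg's real metadata tree (snapshots, manifest lists, manifests) is richer than this, but the core idea of a metadata layer naming immutable data files, so that commits are atomic snapshot swaps, can be sketched roughly as follows (the class and file paths are hypothetical):

```python
class ToyIcebergTable:
    """Each commit records a new immutable snapshot listing data files;
    readers always see one consistent snapshot, which is what gives
    ACID-style isolation over plain files in object storage."""

    def __init__(self):
        self.snapshots = [[]]  # snapshot 0: empty file list

    def commit(self, added_files):
        current = self.snapshots[-1]
        # The new snapshot is a fresh list; the old one is never mutated.
        self.snapshots.append(current + list(added_files))

    def scan(self, snapshot_id=None):
        sid = len(self.snapshots) - 1 if snapshot_id is None else snapshot_id
        return self.snapshots[sid]

t = ToyIcebergTable()
t.commit(["s3://bucket/data/00000.parquet"])
t.commit(["s3://bucket/data/00001.parquet"])
latest = t.scan()
time_travel = t.scan(snapshot_id=1)  # read an older snapshot
```

Because old snapshots are never mutated, "time travel" queries against a prior table state fall out of the design for free.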
It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta's systems. It enhances the traceability of data flows within systems, ultimately empowering developers to swiftly implement privacy controls and create innovative products. (Hack, C++, Python, etc.)
In this case, the main stakeholders are the Title Launch Operators, whose role is setting up the title and its metadata in our systems. In this context, we're focused on developing systems that ensure successful title launches, build trust between content creators and our brand, and reduce engineering operational overhead.
Not only could this recommendation system save time browsing through lists of movies, it could also give more personalized results so users don’t feel overwhelmed by too many options. What are Movie Recommendation Systems? Recommender systems have two main categories: content-based and collaborative filtering.
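As a toy illustration of the collaborative-filtering category, one can score a user's unseen movies by the similarity-weighted ratings of other users; the data and function names here are invented for the example.

```python
import math

ratings = {  # user -> {movie: rating}; toy data for illustration
    "ann": {"Heat": 5, "Alien": 4, "Up": 1},
    "bob": {"Heat": 4, "Alien": 5, "Up": 2},
    "cat": {"Up": 5, "Coco": 4},
}

def cosine(a, b):
    # Cosine similarity over the movies two users have both rated.
    common = set(a) & set(b)
    if not common:
        return 0.0
    num = sum(a[m] * b[m] for m in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den

def recommend(user):
    # Score unseen movies by similarity-weighted ratings of other users.
    seen, scores = ratings[user], {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(seen, their)
        for movie, r in their.items():
            if movie not in seen:
                scores[movie] = scores.get(movie, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)
```

A content-based filter would instead compare movie features (genre, cast, plot text) against a profile of what the user has already liked.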
There are so many different things happening in these systems, powered by so many different technologies. Strobelight also has concurrency rules and a profiler queuing system. This provides just the right amount of data without impacting the profiled services or overburdening the systems that store Strobelight data.
ThoughtSpot prioritizes the high availability and minimal downtime of our systems to ensure a seamless user experience. In the realm of modern analytics platforms, where rapid and efficient processing of large datasets is essential, swift metadata access and management are critical for optimal system performance. What is Atlas?
This will allow a data office to implement access policies over metadata management assets like tags or classifications, business glossaries, and data catalog entities, laying the foundation for comprehensive data access control. First, a set of initial metadata objects are created by the data steward.
ERP and CRM systems are designed and built to fulfil a broad range of business processes and functions. Accessing Operational Data I used to connect to views in transactional databases or APIs offered by operational systems to request the raw data. Does it sound familiar?
Summary: Data engineering systems are complex and interconnected, with myriad and often opaque chains of dependencies. Atlan is the metadata hub for your data ecosystem: instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities.
In August, we wrote about how in a future where distributed data architectures are inevitable, unifying and managing operational and business metadata is critical to successfully maximizing the value of data, analytics, and AI. It is a critical feature for delivering unified access to data in distributed, multi-engine architectures.
The Machine Learning Platform (MLP) team at Netflix provides an entire ecosystem of tools around Metaflow, an open source machine learning infrastructure framework we started, to empower data scientists and machine learning practitioners to build and manage a variety of ML systems, both upstream (e.g., ETL workflows) and downstream.
Hadoop achieved this through distributed processing and storage, using a framework called MapReduce and the Hadoop Distributed File System (HDFS). This ecosystem includes catalogs: services that manage metadata about Iceberg tables. If not handled correctly, managing this metadata can become a bottleneck.
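The MapReduce model mentioned above can be illustrated with the classic word count, here as a single-process Python sketch of the map, shuffle, and reduce phases rather than Hadoop itself:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in one document split.
    return [(w, 1) for w in doc.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's grouped values.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big systems", "big metadata"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
```

In real Hadoop, the map and reduce tasks run on many machines with HDFS holding the inputs and outputs; the shuffle is where the distributed sort and network transfer happen.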
WhyLogs is a powerful library for flexibly instrumenting all of your data systems to understand the entire lifecycle of your data from source to productionized model. You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms.
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Sifflet also offers a 2-week free trial.
Beyond working with well-structured data in a data warehouse, modern AI systems can use deep learning and natural language processing to work effectively with unstructured and semi-structured data in data lakes and lakehouses.
The system leverages a combination of an event-based storage model in its TimeSeries Abstraction and continuous background aggregation to calculate counts across millions of counters efficiently. Grab has enhanced its LLM-powered data classification system, Metasense, to improve accuracy and minimize manual workload.
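The details of Netflix's TimeSeries Abstraction aren't reproduced here, but the general pattern of appending raw count events and periodically rolling them up into aggregates can be sketched as a toy class (all names are hypothetical):

```python
import time
from collections import defaultdict

class EventCounter:
    """Append raw count events; a rollup folds them into per-counter
    aggregates so reads stay cheap even with many events."""

    def __init__(self):
        self.events = []                  # (counter_name, delta, timestamp) log
        self.aggregated = defaultdict(int)

    def record(self, name, delta=1):
        # Fast append-only write path; no read-modify-write contention.
        self.events.append((name, delta, time.time()))

    def rollup(self):
        # In a real system this runs continuously in the background.
        for name, delta, _ in self.events:
            self.aggregated[name] += delta
        self.events.clear()

    def count(self, name):
        # Serve the aggregate plus any events not yet rolled up.
        pending = sum(d for n, d, _ in self.events if n == name)
        return self.aggregated[name] + pending

c = EventCounter()
c.record("plays")
c.record("plays", 3)
c.rollup()
c.record("plays")  # arrives after the rollup, still counted on read
```

The append-only write path is what lets such a design absorb millions of counter updates without per-counter locking; accuracy comes from merging the aggregate with the not-yet-rolled-up tail at read time.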
Meanwhile, operations teams use entity extraction on documents to automate workflows and enable metadata-driven analytical filtering. As data volumes grow and AI automation expands, cost efficiency in processing with LLMs depends on both system architecture and model flexibility.
Here we explore initial system designs we considered, an overview of the current architecture, and some important principles Meta takes into account in making data accessible and easy to understand. We also considered caching data logs in an online system capable of supporting a range of indexed per-user queries. What are data logs?
With the surge of new tools, platforms, and data types, managing these systems effectively is an ongoing challenge. Focus on metadata management. As Yoğurtçu points out, “metadata is critical” for driving insights in AI and advanced analytics. Context also enhances large language models.
What are the other systems that feed into and rely on the Trino/Iceberg service? What kinds of questions are you answering with table metadata, and what use case/team does that support? What is the comparative utility of the Iceberg REST catalog? What are the shortcomings of Trino and Iceberg? Email hosts@dataengineeringpodcast.com with your story.
In this blog post, we’ll discuss the methods we used to ensure a successful launch, including how we tested the system, the Netflix technologies involved, and the best practices we developed. Realistic Test Traffic: Netflix traffic ebbs and flows throughout the day in a sinusoidal pattern. Basic with ads was launched worldwide on November 3rd.
We are pleased to announce that Cloudera has been named a Leader in the 2022 Gartner ® Magic Quadrant for Cloud Database Management Systems. Many of our customers use multiple solutions—but want to consolidate data security, governance, lineage, and metadata management, so that they don’t have to work with multiple vendors.
3. Most platforms enable you to do the same thing but have different strengths 3.1. Understand how the platforms process data 3.1.1. A compute engine is a system that transforms data 3.1.2. Metadata catalog stores information about datasets 3.1.3. Analytical databases aggregate large amounts of data
Automated metadata management – AI-generated catalog asset descriptions significantly reduce manual efforts and improve metadata quality – enabling teams to focus on more strategic tasks. With the ability to turn functionality on or off based on business requirements, you gain full control over when and how AI is applied.
Strong data governance also lays the foundation for better model performance, cost efficiency, and improved data quality, which directly contributes to regulatory compliance and more secure AI systems. (VP of Architecture, Healthcare Industry) Organizations will focus more on metadata tagging of existing and new content in the coming years.
He also describes the considerations involved in bringing behavioral data into your systems, and the ways that he and the rest of the Snowplow team are working to make that an easy addition to your platforms. What are some of the unique characteristics of that information?
Kafka is designed to be a black box to collect all kinds of data, so Kafka doesn't have built-in schema and schema enforcement; this is the biggest problem when integrating with schematized systems like Lakehouse. If you want to build OLAP systems for low-latency complex queries, use Pinot. When to use Fluss vs Apache Pinot?
Thanks to the Netflix internal lineage system (built by Girish Lingappa), Dataflow migration can then help you identify downstream usage of the table in question. This logic consists of the following parts: DDL code, table metadata information, data transformation, and a few audit steps.
The commonly-accepted best practice in database system design for years is to use an exhaustive search strategy to consider all the possible variations of specific database operations in a query plan. Metadata caching: see the performance results below for an example of how metadata caching helps reduce latency.
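Metadata caching of the kind described can be approximated by a small TTL cache placed in front of an expensive catalog lookup; the class and the stand-in lookup below are illustrative, not any particular engine's implementation:

```python
import time

class MetadataCache:
    """Cache catalog lookups for ttl seconds to cut query-planning latency."""

    def __init__(self, fetch, ttl=60.0):
        self.fetch = fetch        # slow function that queries the catalog
        self.ttl = ttl
        self.store = {}           # table -> (metadata, fetched_at)
        self.misses = 0

    def get(self, table):
        entry = self.store.get(table)
        now = time.monotonic()
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]                 # fresh hit: no catalog round trip
        self.misses += 1
        meta = self.fetch(table)            # slow path: hit the catalog
        self.store[table] = (meta, now)
        return meta

def slow_catalog_lookup(table):
    # Stand-in for a round trip to a metastore/catalog service.
    return {"table": table, "columns": ["id", "value"]}

cache = MetadataCache(slow_catalog_lookup, ttl=60.0)
first = cache.get("events")
second = cache.get("events")  # served from cache, no second lookup
```

The trade-off is staleness: a short TTL bounds how long planners can see outdated schemas, while invalidation on DDL events would remove the window entirely at the cost of extra coordination.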
It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile's exposure. In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily.
To represent each Pin, we use the following varied set of text features derived from metadata, the image itself, as well as user-curated data. A diagram of the search relevance system at Pinterest is shown in Figure 3. Figure 3: Diagram of the proposed search relevance system at Pinterest.
On average, engineers spend over half of their time maintaining existing systems rather than developing new solutions. Create a Plan for Integration: Automation tools need to work seamlessly with existing systems to be effective. Instead of driving innovation, data engineers often find themselves bogged down with maintenance tasks.
Even after gaining access, one needed to deal with the challenges of homogeneity across different assets in terms of decoding performance, size, metadata, and general formatting. This feature store is equipped with a data replication system that enables copying data to different storage solutions depending on the required access patterns.
The author emphasizes the importance of mastering state management, understanding "local first" data processing (prioritizing single-node solutions before distributed systems), and leveraging an asset graph approach for data pipelines. I honestly don’t have a solid answer, but this blog is an excellent overview of upskilling.