It leverages knowledge graphs to keep track of all the data sources and data flows, using AI to fill the gaps so you have the most comprehensive metadata management solution. Together, Cloudera and Octopai will help reinvent how customers manage their metadata and track lineage across all their data sources.
Summary: Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. In order to level up its value, a new trend of active metadata is being implemented, enabling use cases like keeping BI reports up to date, auto-scaling your warehouses, and automated data governance.
As an important part of achieving better scalability, Ozone separates metadata management among different services: the Ozone Manager (OM) service manages the metadata of the namespace, such as volumes, buckets, and keys, while the Datanode service manages the metadata of blocks, containers, and pipelines running on the datanode.
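A toy sketch of this split (illustrative only, not Ozone's actual code): namespace metadata lives with the OM, while block/container metadata lives with each datanode.

```python
# Toy illustration of Ozone's metadata separation (not actual Ozone code):
# the OM tracks namespace objects, each datanode tracks its own blocks.

class OzoneManager:
    """Holds namespace metadata: volumes, buckets, keys."""
    def __init__(self):
        self.namespace = {}  # volume -> bucket -> key -> key metadata

    def put_key(self, volume, bucket, key, block_ids):
        self.namespace.setdefault(volume, {}).setdefault(bucket, {})[key] = {
            "blocks": block_ids
        }

class Datanode:
    """Holds block/container metadata local to this datanode."""
    def __init__(self):
        self.containers = {}  # container id -> list of block ids

    def add_block(self, container_id, block_id):
        self.containers.setdefault(container_id, []).append(block_id)

om = OzoneManager()
dn = Datanode()
dn.add_block(container_id=1, block_id="blk-42")
om.put_key("vol1", "bucket1", "key1", block_ids=["blk-42"])
```

The point of the separation is that the OM never needs to know which datanode holds a block's bytes, and a datanode never needs the volume/bucket hierarchy.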
In this blog, we will delve into an early stage in PAI implementation: data lineage. This took Meta multiple years to complete across our millions of disparate data assets, and we'll cover each of these more deeply in future blog posts: Inventorying involves collecting various code and data assets (e.g.,
Open data is the future. And for that future to be a reality, data teams must shift their attention to metadata, the new turf war for data. The need for unified metadata: while open and distributed architectures offer many benefits, they come with their own set of challenges, and data teams need to unify the metadata.
This will allow a data office to implement access policies over metadata management assets like tags or classifications, business glossaries, and data catalog entities, laying the foundation for comprehensive data access control. First, a set of initial metadata objects is created by the data steward.
The blog highlights how moving from 6-character base-64 to 20-digit base-2 file distribution brings more distribution in S3 and reduces request failures. The blog is a good summary of how to use Snowflake QUERY_TAG to measure and monitor query performance. The blog post made me curious to understand DataFusion's internals.
In August, we wrote about how in a future where distributed data architectures are inevitable, unifying and managing operational and business metadata is critical to successfully maximizing the value of data, analytics, and AI. It is a critical feature for delivering unified access to data in distributed, multi-engine architectures.
Managing metadata has become crucial to any organization’s data strategy in today’s data-driven world. This is where metadata management tools come into play. Nowadays, businesses face the challenge of effectively managing their growing and complex data volumes.
And specifically, I was reading one of your blog posts recently that talked about the dark ages of data. It could be metadata that you weren't capturing before. Here are some key takeaways from Ray in that conversation. The post The Struggle Between Data Dark Ages and LLM Accuracy appeared first on Cloudera Blog.
To tackle the problem, we attach a piece of model version metadata to each ANN search service host, which contains a mapping from model name to the latest model version. The metadata is generated together with the index. It has the top user coverage and top three save rates.
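The version mapping described above can be sketched as a small lookup attached to each serving host; a minimal, hedged illustration (names like `pin_embed` are invented for the example):

```python
# Hedged sketch of the idea above: each ANN search host carries a metadata
# map from model name to the model version its index was built with, so a
# caller can detect version skew. Model names here are illustrative.

index_metadata = {
    "model_versions": {"pin_embed": 7, "query_embed": 3},
}

def is_compatible(host_metadata, model_name, client_version):
    """True if the client's model version matches the host's index version."""
    return host_metadata["model_versions"].get(model_name) == client_version

ok = is_compatible(index_metadata, "pin_embed", 7)
stale = is_compatible(index_metadata, "pin_embed", 6)
```

Because the metadata is generated together with the index, the mapping can never drift from the embeddings actually stored on the host.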
A Guest Post by Ole Olesen-Bagneux. In this blog post I would like to describe a new data team that I call 'the data discovery team'. In an enterprise data reality, searching for data is a bit of a hassle. And that is what the data discovery team that I propose in this blog post should work on: searching for data.
In this blog, we will go through the technical design and share some offline and online results for our LLM-based search relevance pipeline. Pin Text Representations Pins on Pinterest are rich multimedia entities that feature images, videos, and other contents, often linked to external webpages or blogs.
In this blog, we'll address this challenge by building a metadata-driven solution using a JavaScript stored procedure that dynamically maps and loads only the required columns from multiple CSV files into their respective Snowflake tables. Metadata Proc Step 4: Execute the Stored Procedure.
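The metadata-driven mapping can be sketched in a few lines; this is a Python illustration of the pattern (the blog implements it as a Snowflake JavaScript stored procedure), with an invented mapping table:

```python
# Minimal sketch of metadata-driven CSV loading: a metadata map declares
# which CSV columns feed which target-table columns, and the loader copies
# only those, ignoring everything else. Column names are illustrative.
import csv
import io

COLUMN_MAP = {  # metadata: per file, csv column -> target table column
    "orders.csv": {"order_id": "ID", "amt": "AMOUNT"},
}

def load_csv(filename, text):
    """Return rows projected onto the target columns named in the metadata."""
    mapping = COLUMN_MAP[filename]
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({target: row[src] for src, target in mapping.items()})
    return rows

rows = load_csv("orders.csv", "order_id,amt,extra\n1,9.99,x\n")
# the unmapped "extra" column is dropped
```

Adding a new source file then means adding one metadata entry rather than writing a new loader.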
This blog post expands on that insightful conversation, offering a critical look at Iceberg's potential and the hurdles organizations face when adopting it. This ecosystem includes: Catalogs: Services that manage metadata about Iceberg tables (e.g., If not handled correctly, managing this metadata can become a bottleneck.
Metadata is the information that provides context and meaning to data, ensuring it's easily discoverable, organized, and actionable. This is what managing data without metadata feels like. Effective metadata management is no longer a luxury; it's a necessity.
Below is a diagram describing how I think data platforms are structured: Data storage — you need to store data in an efficient, interoperable manner, from the freshest data to the oldest, along with the metadata. It adds metadata, reads, writes, and transactions that allow you to treat a Parquet file as a table.
This is part of our series of blog posts on recent enhancements to Impala. Metadata Caching. As Impala’s adoption grew the catalog service started to experience these growing pains, therefore recently we introduced two new features to alleviate the stress, On-demand Metadata and Zero Touch Metadata. More on this below.
As we describe in this blog post , the top-k feature uses runtime information — namely, the current contents of the top-k elements — to skip micro-partitions where we can guarantee that they won’t contribute to the overall result. on average, with some queries also reaching up to 99.8% improvement. How does it work?
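The skipping logic can be sketched with per-partition maximums standing in for micro-partition metadata; this is an illustrative toy, not Snowflake's implementation:

```python
# Sketch of top-k pruning: given zone-map-style metadata (each partition's
# max value), a partition can be skipped once its max cannot beat the
# current k-th largest element. Illustrative only.
import heapq

def top_k(partitions, k):
    """partitions: list of (max_value, values); returns (k largest, skip count)."""
    heap = []  # min-heap holding the current top-k candidates
    skipped = 0
    for max_value, values in partitions:
        # runtime pruning: nothing in this partition can enter the top-k
        if len(heap) == k and max_value <= heap[0]:
            skipped += 1
            continue
        for v in values:
            if len(heap) < k:
                heapq.heappush(heap, v)
            elif v > heap[0]:
                heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True), skipped

result, skipped = top_k([(10, [10, 9, 8]), (7, [7, 6]), (3, [3, 2, 1])], k=3)
```

After the first partition fills the top-3 with {8, 9, 10}, both remaining partitions are skipped because their maxima (7 and 3) cannot beat the current minimum of 8.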
With this public preview, those external catalog options are either “GLUE”, where Snowflake can retrieve table metadata snapshots from AWS Glue Data Catalog, or “OBJECT_STORE”, where Snowflake retrieves metadata snapshots directly from the specified cloud storage location. With these three options, which one should you use?
Since the previous stable version ( 0.3.1 ), efforts have been made on three principal fronts: tooling (in particular the language server), the core language semantics (contracts, metadata, and merging), and the surface language (the syntax and the stdlib). The | symbol attaches metadata to fields.
The blog emphasizes the importance of starting with a clear client focus to avoid over-engineering and ensure user-centric development. link] Gunnar Morling: Revisiting the Outbox Pattern The blog is an excellent summary of the path we crossed with the outbox pattern and the challenges ahead.
Summary: The majority of blog posts and presentations about data engineering and analytics assume that the consumers of those efforts are internal business users accessing an environment controlled by the business. Atlan is the metadata hub for your data ecosystem.
In this blog, I will demonstrate the value of Cloudera DataFlow (CDF), the edge-to-cloud streaming data platform available on the Cloudera Data Platform (CDP), as a data integration and democratization fabric. Data and Metadata: data inputs and data outputs produced based on the application logic.
In this blog, we will discuss: What is the Open Table format (OTF)? Why should we use it? Then, we add another column called HASHKEY, add more data, and locate the S3 file containing metadata for the Iceberg table. In the screenshot below, we can see that the metadata file for the Iceberg table retains the snapshot history.
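The snapshot history lives in the table's metadata JSON; a hedged toy parser over a simplified document shaped like Iceberg's (fields reduced to the two relevant ones):

```python
# Toy sketch: Iceberg table metadata is a JSON document that records, among
# other things, the snapshot history and the current snapshot. This parses
# a simplified version of that shape; real metadata files carry many more
# fields (schema, partition spec, manifests, etc.).
import json

metadata_json = json.dumps({
    "current-snapshot-id": 2,
    "snapshots": [
        {"snapshot-id": 1, "timestamp-ms": 1700000000000},
        {"snapshot-id": 2, "timestamp-ms": 1700000100000},
    ],
})

def snapshot_history(doc):
    """Return (list of snapshot ids, current snapshot id) from metadata JSON."""
    meta = json.loads(doc)
    ids = [s["snapshot-id"] for s in meta["snapshots"]]
    return ids, meta["current-snapshot-id"]

history, current = snapshot_history(metadata_json)
```

Retaining the full snapshot list is what enables time travel: older snapshot ids remain addressable until they are expired.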
In our previous blogs, we explored how Picnic's Page Platform transformed the way we build new features, enabling faster iteration, tighter collaboration, and less feature-specific complexity. Now, we're bringing this same principle to our app. In this blog, we'll dive into how we configure the pages within Picnic's store.
All the above commands are very likely to be described in separate future blog posts, but right now let’s focus on the dataflow sample command. This logic consists of the following parts: DDL code, table metadata information, data transformation and a few audit steps.
This blog is a collection of those insights, but for the full trendbook, we recommend downloading the PDF. VP of Architecture, Healthcare Industry: Organizations will focus more on metadata tagging of existing and new content in the coming years. The technology for metadata management, data quality management, etc.,
In this blog post, we’ll explore key strategies that data teams should adopt to prepare for the year ahead. Establish shared guidelines for data governance, including data quality, metadata management, and access controls. Implement CI/CD practices to ensure continuous delivery of data products.
It supports “fuzzy” search — the service takes in natural language queries and returns the most relevant text results, along with associated metadata. More details on the research behind the Cortex Search retrieval stack will be shared on our Snowflake Engineering Blog.
In this blog post, we will ingest a real world dataset into Ozone, create a Hive table on top of it and analyze the data to study the correlation between new vaccinations and new cases per country using a Spark ML Jupyter notebook in CML. Learn more about the impacts of global data sharing in this blog, The Ethics of Data Exchange.
It covers nine categories: storage systems, data lake platforms, processing, integration, orchestration, infrastructure, ML/AI, metadata management, and analytics. I found the blog to be a comprehensive roadmap for data engineering in 2025.
We expand on this feature later in this blog. The Atlas/Kafka integration provides metadata collection for Kafka producers/consumers so that consumers can manage, govern, and monitor Kafka metadata and metadata lineage in the Atlas UI. Figure 8: Data lineage based on Kafka Atlas Hook metadata.
The blog explains KIP-932 and its potential benefits. [link] Picnic: Open-sourcing dbt-score: lint model metadata with ease! The more metadata there is, the more readable the model. It is often challenging, as developers are not incentivized to produce quality metadata.
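Metadata linting of this kind can be sketched as a completeness score per model; this toy is only in the spirit of dbt-score, not its actual rule set:

```python
# Toy sketch of model-metadata linting (in the spirit of dbt-score, not its
# real rules): score a model by how complete its metadata is, so missing
# descriptions or owners surface as a low score instead of going unnoticed.

def score_model(model):
    """Return a 0..1 completeness score over a few illustrative checks."""
    checks = [
        bool(model.get("description")),                              # model documented?
        bool(model.get("owner")),                                    # ownership assigned?
        all(c.get("description") for c in model.get("columns", [])),  # columns documented?
    ]
    return sum(checks) / len(checks)

complete = {"description": "orders fact table", "owner": "team-x",
            "columns": [{"name": "id", "description": "primary key"}]}
bare = {"columns": [{"name": "id"}]}
```

Turning metadata quality into a number is what lets it be enforced in CI, which addresses the incentive problem the snippet mentions.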
[link] Georg Heiler: Upskilling data engineers. What should I prefer for 2028, or how can I break into data engineering? These are common LinkedIn requests. I honestly don't have a solid answer, but this blog is an excellent overview of upskilling.
In this blog post, we’ll discuss the methods we used to ensure a successful launch, including: How we tested the system Netflix technologies involved Best practices we developed Realistic Test Traffic Netflix traffic ebbs and flows throughout the day in a sinusoidal pattern. Basic with ads was launched worldwide on November 3rd.
Contact Info LinkedIn Blog Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? TimeXtender is a holistic, metadata-driven solution for data integration, optimized for agility. When is a data mesh the wrong choice?
Companies such as Adobe , Expedia , LinkedIn , Tencent , and Netflix have published blogs about their Apache Iceberg adoption for processing their large scale analytics datasets. . In CDP we enable Iceberg tables side-by-side with the Hive table types, both of which are part of our SDX metadata and security framework.
In this three-part blog post series, we introduce you to Psyberg , our incremental data processing framework designed to tackle such challenges! It leverages Iceberg metadata to facilitate processing incremental and batch-based data pipelines. Given our role on this critical path, accuracy is paramount. Psyberg: The Game Changer!
Better Metadata Management Add Descriptions and Data Product tags to tables and columns in the Data Catalog for improved governance. Smarter Profiling & Test Generation Improved logic reduces false positives , making test results more accurate and actionable. DataOps just got more intelligent.
Fueled by businesses' demand for democratized, self-service data, these architectures rely on effective metadata management and data governance for success. It follows that 25% of respondents identify a data catalog as a top priority for 2024. And, stay tuned for the next 2025 Planning Insights blog to explore more report highlights!
The Grab blog delights me since I have tried to do this many times. A cross-encoder teacher model, fine-tuned on human-labeled data and enriched Pin metadata, was distilled into a lightweight student model using semi-supervised learning over billions of impressions. Kudos to the Grab team for building a docs-as-code system.
Observability for your most secure data For your most sensitive, protected data, we understand even the metadata and telemetry about your workloads must be kept under close watch, and it must stay within your secured environment. The post Introducing Cloudera Observability Premium appeared first on Cloudera Blog.
By Abhinaya Shetty, Bharath Mummadisetty. In the inaugural blog post of this series, we introduced you to the state of our pipelines before Psyberg and the challenges with incremental processing that led us to create the Psyberg framework within Netflix's Membership and Finance data engineering team.