How Meta discovers data flows via lineage at scale

Engineering at Meta

In this blog, we will delve into an early stage in PAI implementation: data lineage. This took Meta multiple years to complete across our millions of disparate data assets, and we'll cover each of these more deeply in future blog posts: Inventorying involves collecting various code and data assets (e.g.,

How Meta understands data at scale

Engineering at Meta

Understanding DataSchema requires grasping schematization, which defines the logical structure and relationships of data assets, specifying field names, types, metadata, and policies. Creating a canonical representation for compliance tools. Accurate understanding of data, enabling the application of privacy safeguards at scale.

The Data Turf Wars are Over, But the Metadata Turf Wars Have Just Begun

Cloudera

And for that future to be a reality, data teams must shift their attention to metadata, the new turf war for data. The need for unified metadata While open and distributed architectures offer many benefits, they come with their own set of challenges. Data teams actually need to unify the metadata. Open data is the future.

Data Engineering Weekly #198

Data Engineering Weekly

The blog highlights how moving from 6-character base-64 to 20-digit base-2 file distribution brings more distribution in S3 and reduces request failures. The blog is a good summary of how to use Snowflake QUERY_TAG to measure and monitor query performance. The blog post made me curious to understand DataFusion's internals.

Open-Sourcing AvroTensorDataset: A Performant TensorFlow Dataset For Processing Avro Data

LinkedIn Engineering

To remove this bottleneck, we built AvroTensorDataset, a TensorFlow dataset for reading, parsing, and processing Avro data. Today, we’re excited to open source this tool so that other Avro and TensorFlow users can use this dataset in their machine learning pipelines to get a large performance boost to their training workloads.

Improving Pinterest Search Relevance Using Large Language Models

Pinterest Engineering

In this blog, we will go through the technical design and share some offline and online results for our LLM-based search relevance pipeline. Pin Text Representations: Pins on Pinterest are rich multimedia entities that feature images, videos, and other content, often linked to external webpages or blogs.

Is Apache Iceberg the New Hadoop? Navigating the Complexities of Modern Data Lakehouses

Data Engineering Weekly

This blog post expands on that insightful conversation, offering a critical look at Iceberg's potential and the hurdles organizations face when adopting it. In the mid-2000s, Hadoop emerged as a groundbreaking solution for processing massive datasets. If not handled correctly, managing this metadata can become a bottleneck.
