In this blog, we will delve into an early stage in PAI implementation: data lineage. This took Meta multiple years to complete across our millions of disparate data assets, and we'll cover each of these more deeply in future blog posts: Inventorying involves collecting various code and data assets (e.g.,
Understanding DataSchema requires grasping schematization, which defines the logical structure and relationships of data assets, specifying field names, types, metadata, and policies. It creates a canonical representation for compliance tools and enables an accurate understanding of data, allowing privacy safeguards to be applied at scale.
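To make the idea of a canonical, policy-annotated schema concrete, here is a minimal sketch in plain Python. The class and field names are hypothetical illustrations, not Meta's actual DataSchema API: each field carries a name, a type, and a privacy-policy label so compliance tooling can reason about data uniformly.

```python
from dataclasses import dataclass

# Hypothetical sketch of a canonical schema representation. Names are
# illustrative only; the real DataSchema system is far richer.

@dataclass
class SchemaField:
    name: str
    dtype: str
    policy: str = "none"  # e.g. "none", "user_id", "location"

@dataclass
class DataSchema:
    asset: str
    fields: list

    def fields_with_policy(self, policy):
        """Return the names of fields tagged with a given privacy policy."""
        return [f.name for f in self.fields if f.policy == policy]

schema = DataSchema(
    asset="events.page_view",
    fields=[
        SchemaField("event_id", "string"),
        SchemaField("user_id", "int64", policy="user_id"),
        SchemaField("ip", "string", policy="location"),
    ],
)
print(schema.fields_with_policy("user_id"))  # ['user_id']
```

A safeguard (say, retention or access control) can then be applied to every field carrying a given policy label, which is what makes enforcement scale across millions of assets.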
Open data is the future. And for that future to be a reality, data teams must shift their attention to metadata, the new turf war for data. The need for unified metadata: while open and distributed architectures offer many benefits, they come with their own set of challenges, and data teams need to unify the metadata.
The blog highlights how moving from a 6-character base-64 prefix to a 20-digit base-2 prefix for file distribution spreads load more evenly across S3 and reduces request failures. The blog is a good summary of how to use Snowflake's QUERY_TAG to measure and monitor query performance. The blog post made me curious to understand DataFusion's internals.
To remove this bottleneck, we built AvroTensorDataset, a TensorFlow dataset for reading, parsing, and processing Avro data. Today, we’re excited to open source this tool so that other Avro and TensorFlow users can use this dataset in their machine learning pipelines to get a large performance boost in their training workloads.
In this blog, we will go through the technical design and share some offline and online results for our LLM-based search relevance pipeline. Pin Text Representations: Pins on Pinterest are rich multimedia entities that feature images, videos, and other content, often linked to external webpages or blogs.
This blog post expands on that insightful conversation, offering a critical look at Iceberg's potential and the hurdles organizations face when adopting it. In the mid-2000s, Hadoop emerged as a groundbreaking solution for processing massive datasets. If not handled correctly, managing this metadata can become a bottleneck.
Announcing DataOps Data Quality TestGen 3.0: Now With Actionable, Automatic Data Quality Dashboards. Imagine a tool that can point at any dataset, learn from your data, screen for typical data quality issues, and then automatically generate and run powerful tests, analyzing and scoring your data to pinpoint issues before they snowball.
The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. In this blog, we will discuss: What is the Open Table Format (OTF)? Why should we use it? These formats are transforming how organizations manage large datasets.
In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily. Architecture Overview: The first pivotal step in managing impressions begins with the creation of a Source-of-Truth (SOT) dataset.
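A Source-of-Truth dataset is, at its core, a cleaned and deduplicated view of raw events. The following is an illustrative stdlib sketch (not Pinterest's actual pipeline) of that first step: collapsing duplicate deliveries of the same impression so downstream consumers count each one exactly once.

```python
# Illustrative sketch of deriving a Source-of-Truth (SOT) impression dataset:
# deduplicate by event id, keeping the earliest occurrence of each impression.
# Event shapes and field names here are made up for the example.

raw_events = [
    {"event_id": "a1", "pin_id": 7, "ts": 1700000005},
    {"event_id": "a1", "pin_id": 7, "ts": 1700000009},  # duplicate delivery
    {"event_id": "b2", "pin_id": 9, "ts": 1700000001},
]

def build_sot(events):
    """Deduplicate raw impressions into a Source-of-Truth dataset."""
    sot = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        sot.setdefault(e["event_id"], e)  # first (earliest) event wins
    return list(sot.values())

print([e["event_id"] for e in build_sot(raw_events)])  # ['b2', 'a1']
```

At billions of impressions per day the real system would do this with a distributed engine rather than an in-memory dict, but the dedup-by-id semantics are the same.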
In the previous blog post in this series, we walked through the steps for leveraging Deep Learning in your Cloudera Machine Learning (CML) projects. To try to predict this, an extensive dataset, including anonymised details on each loanee and their historical credit history, is included. Get the Dataset.
In this blog post, we will ingest a real world dataset into Ozone, create a Hive table on top of it and analyze the data to study the correlation between new vaccinations and new cases per country using a Spark ML Jupyter notebook in CML. Learn more about the impacts of global data sharing in this blog, The Ethics of Data Exchange.
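The correlation study described above runs in a Spark ML Jupyter notebook; as a toy illustration of the same idea, here is a plain-Python Pearson correlation between new vaccinations and new cases, using made-up numbers (the real post works on a real-world dataset in Ozone).

```python
# Toy illustration of the vaccination/case correlation analysis.
# The actual post uses Spark ML in CML; numbers below are invented.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vary = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (varx * vary)

new_vaccinations = [100, 200, 300, 400]
new_cases = [80, 60, 40, 20]
print(round(pearson(new_vaccinations, new_cases), 2))  # -1.0
```

A coefficient near -1 would suggest cases fall as vaccinations rise; real data is, of course, much noisier than this perfectly linear example.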
The Grab blog delights me since I have tried to do this many times. Kudos to the Grab team for building a docs-as-code system. A cross-encoder teacher model, fine-tuned on human-labeled data and enriched Pin metadata, was distilled into a lightweight student model using semi-supervised learning over billions of impressions.
The data architecture layer is one such area where growing datasets have pushed the limits of scalability and performance. The data explosion has to be met with new solutions, which is why we are excited to introduce the next-generation table format for large-scale analytic datasets within Cloudera Data Platform (CDP): Apache Iceberg.
Summary: The majority of blog posts and presentations about data engineering and analytics assume that the consumers of those efforts are internal business users accessing an environment controlled by the business. Atlan is the metadata hub for your data ecosystem.
In this blog post, we’ll explore key strategies that data teams should adopt to prepare for the year ahead. Embrace Version Control for Data and Code: Just as software developers use version control for code, DataOps involves tracking versions of datasets and data transformation scripts.
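One concrete way to track versions of datasets, in the spirit of the DataOps practice above, is to derive a deterministic content hash for the data a transformation ran against. This is an illustrative stdlib sketch, not a specific tool's API:

```python
import hashlib
import json

# Illustrative sketch of "version control for data": a stable content hash
# lets a pipeline record exactly which version of a dataset it consumed.
# Function and field names are hypothetical.

def dataset_version(rows):
    """Return a short deterministic hash of a dataset's contents."""
    canonical = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

v1 = dataset_version([{"id": 1, "amount": 10}])
v2 = dataset_version([{"id": 1, "amount": 12}])  # data changed -> new version
print(v1 != v2)  # True
```

Pairing such data versions with the Git revision of the transformation script gives reproducible lineage: the same (code version, data version) pair should always yield the same output.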
This is part of our series of blog posts on recent enhancements to Impala. Apache Impala is synonymous with high-performance processing of extremely large datasets, but what if our data isn’t huge? So clearly, Impala is used extensively with datasets both small and large. Metadata Caching: more on this below.
In this blog post, we’ll explore what CDC is, why it’s important, and our journey of implementing a generic CDC solution for all online databases at Pinterest. What is Change Data Capture? Kafka: Kafka stores metadata about connectors in several internal topics that are not exposed to end users.
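The essence of Change Data Capture is that a stream of row-level change events, replayed in order, reconstructs the state of the source table. Here is a minimal sketch of those semantics in plain Python; it is illustrative only, not Pinterest's implementation (which rides on Kafka connectors):

```python
# Minimal sketch of CDC semantics: replay insert/update/delete events
# against a local replica to keep it in sync with the source database.
# Event shape is invented for illustration.

table = {}

def apply_change(event):
    """Apply one CDC event ('insert', 'update', or 'delete') to the replica."""
    op, key, row = event["op"], event["key"], event.get("row")
    if op in ("insert", "update"):
        table[key] = row
    elif op == "delete":
        table.pop(key, None)

for ev in [
    {"op": "insert", "key": 1, "row": {"name": "a"}},
    {"op": "update", "key": 1, "row": {"name": "b"}},
    {"op": "delete", "key": 1},
]:
    apply_change(ev)

print(table)  # {}
```

Because replay is deterministic, any downstream consumer that starts from the same snapshot and applies the same ordered event log converges to the same table state.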
mock — Generate or validate mock datasets. -v, --verbose — Enables verbose mode. version — Show the version and exit. All of the above commands are very likely to be described in separate future blog posts, but right now let’s focus on the dataflow sample command.
In this blog, we’ll highlight the key CDP aspects that provide data governance and lineage and show how they can be extended to incorporate metadata for non-CDP systems from across the enterprise. Extending Atlas’ metadata model: the example 1_typedef-server.json describes the server typedef used in this blog.
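As a rough idea of what a custom entity typedef like the blog's 1_typedef-server.json contains, here is a hedged sketch rendered as a Python dict. The attribute names and values below are invented for illustration; consult the actual file in the blog for the real definition. Atlas typedefs generally declare an entity name, its supertypes, and a list of attribute definitions:

```python
import json

# Hedged sketch of a custom Apache Atlas "server" entity typedef.
# Attribute names and optionality flags are illustrative, not the
# blog's actual 1_typedef-server.json contents.

server_typedef = {
    "entityDefs": [
        {
            "name": "server",
            "superTypes": ["Referenceable"],
            "attributeDefs": [
                {"name": "hostname", "typeName": "string", "isOptional": False},
                {"name": "rack_id", "typeName": "string", "isOptional": True},
            ],
        }
    ]
}

print(json.dumps(server_typedef, indent=2)[:40])  # preview of the JSON payload
```

Posting such a JSON document to Atlas's typedefs endpoint registers the new entity type, after which non-CDP assets like physical servers can be catalogued and linked into lineage graphs alongside CDP-native entities.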
How to analyze dataset performance and schema changes in Databand — Eric Jones, 2022-09-12. “Why did my dataset schema change?” Yeah, we hear this question a lot too. Databand helps fix this problem by capturing the metadata from your datasets and then alerting you when dataset operations change unexpectedly.
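Detecting an unexpected schema change boils down to diffing the column set of the latest dataset snapshot against the previous one. This stdlib sketch shows the idea; it is illustrative and not Databand's actual API:

```python
# Illustrative sketch of schema-change detection: compare two schema
# snapshots and report columns added or removed. Column names invented.

def schema_diff(old, new):
    """Return columns added and removed between two schema snapshots."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
    }

old = {"id": "int", "email": "str", "created": "ts"}
new = {"id": "int", "created": "ts", "country": "str"}
print(schema_diff(old, new))  # {'added': ['country'], 'removed': ['email']}
```

A monitoring tool would capture such snapshots on every pipeline run and fire an alert whenever the diff is non-empty, surfacing the change before downstream jobs fail on the missing column.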
The blog highlights the advantages of GNNs over traditional machine learning models, which struggle to discern relationships between various entities, such as users and restaurants, and edges, such as orders. The author highlights Paimon’s consistency model by examining the metadata model.
Change Management: Given that useful datasets become widely used and derived in ways that result in large and complex directed acyclic graphs (DAGs) of dependencies, altering logic or source data tends to break and/or invalidate downstream constructs. Upstream changes will inevitably break and invalidate downstream entities in intricate ways.
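The first step in managing such changes is simply knowing the blast radius: everything reachable downstream of the entity you are about to alter. A small stdlib sketch of that traversal, with made-up dataset names:

```python
from collections import deque

# Sketch of change-impact analysis on a dependency DAG: edges point from an
# upstream dataset to the datasets derived from it. Names are illustrative.

deps = {
    "raw_events": ["sessions", "clicks"],
    "sessions": ["daily_active_users"],
    "clicks": ["ctr_report"],
    "daily_active_users": [],
    "ctr_report": [],
}

def downstream(root):
    """BFS over the DAG to collect everything affected by a change to root."""
    seen, queue = set(), deque([root])
    while queue:
        for child in deps.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

print(downstream("raw_events"))
# ['clicks', 'ctr_report', 'daily_active_users', 'sessions']
```

In practice this graph is extracted from lineage metadata, and the traversal result drives decisions like which downstream tables to backfill or which owners to notify before the change ships.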
In this blog post, we compare Cloudera Data Warehouse (CDW) on Cloudera Data Platform (CDP) using Apache Hive-LLAP to Microsoft HDInsight (also powered by Apache Hive-LLAP) on Azure using the TPC-DS 2.9 benchmark. A TPC-DS 10TB dataset was generated in ACID ORC format and stored on ADLS Gen 2 cloud storage.
This is part 2 in this blog series. This blog series follows the manufacturing, operations and sales data for a connected vehicle manufacturer as the data goes through stages and transformations typically experienced in a large manufacturing company on the leading edge of current technology.
In a previous blog post on CDW performance, we compared Azure HDInsight to CDW. In this blog post, we compare Cloudera Data Warehouse (CDW) on Cloudera Data Platform (CDP) using Apache Hive-LLAP to EMR 6.0 (also powered by Apache Hive-LLAP) on Amazon using the TPC-DS 2.9 benchmark. More on this later in the blog.
In previous blog posts, we introduced the Key-Value Data Abstraction Layer and the Data Gateway Platform, both of which are integral to Netflix’s data architecture. Configurability: TimeSeries offers a range of tunable options for each dataset, providing the flexibility needed to accommodate a wide array of use cases.
Catalog Integration: Our newly developed Catalog Integration feature allows you to seamlessly plug Snowflake into other Iceberg catalogs tracking table metadata. In this blog post, we’ll dive into the details of these features and the benefits for customers. In addition to Iceberg External Tables, we introduced Native Iceberg Tables.
In Part 2 of our blog series, we described how we were able to integrate Ray(™) into our existing ML infrastructure. In this blog post, we will discuss a second type of popular application of Ray(™) at Pinterest: offline batch inference of ML models. Dataset execution is pipelined so that multiple execution stages can run in parallel.
Overview: This blog post describes support for materialized views for the Iceberg table format. Apache Iceberg is a high-performance open table format for petabyte-scale analytic datasets. The snapshotIds of the source tables involved in the materialized view are also maintained in the metadata.
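Recording source-table snapshotIds is what makes staleness detection possible: at query time the engine can compare the ids stored in the view's metadata with each source table's current snapshot. The sketch below models that check in plain Python; it is a simplified illustration, not Iceberg's actual metadata layout:

```python
# Simplified model of materialized-view staleness checking: a view records
# the snapshotId of each source table at refresh time. Names are invented.

view_metadata = {"sales_summary": {"sales": 101, "customers": 55}}
current_snapshots = {"sales": 103, "customers": 55}  # 'sales' has new commits

def is_stale(view):
    """A view is stale if any source table advanced past the recorded id."""
    recorded = view_metadata[view]
    return any(current_snapshots[t] != s for t, s in recorded.items())

print(is_stale("sales_summary"))  # True
```

A stale view can then either be transparently rewritten against the base tables or served as-is under a configured freshness tolerance, depending on engine policy.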
In this blog post, we will introduce speech and music detection as an enabling technology for a variety of audio applications in Film & TV, as well as introduce our speech and music activity detection (SMAD) system which we recently published as a journal article in EURASIP Journal on Audio, Speech, and Music Processing.
message Item ( Bytes key, Bytes value, Metadata metadata, Integer chunk ) Database Agnostic Abstraction: The KV abstraction is designed to hide the implementation details of the underlying database, offering a consistent interface to application developers regardless of the optimal storage system for that use case.
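To make the Item record's shape concrete, here is a rough Python rendering. The field semantics, in particular using chunk as an index for large values split across rows, are inferred for illustration and are not Netflix's specification:

```python
from dataclasses import dataclass
from typing import Optional

# Rough Python rendering of the Item record above. Semantics inferred for
# illustration: 'chunk' indexes the pieces of a large value split into rows.

@dataclass
class Item:
    key: bytes
    value: bytes
    metadata: dict
    chunk: Optional[int] = None  # set when a large value is chunked

def chunk_value(key, value, chunk_size):
    """Split a large value into fixed-size Item chunks."""
    total = -(-len(value) // chunk_size)  # ceil division
    return [
        Item(key, value[i : i + chunk_size], {"total_chunks": total}, chunk=n)
        for n, i in enumerate(range(0, len(value), chunk_size))
    ]

parts = chunk_value(b"k", b"abcdefgh", 3)
print([p.value for p in parts])  # [b'abc', b'def', b'gh']
```

Reassembly is the inverse: read all chunks for a key, order by the chunk index, and concatenate the values, which is exactly the kind of detail a database-agnostic abstraction hides from application developers.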
For example, writing a Spark dataset to Ozone or launching a DDL query in Hive that points to a location in Ozone. I’ve chosen those names because I’ll be using an easy method for generating and writing TPC-DS datasets, along with creating their corresponding Hive tables. Create a dataset from the customer table. With CDP 7.1.4
Sophisticated data practitioners and business analysts want access to new datasets that can help optimize their work and transform whole business functions. Traditionally, it all starts with onboarding and transforming the datasets and then building analytical models that create business value, which can take weeks or months.
Fast News ⚡️ End-to-end data lineage in AWS — AWS announced DataZone to bring lineage to your data assets; from the picture, it can mix datasets (?). It provides abstractions and tools for the translation of lakehouse table format metadata. I'm not sure I'm happy to see this on the Atlassian blog.
Iceberg is a next-generation, cloud-native table format designed to be open and scalable to petabyte datasets. With innovations like hidden partitioning and metadata stored at the file level, Iceberg makes querying on very large data sets faster, while also making changes to data easier and safer.
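The key idea behind hidden partitioning is that the table applies a transform such as day(ts) to a column itself, so queries filter on the raw timestamp and never reference partition columns directly. A simplified stdlib model of that transform (not Iceberg's implementation):

```python
from datetime import datetime, timezone

# Simplified model of Iceberg's day(ts) partition transform: each event
# timestamp maps to a partition value without the writer or reader ever
# managing partition columns by hand. Timestamps are example values.

def day_transform(ts):
    """Map an epoch-seconds timestamp to its day partition, like day(ts)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()

rows = [1700000000, 1700086400, 1700000500]
partitions = {day_transform(t) for t in rows}
print(sorted(partitions))  # ['2023-11-14', '2023-11-15']
```

Because the transform is part of table metadata, a query like `WHERE ts >= '2023-11-15'` can be pruned to matching partitions automatically, and the partition scheme can even evolve later without rewriting old data.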
Our cutting-edge Shared Data Experience (SDX) service provides a unified control plane for common security, governance and metadata management on all structured and unstructured data. Unlike software, ML models need continuous tuning. The post Cloudera Named a Visionary in the Gartner MQ for Cloud DBMS appeared first on Cloudera Blog.
It is a replicated, highly-available service that is responsible for managing the metadata for all objects stored in Ozone. In this blog post, we will highlight the work done recently to improve the performance of Ozone Manager to scale to exabytes of data. The hardware specifications are included at the end of this blog.
This alleviates the need to use different connectors, exotic and poorly maintained APIs, and other use-case-specific workarounds to work with your datasets. Iceberg is designed to be open and engine agnostic, allowing datasets to be shared. 3: Open Performance.
Read this blog to understand everything about AWS Glue that makes it one of the most popular data integration solutions in the industry. Then, Glue writes the job's metadata into the embedded AWS Glue Data Catalog. For analyzing huge datasets, they want to employ familiar Python primitive types. billion by 2026?
In the rest of this blog, we will a) touch on the complexity of Netflix cloud landscape, b) discuss lineage design goals, ingestion architecture and the corresponding data model, c) share the challenges we faced and the learnings we picked up along the way, and d) close it out with “what’s next” on this journey. push or pull.
Once a dataset has been located, how does Amundsen simplify the process of accessing that data for analysis or further processing? Links: Amundsen, Data Council Presentation, Strata Presentation, Blog Post, Lyft, Airflow, Podcast.__init__
The platform converges data cataloging, data ingestion, data profiling, data tagging, data discovery, and data exploration into a unified platform, driven by metadata. Modak Nabu reliably curates datasets for any line of business and personas, from business analysts to data scientists. Knowledge Graphs for the Business.