Batch data processing, historically known as ETL, is extremely challenging: it's time-consuming, brittle, and often unrewarding. In this post, we'll explore how applying the functional programming paradigm to data engineering can bring a lot of clarity to the process.
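A minimal sketch of the functional idea, assuming hypothetical names throughout (build_daily_summary, read_partition, overwrite_partition): the transform is a pure function of its inputs, and the task overwrites a deterministic partition, so re-running it for the same date is safe.

```python
from datetime import date

def build_daily_summary(events: list[dict], ds: date) -> list[dict]:
    """Pure transform: the same inputs always yield the same output."""
    daily = [e for e in events if e["event_date"] == ds.isoformat()]
    totals: dict[str, int] = {}
    for e in daily:
        totals[e["user_id"]] = totals.get(e["user_id"], 0) + 1
    return [{"ds": ds.isoformat(), "user_id": u, "events": n}
            for u, n in sorted(totals.items())]

def run(ds: date, read_partition, overwrite_partition) -> None:
    # Overwriting the target partition (rather than appending) is what
    # makes the task idempotent and safe to re-run for the same date.
    rows = build_daily_summary(read_partition("events", ds), ds)
    overwrite_partition("daily_summary", ds, rows)
```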
Understanding the nature of the late-arriving data and the processing requirements will help you decide which pattern is most appropriate for a use case. Stateful Data Processing: this pattern is useful when the output depends on a sequence of events across one or more input streams, as in the sketch below.
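A toy illustration of the stateful pattern, assuming a single keyed stream with illustrative field names (account_id, amount): each output depends on state accumulated from all prior events, not on the current event alone.

```python
from collections import defaultdict
from typing import Iterable, Iterator

def running_totals(events: Iterable[dict]) -> Iterator[dict]:
    state: dict[str, float] = defaultdict(float)  # per-key accumulated state
    for e in events:
        state[e["account_id"]] += e["amount"]
        # Emit an output that could not be computed from this event alone.
        yield {"account_id": e["account_id"],
               "balance": state[e["account_id"]]}
```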
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Missing data? Stale dashboards?
This is where multimodal analysis unlocks its true potential by combining traditional structured data with these rich visual insights, creating a more comprehensive business understanding. In manufacturing, facilities are able to prevent costly defects by linking visual inspection data with production specifications.
Metadata is the information that provides context and meaning to data, ensuring it's easily discoverable, organized, and actionable. It enhances data quality, governance, and automation, transforming raw data into valuable insights. Managing data without it? Chaos.
Data consistency, feature reliability, processing scalability, and end-to-end observability are key drivers of business as usual (zero disruptions) and a cohesive customer experience. With our new data processing framework, we observed a multitude of benefits, including 99.9%
During runtime execution, Privacy Probes does the following: Capturing payloads: it captures source and sink payloads in memory on a sampled basis, along with supplementary metadata such as event timestamps, asset identifiers, and stack traces, as evidence for the data flow.
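A hypothetical sketch of what such sampled capture could look like (this is not Meta's actual implementation; the sample rate, function, and field names are all assumptions):

```python
import random
import time
import traceback

SAMPLE_RATE = 0.001  # capture roughly 0.1% of calls

def maybe_capture(asset_id: str, payload: dict, sink: list[dict]) -> None:
    """Record the payload plus evidence metadata for a small sample of calls."""
    if random.random() >= SAMPLE_RATE:
        return
    sink.append({
        "asset_id": asset_id,
        "payload": payload,
        "timestamp": time.time(),
        # Keep a short stack trace as evidence of where the data flowed.
        "stack_trace": "".join(traceback.format_stack(limit=10)),
    })
```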
Examples include "reduce data processing time by 30%" or "minimize manual data entry errors by 50%." It aims to streamline and automate data workflows, enhance collaboration, and improve the agility of data teams. How effective are your current data workflows?
First, we create an Iceberg table in Snowflake and insert some data. Then, we add another column called HASHKEY, add more data, and locate the S3 file containing metadata for the Iceberg table. In the screenshot below, we can see that the metadata file for the Iceberg table retains the snapshot history.
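A sketch of those steps using the Snowflake Python connector; the connection parameters, external volume, table name, and columns (other than HASHKEY) are placeholders:

```python
import snowflake.connector

conn = snowflake.connector.connect(account="...", user="...", password="...")
cur = conn.cursor()

# Create a Snowflake-managed Iceberg table backed by an external volume.
cur.execute("""
    CREATE ICEBERG TABLE demo_iceberg (id INT, name STRING)
      CATALOG = 'SNOWFLAKE'
      EXTERNAL_VOLUME = 'my_s3_volume'
      BASE_LOCATION = 'demo_iceberg/'
""")
cur.execute("INSERT INTO demo_iceberg VALUES (1, 'a'), (2, 'b')")

# Evolve the schema, then add more data; each change writes a new
# metadata file to S3 that retains the snapshot history.
cur.execute("ALTER ICEBERG TABLE demo_iceberg ADD COLUMN hashkey STRING")
cur.execute("INSERT INTO demo_iceberg VALUES (3, 'c', 'h3')")
```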
Obviously, not all tools are made with the same use case in mind, so we are planning to add more code samples for data processing purposes other than classical batch ETL, e.g., machine learning model building and scoring. The main workflow definition file holds the logic of a single run, in this case one day's worth of data.
In this context, managing the data, especially when it arrives late, can present a substantial challenge! To solve these problems, we came up with Psyberg, our incremental data processing framework. In this three-part blog post series, we introduce you to Psyberg and the challenges it is designed to tackle. Let's dive in!
Behind the scenes, Snowpark ML parallelizes data processing operations by taking advantage of Snowflake's scalable computing platform. For Snowpark ML Operations, the Snowpark Model Registry allows customers to securely manage and execute models in Snowflake, regardless of origin.
Also, the associated business metadata for omics, which makes the data findable for later use, is dynamic and complex and needs to be captured separately. Additionally, the need to standardize this metadata makes the data discovery effort challenging for downstream analysis.
In the previous installments of this series, we introduced Psyberg and delved into its core operational modes: stateless and stateful data processing. Pipelines after Psyberg: let's explore how the different modes of Psyberg could help with a multistep data pipeline, for example an audit step that runs various quality checks on the staged data.
Once the batch has been queued for processing, we copy the list of user IDs who have made requests in that batch into a new Hive table. For each data logs table, we initiate a new worker task that fetches the relevant metadata describing how to correctly query the data.
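A simplified sketch of that fan-out, assuming injected helpers (stage_ids, fetch_table_metadata, query are all illustrative): stage the batch's user IDs once, then run one worker per logs table, each driven by per-table metadata describing how to query it.

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch_user_ids, data_logs_tables,
                  fetch_table_metadata, stage_ids, query):
    # Copy the batch's user IDs into a staging table (e.g., Hive).
    stage_ids("batch_user_ids", batch_user_ids)

    def worker(table: str):
        # Per-table metadata: key columns, partitioning scheme, etc.
        meta = fetch_table_metadata(table)
        return query(table, meta, id_table="batch_user_ids")

    # One worker task per data logs table.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(worker, data_logs_tables))
```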
Application Logic: application logic refers to the type of data processing, and can be anything from analytical or operational systems to data pipelines that ingest data inputs, apply transformations based on some business logic, and produce data outputs.
Fluss is a compelling new project in the realm of real-time data processing. A Fluss cluster consists of two main processes: the CoordinatorServer and the TabletServer. The CoordinatorServer maintains metadata, manages tablet allocation, lists nodes, and handles permissions.
Engineers from across the company came together to share best practices on everything from Data Processing Patterns to Building Reliable Data Pipelines. The result was a series of talks which we are now sharing with the rest of the Data Engineering community!
The fact that ETL tools evolved to expose graphical interfaces seems like a detour in the history of data processing, and would certainly make for an interesting blog post of its own. Sure, there's a need to abstract the complexity of data processing, computation, and storage.
From an algorithmic perspective, we implemented a way to process partitions in a smart order, which further reduces the number of I/Os. Before Snowflake starts executing the query, we look at the metadata of the partitions to determine whether the contents of a given partition are likely to end up in the final result.
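A toy illustration of metadata-based pruning and ordering (not Snowflake's internals; the min/max statistics fields and the overlap heuristic are assumptions): consult per-partition statistics before reading anything, skip partitions that cannot match, and visit the most promising survivors first.

```python
def plan_partitions(partitions: list[dict], lo: int, hi: int) -> list[dict]:
    """Prune and order partitions for a range predicate lo <= x <= hi."""
    # Skip partitions whose [min, max] range cannot intersect the predicate.
    candidates = [p for p in partitions if p["max"] >= lo and p["min"] <= hi]

    # Process the partitions most likely to contribute results first.
    def overlap(p: dict) -> int:
        return min(p["max"], hi) - max(p["min"], lo)

    return sorted(candidates, key=overlap, reverse=True)
```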
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. RudderStack helps you build a customer data platform on your warehouse or data lake.
[link] Gradient Flow: Paradigm Shifts in Data Processing for the Generative AI Era. Data processing pipelines haven't kept pace with the rapid advancement of AI models. The article highlights the growing importance of preprocessing data pipelines, but current pipeline processing techniques do not match the demand.
CDC allows applications to respond to these changes in real time, making it an essential component for data integration, replication, and synchronization. Real-Time Data Processing: CDC enables real-time data processing by capturing changes as they happen. Why is CDC important?
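A minimal sketch of consuming CDC events, assuming Debezium-style change records with "op" and "before"/"after" row images; the replica here is just a dict keyed by primary key:

```python
def apply_change(replica: dict, event: dict) -> None:
    """Apply one change event to an in-memory replica of the table."""
    op = event["op"]                     # "c"=create, "u"=update, "d"=delete
    if op in ("c", "u"):
        row = event["after"]
        replica[row["id"]] = row         # upsert the new row image
    elif op == "d":
        replica.pop(event["before"]["id"], None)
```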
Metadata and evolution support: We've added structured-type schema evolution for flexibility as source systems or business reporting needs change. Get better Iceberg ecosystem interoperability with Primary Key information added to Iceberg table metadata.
The author emphasizes the importance of mastering state management, understanding "local first" data processing (prioritizing single-node solutions before distributed systems), and leveraging an asset graph approach for data pipelines.
Filtering & Enriching Raw Impressions: once the raw impression events are collected and queued, a stateless Apache Flink job takes charge, meticulously processing this data. This refined output is then structured using an Avro schema, establishing a definitive source of truth for Netflix's impression data.
We all know that data freshness plays a critical role in Lakehouse performance. If we can place the metadata, indexes, and recent data files in Express One, we can potentially build a Snowflake-style performant architecture on the Lakehouse. Apache Hudi, for example, introduces an indexing technique to the Lakehouse.
It allows data scientists to analyze large datasets and interactively run jobs on them from the R shell. Big data processing: when transformations are applied to RDDs, Spark records the metadata to build up a DAG, which reflects the sequence of computations performed during the execution of the Spark job.
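A short PySpark example of that lazy DAG behavior: transformations only record lineage metadata, and nothing executes until an action runs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(range(1_000_000))

squares = rdd.map(lambda x: x * x)            # transformation: recorded, not run
evens = squares.filter(lambda x: x % 2 == 0)  # another edge in the DAG

print(evens.toDebugString().decode())  # show the recorded lineage
print(evens.count())                   # action: triggers actual execution
```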
Learn more about the impacts of global data sharing in this blog, The Ethics of Data Exchange. Before we jump into the data ingestion step, here is a quick overview of how Ozone manages its metadata namespace through volumes, buckets, and keys. Data ingestion through 's3'. Data processing and visualization.
The meta database: a database compatible with SQLAlchemy. Airflow stores metadata in it (DAG runs, XComs, task instances, etc.). It is also essential to understand what Airflow is not: it's neither a streaming solution nor a data processing framework.
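A minimal sketch underscoring that point, assuming Airflow 2.x: the DAG below only orchestrates, delegating the heavy lifting to an external system, while run metadata (task instances, XComs) lands in the metadata database.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_export",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Airflow triggers the work; the actual processing happens elsewhere.
    BashOperator(task_id="run_export",
                 bash_command="echo 'kick off export job'")
```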
The data engineering landscape is constantly changing, but major trends seem to remain the same. How to Become a Data Engineer: as a data engineer, I am tasked with designing efficient data processes almost every day. It was created by Spotify to manage massive data processing workloads.
Architectural Patterns for Data Quality: now that we understand the trade-off between speed and correctness and the difference between data testing and observability, let's talk about the data processing types. Two-Phase WAP: the Two-Phase WAP (Write-Audit-Publish), as the name suggests, follows two copy processes, as sketched below.
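A hedged sketch of the Write-Audit-Publish pattern in general (not this article's specific implementation; the injected storage functions are placeholders): write to a staging location, audit it, and only then publish into production.

```python
def write_audit_publish(df, write_stage, audit_checks, publish) -> None:
    """Two-phase WAP: stage the data, audit it, publish only on success."""
    stage_path = write_stage(df)  # phase 1: write a staged copy
    failures = [check.__name__ for check in audit_checks
                if not check(stage_path)]
    if failures:
        raise ValueError(f"Audit failed, not publishing: {failures}")
    publish(stage_path)           # phase 2: copy/swap into production
```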
Data users in these enterprises don't know how data is derived and lack confidence in whether it's the right source to use. If data access policies and lineage aren't consistent across an organization's private cloud and public clouds, gaps will exist in audit logs. From Bad to Worse.
Question to the readers: what do you think of the current state of real-time data processing engines? Are there enough use cases? [link] InfluxData: How Good is Parquet for Wide Tables (Machine Learning Workloads) Really? Is Parquet still good enough for machine learning, vector, and Lakehouse workloads?
AWS Glue is a widely used serverless data integration service that uses automated extract, transform, and load (ETL) methods to prepare data for analysis. It offers a simple and efficient solution for data processing in organizations. Glue also writes each job's metadata into the embedded AWS Glue Data Catalog.
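A bare-bones sketch of a Glue ETL job script of the kind described; the catalog database, table, dropped field, and S3 path are placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read via the Data Catalog, drop an unwanted field, and write to S3.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")
glue_context.write_dynamic_frame.from_options(
    frame=dyf.drop_fields(["internal_note"]),
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/clean_orders/"},
    format="parquet")
job.commit()
```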
During the transformation phase, data is processed and converted into the appropriate format for the target destination. While legacy ETL has a slow transformation step, modern ETL platforms replace disk-based processing with in-memory processing to allow for real-time data processing, enrichment, and analysis.
We recently announced Snowpark ML Modeling API (generally available soon), which enables the use of popular ML frameworks such as Scikit-learn and XGBoost for feature engineering and model training without moving data out of Snowflake. Snowpark ML enables intuitive model development using these frameworks through familiar Python APIs.
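A short sketch of the Snowpark ML Modeling API flow, assuming connection parameters and a table with FEATURE_1, FEATURE_2, and LABEL columns (all placeholder names):

```python
from snowflake.snowpark import Session
from snowflake.ml.modeling.xgboost import XGBClassifier

connection_parameters = {"account": "...", "user": "...", "password": "..."}
session = Session.builder.configs(connection_parameters).create()

df = session.table("TRAINING_DATA")  # the data never leaves Snowflake
clf = XGBClassifier(
    input_cols=["FEATURE_1", "FEATURE_2"],
    label_cols=["LABEL"],
    output_cols=["PREDICTION"],
)
clf.fit(df)                                   # training runs on Snowflake compute
preds = clf.predict(session.table("SCORING_DATA"))
```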
Hadoop and Spark are the two most popular platforms for Big Data processing. They both enable you to deal with huge collections of data no matter their format, from Excel tables to user feedback on websites to images and video files. Obviously, Big Data processing involves hundreds of computing units.
Matt Harrison is a Python expert with a long history of working with data who now spends his time on consulting and training. You can observe your pipelines with built-in metadata search and column-level lineage. What are some of the utility features that you have found most helpful for data processing?
Having carefully built this feature to minimize attack surface and external data processing, we are able to help protect users from not only unwanted contact, but also cyber attacks and spyware. Protect your IP address metadata in calls: there are two common methods of connecting call participants, peer-to-peer and via a relay.
The table information (such as schema and partitions) is stored separately as part of the metadata (manifest) file, making it easier for applications to quickly integrate with the tables and the storage formats of their choice. Change data capture (CDC). 3: Open Performance.
With in-place table migration, you can rapidly convert to Iceberg tables since there is no need to regenerate data files. Only metadata will be regenerated. Newly generated metadata will then point to the source data files, as illustrated in the diagram below. Data quality using table rollback. Metadata management.
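One common mechanism for such an in-place migration is Iceberg's Spark "migrate" procedure (this may differ from the specific tooling the article describes); the catalog config is abbreviated and the table name is a placeholder.

```python
from pyspark.sql import SparkSession

# Additional catalog settings (e.g., a Hive metastore URI) are required
# in practice; only the session catalog override is shown here.
spark = (SparkSession.builder
         .appName("iceberg-migrate")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.iceberg.spark.SparkSessionCatalog")
         .getOrCreate())

# Rewrites only metadata; the new metadata points at the existing data files.
spark.sql("CALL spark_catalog.system.migrate('db.legacy_hive_table')")
```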
On the other hand, these optimizations themselves need to be sufficiently inexpensive to justify their own processing cost over the gains they bring. We built AutoOptimize to efficiently and transparently optimize the data and metadata storage layout while maximizing their cost and performance benefits.