Batch data processing, historically known as ETL, is extremely challenging. In this post, we'll explore how applying the functional programming paradigm to data engineering can bring a lot of clarity to the process. Late-arriving facts, in particular, can be problematic under a strict immutable-data policy.
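To make the functional framing concrete, here is a minimal, hypothetical sketch of the core idea: treat each partition as the immutable output of a pure, idempotent task, so re-running a backfill for a late-arriving fact always produces the same state. The `warehouse` dict and `build_partition` function are illustrative stand-ins, not anyone's actual API.

```python
from datetime import date

# Toy "warehouse": partition key -> immutable list of rows.
warehouse: dict[date, list[dict]] = {}

def build_partition(day: date, source_rows: list[dict]) -> list[dict]:
    """Pure function: output depends only on inputs, never on warehouse state."""
    return [r for r in source_rows if r["event_date"] == day]

def overwrite_partition(day: date, source_rows: list[dict]) -> None:
    """Idempotent load: replaces the whole partition instead of appending."""
    warehouse[day] = build_partition(day, source_rows)

rows = [
    {"event_date": date(2024, 1, 1), "amount": 10},
    {"event_date": date(2024, 1, 2), "amount": 7},  # late-arriving fact
]
overwrite_partition(date(2024, 1, 2), rows)
overwrite_partition(date(2024, 1, 2), rows)  # safe to re-run: same result
print(warehouse[date(2024, 1, 2)])
```

Because the load overwrites the whole partition rather than appending, replaying it after a late fact arrives cannot double-count rows.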
By Abhinaya Shetty, Bharath Mummadisetty. In the inaugural blog post of this series, we introduced you to the state of our pipelines before Psyberg and the challenges with incremental processing that led us to create the Psyberg framework within Netflix's Membership and Finance data engineering team.
This belief has led us to develop Privacy Aware Infrastructure (PAI), which offers efficient and reliable first-class privacy constructs embedded in Meta infrastructure to address different privacy requirements, such as purpose limitation, which restricts the purposes for which data can be processed and used.
Metadata is the information that provides context and meaning to data, ensuring it’s easily discoverable, organized, and actionable. It enhances data quality, governance, and automation, transforming raw data into valuable insights. This is what managing data without metadata feels like. Chaos, right?
Data consistency, feature reliability, processing scalability, and end-to-end observability are key drivers to ensuring business as usual (zero disruptions) and a cohesive customer experience. With our new data processing framework, we were able to observe a multitude of benefits, including 99.9%
Automation, AI, DataOps, and strategic alignment are no longer optional; they are essential components of a successful data strategy. As we look towards 2025, it's clear that data teams must evolve to meet the demands of evolving technology and opportunities. How effective are your current data workflows?
The blog dropped the last edition's recommendation on AI and summarized the current state of AI adoption in enterprises. The simplistic model expressed in the blog made it easy for me to reason about the transactional system design. Kafka is probably the most reliable data infrastructure in the modern data era.
Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms. In this blog, we will discuss: What is the Open Table Format (OTF)? First, we create an Iceberg table in Snowflake and then insert some data.
Its main purpose is to enable easy unit testing of your data pipelines, but it can technically be used in other situations as a readable data format for small data sets. All of the above commands are very likely to be described in separate future blog posts, but right now let's focus on the dataflow sample command.
As we describe in this blog post, the top-k feature uses runtime information (namely, the current contents of the top-k elements) to skip micro-partitions that we can guarantee won't contribute to the overall result. Snowflake starts processing those partitions first. on average, with some queries also reaching up to 99.8%
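To illustrate the idea (a hypothetical Python sketch, not Snowflake's actual implementation): keep a running top-k min-heap, visit partitions with the largest precomputed max first, and skip any micro-partition whose max is no better than the current k-th value.

```python
import heapq

# Each micro-partition carries a zone-map-style max value plus its rows.
partitions = [
    {"max": 90, "rows": [90, 42, 17]},
    {"max": 30, "rows": [30, 29, 5]},   # skippable once heap min > 30
    {"max": 88, "rows": [88, 61, 60]},
]

def top_k(parts, k):
    heap = []  # min-heap holding the current top-k values
    # Process partitions with the highest max first, so pruning kicks in early.
    for part in sorted(parts, key=lambda p: p["max"], reverse=True):
        if len(heap) == k and part["max"] <= heap[0]:
            continue  # guaranteed not to contribute: skip the whole partition
        for value in part["rows"]:
            if len(heap) < k:
                heapq.heappush(heap, value)
            elif value > heap[0]:
                heapq.heapreplace(heap, value)
    return sorted(heap, reverse=True)

print(top_k(partitions, 3))  # [90, 88, 61] -- the 30-max partition is never read
```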
In this context, managing the data, especially when it arrives late, can present a substantial challenge! In this three-part blog post series, we introduce you to Psyberg, our incremental data processing framework designed to tackle such challenges! Let's dive in! To solve these problems, we came up with Psyberg!
In this blog, I will demonstrate the value of Cloudera DataFlow (CDF), the edge-to-cloud streaming data platform available on the Cloudera Data Platform (CDP), as a data integration and democratization fabric. Data and metadata: data inputs and data outputs produced based on the application logic.
Liang Mou, Staff Software Engineer, Logging Platform; Elizabeth (Vi) Nguyen, Software Engineer I, Logging Platform. In today's data-driven world, businesses need to process and analyze data in real time to make informed decisions. What is Change Data Capture? Why is CDC Important?
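As a conceptual illustration (an assumed toy model, not Pinterest's implementation), CDC can be thought of as consuming an ordered stream of row-level change events and applying them to a downstream copy:

```python
# Minimal CDC consumer sketch: apply an ordered change stream to a replica.
change_log = [
    {"op": "insert", "id": 1, "row": {"name": "Ada", "plan": "basic"}},
    {"op": "update", "id": 1, "row": {"name": "Ada", "plan": "premium"}},
    {"op": "delete", "id": 1},
]

replica: dict[int, dict] = {}

for event in change_log:
    if event["op"] in ("insert", "update"):
        replica[event["id"]] = event["row"]   # upsert the latest row image
    elif event["op"] == "delete":
        replica.pop(event["id"], None)        # tombstone removes the row

print(replica)  # {} -- the row was inserted, updated, then deleted
```

The key property is ordering: as long as events are applied in commit order, the replica converges to the source without full-table reloads.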
By Abhinaya Shetty, Bharath Mummadisetty. This blog post will cover how Psyberg helps automate the end-to-end catchup of different pipelines, including dimension tables. In the previous installments of this series, we introduced Psyberg and delved into its core operational modes: Stateless and Stateful Data Processing.
In addition to big data workloads, Ozone is also fully integrated with authorization and data governance providers, namely Apache Ranger and Apache Atlas, in the CDP stack. While we walk through the steps one by one from data ingestion to analysis, we will also demonstrate how Ozone can serve as an 'S3'-compatible object store.
[link] Georg Heiler: Upskilling data engineers. "What should I prefer for 2028?" and "How can I break into data engineering?" are common LinkedIn requests. I honestly don't have a solid answer, but this blog is an excellent overview of upskilling.
[link] Netflix: A Recap of the Data Engineering Open Forum at Netflix. Netflix publishes a recap of all the talks from the first Data Engineering Open Forum tech meetup. The blog contains a summary of each talk and a link to the YouTube channel with all the talks. Are there enough use cases?
It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile's exposure. This nuanced integration of data and technology empowers us to offer bespoke content recommendations.
The fact that ETL tools evolved to expose graphical interfaces seems like a detour in the history of data processing, and would certainly make for an interesting blog post of its own. Sure, there's a need to abstract the complexity of data processing, computation, and storage.
This blog captures the current state of Agent adoption, emerging software engineering roles, and the use-case categories. Generative AI demands the processing of vast amounts of diverse, unstructured data (e.g., meeting recordings and videos), which contrasts with traditional SQL-centric systems for structured data.
This is part 2 in this blog series. You can read part 1 here: Digital Transformation is a Data Journey From Edge to Insight. The first blog introduced a mock connected-vehicle manufacturing company, The Electric Car Company (ECC), to illustrate the manufacturing data path through the data lifecycle.
In the second part, we will focus on architectural patterns to implement data quality from a data contract perspective. Why is Data Quality Expensive? I won't bore you with the importance of data quality in this blog. Let's talk about the data processing types.
Data users in these enterprises don’t know how data is derived and lack confidence in whether it’s the right source to use. . If data access policies and lineage aren’t consistent across an organization’s private cloud and public clouds, gaps will exist in audit logs. From Bad to Worse.
Recently, we announced enhanced multi-function analytics support in Cloudera Data Platform (CDP) with Apache Iceberg. Iceberg is a high-performance open table format for huge analytic data sets. The post Streaming Ingestion for Apache Iceberg With Cloudera Stream Processing appeared first on Cloudera Blog.
The blog is an excellent comparison study of Ray vs. Dask's performance. The author discusses the OneTable sync mechanism among all three major LakeHouse formats in this blog. The blog discusses Psyberg's two operational modes, stateless and stateful data processing, and the metadata it stores to utilize later.
It is a replicated, highly available service that is responsible for managing the metadata for all objects stored in Ozone. As Ozone scales to exabytes of data, it is important to ensure that Ozone Manager can perform at scale. The hardware specifications are included at the end of this blog.
In Part 2 of our blog series, we described how we were able to integrate Ray into our existing ML infrastructure. In this blog post, we will discuss a second type of popular application of Ray at Pinterest: offline batch inference of ML models. Ray Data is not bound to any specific ML library.
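As a rough sketch of what offline batch inference with Ray Data can look like (assumed usage of Ray's public `ray.data` module, not Pinterest's internal setup; the "model" is a stand-in):

```python
import ray

ray.init()

# Build a small in-memory dataset; in practice this would be something like
# ray.data.read_parquet("s3://...") over feature data.
ds = ray.data.from_items([{"feature": float(i)} for i in range(8)])

def predict(batch):
    # Stand-in "model": double the feature. A real job would load model
    # weights once per worker and run framework-specific inference here.
    batch["prediction"] = [2.0 * x for x in batch["feature"]]
    return batch

# map_batches runs the function in parallel over blocks of the dataset.
results = ds.map_batches(predict, batch_format="pandas")
print(results.take(3))
```

Because the inference function only sees generic batches, the same driver code works whether the model inside is PyTorch, TensorFlow, or anything else, which is the "not bound to any specific ML library" point.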
With in-place table migration, you can rapidly convert to Iceberg tables since there is no need to regenerate data files. Only metadata will be regenerated. Newly generated metadata will then point to source data files as illustrated in the diagram below. Data quality using table rollback. Metadata management.
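For reference, a minimal sketch of what in-place migration and rollback calls can look like using Iceberg's Spark procedures, assuming a Spark session already configured with the Iceberg runtime and SQL extensions; the table name and snapshot id are placeholders:

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime JAR is on the classpath.
spark = (
    SparkSession.builder
    .appName("iceberg-migrate")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.iceberg.spark.SparkSessionCatalog")
    .getOrCreate()
)

# In-place migration: rewrites only metadata; existing data files are
# adopted as-is by the new Iceberg table.
spark.sql("CALL spark_catalog.system.migrate('db.sample_table')")

# Roll back to an earlier snapshot if a bad write slips in (data quality).
# The snapshot id below is a placeholder, not a real value.
spark.sql(
    "CALL spark_catalog.system.rollback_to_snapshot('db.sample_table', 123456789)"
)
```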
The table information (such as schema and partitioning) is stored as part of the metadata (manifest) file separately, making it easier for applications to quickly integrate with the tables and the storage formats of their choice. Change data capture (CDC). Open Performance.
So, embrace the power of Change Data Capture, and embark on a captivating journey where the magic of real-time data awaits. In this blog, we will cover: What Is CDC and Its Benefits? These additional columns store metadata like timestamps, user IDs, and change types, ensuring granular change tracking and auditability.
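As a hypothetical illustration of those audit columns, here is a self-contained SQLite sketch that records the change type and timestamp via triggers (table, trigger, and column names are invented; a user-id column could be added the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, plan TEXT);

-- Audit table capturing granular change metadata.
CREATE TABLE customers_audit (
    row_id      INTEGER,
    change_type TEXT,                                -- INSERT / UPDATE / DELETE
    changed_at  TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER trg_ins AFTER INSERT ON customers BEGIN
    INSERT INTO customers_audit (row_id, change_type) VALUES (NEW.id, 'INSERT');
END;
CREATE TRIGGER trg_upd AFTER UPDATE ON customers BEGIN
    INSERT INTO customers_audit (row_id, change_type) VALUES (NEW.id, 'UPDATE');
END;
""")

conn.execute("INSERT INTO customers (id, plan) VALUES (1, 'basic')")
conn.execute("UPDATE customers SET plan = 'premium' WHERE id = 1")
for row in conn.execute("SELECT row_id, change_type, changed_at FROM customers_audit"):
    print(row)  # (1, 'INSERT', ...), (1, 'UPDATE', ...)
```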
Open source data lakehouse deployments are built on the foundations of compute engines (like Apache Spark, Trino, Apache Flink), distributed storage (HDFS, cloud blob stores), and metadata catalogs / table formats (like Apache Iceberg, Delta, Hudi, Apache Hive Metastore). Tables are governed as per agreed upon company standards.
Do ETL and data integration activities seem complex to you? Read this blog to understand what makes AWS Glue one of the most popular data integration solutions in the industry. Did you know the global big data market will likely reach $268.4 Businesses are leveraging big data now more than ever.
During the transformation phase, data is processed and converted into the appropriate format for the target destination. While legacy ETL has a slow transformation step, modern ETL platforms replace disk-based processing with in-memory processing to allow for real-time data processing, enrichment, and analysis.
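A minimal sketch of the contrast (illustrative only): a generator-based transform step that enriches records entirely in memory as they flow through, rather than staging them to disk between steps.

```python
from typing import Iterable, Iterator

def extract() -> Iterator[dict]:
    # Stand-in source; a real pipeline would read from an API, queue, or DB.
    yield {"user": "ada", "amount_cents": 1250}
    yield {"user": "lin", "amount_cents": 400}

def transform(rows: Iterable[dict]) -> Iterator[dict]:
    # In-memory enrichment: no intermediate files written between steps.
    for row in rows:
        row["amount_dollars"] = row["amount_cents"] / 100
        yield row

def load(rows: Iterable[dict]) -> None:
    for row in rows:
        print("loading:", row)  # stand-in for the target warehouse write

load(transform(extract()))
```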
In our previous blog post, we introduced Edgar, our troubleshooting tool for streaming sessions. We could also get contextual information about the streaming session by joining relevant traces with account metadata and service logs. The next challenge was to stream large amounts of traces via a scalable data processing platform.
We scored the highest in hybrid, intercloud, and multi-cloud capabilities because we are the only vendor in the market with a true hybrid data platform that can run on any cloud, including private cloud, to deliver a seamless, unified experience for all data, wherever it lies. Sign up for a trial to see for yourself.
CDF-PC enables Apache NiFi users to run their existing data flows on a managed, auto-scaling platform, with a streamlined way to deploy NiFi data flows and a central monitoring dashboard, making it easier than ever before to operate NiFi data flows at scale in the public cloud. The need for a cloud-native Apache NiFi service.
This was a great conversation about the complexities of working in a niche domain of data analysis and how to build a pipeline of high-quality data from collection to analysis.
On the other hand, these optimizations themselves need to be sufficiently inexpensive to justify their own processing cost over the gains they bring. We built AutoOptimize to efficiently and transparently optimize the data and metadata storage layout while maximizing their cost and performance benefits.
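To make the layout-optimization idea concrete, here is a hypothetical sketch of one such technique, compacting many small files into target-sized groups, of the kind a system like AutoOptimize can apply (the sizes and target are invented):

```python
# Toy small-file compaction: greedily bin-pack files into ~128 MB groups.
file_sizes = [5, 40, 90, 12, 70, 128, 3]  # sizes in MB for readability
target_mb = 128

def plan_compaction(sizes_mb, target):
    groups, current, current_size = [], [], 0
    for size in sorted(sizes_mb, reverse=True):
        if current_size + size > target and current:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Each group would be rewritten as a single file by the real system.
print(plan_compaction(file_sizes, target_mb))
```

The planning pass itself is cheap (a sort plus one linear scan), which speaks to the point above: the optimization must cost far less than the read-side gains it unlocks.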
Some years ago, he wrote three articles defining the data engineering field. When doing data engineering, you can touch a lot of different concepts. Read technical blogs, watch conferences, and read 📘 Designing Data-Intensive Applications (even if it could be overkill). Is it really modern?
In this blog post, we will discuss the AvroTensorDataset API, techniques we used to improve data processing speeds by up to 162x over existing solutions (thereby decreasing overall training time by up to 66%), and performance results from benchmarks and production. The rest of the steps act as consumers of prefetched bytes.
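The producer/consumer prefetching mentioned here can be sketched generically (an illustrative pattern, not LinkedIn's AvroTensorDataset code): one thread reads byte blocks ahead of time while downstream steps consume them from a bounded queue.

```python
import queue
import threading
import time

SENTINEL = object()

def prefetcher(blocks, out_q):
    # Producer: reads "byte blocks" ahead of the consumer.
    for block in blocks:
        time.sleep(0.01)        # simulate I/O latency
        out_q.put(block)
    out_q.put(SENTINEL)         # signal end of stream

def consumer(in_q):
    # Consumer: deserializes/uses blocks as they become available.
    while True:
        block = in_q.get()
        if block is SENTINEL:
            break
        print("processing", len(block), "bytes")

blocks = [bytes(64) for _ in range(5)]
q = queue.Queue(maxsize=2)       # bounded: limits memory held in flight
t = threading.Thread(target=prefetcher, args=(blocks, q))
t.start()
consumer(q)
t.join()
```

The bounded queue is the important design choice: it overlaps I/O with compute while capping how many prefetched blocks sit in memory at once.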
The first generation of the Hive Metastore attempted to address the performance considerations of running SQL efficiently on a data lake. It provided the concept of databases, schemas, and tables for describing the structure of a data lake in a way that let BI tools traverse the data efficiently.
Use cases like fraud detection, network threat analysis, manufacturing intelligence, commerce optimization, real-time offers, instantaneous loan approvals, and more are now possible by moving the data processing components upstream to address these real-time needs. Not in the manufacturing space? Not to worry.
Data performance testing is the process of evaluating the efficiency, effectiveness, and scalability of your data processing systems and infrastructure. To perform data performance testing, you should first establish performance benchmarks and targets for your data processing systems.
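As a bare-bones illustration of "establish benchmarks and targets," here is a hypothetical timing harness that checks a processing function against a latency target (the workload and threshold are invented):

```python
import statistics
import time

def process_batch(rows):
    # Stand-in workload; replace with the pipeline step under test.
    return sorted(rows)

def benchmark(fn, payload, runs=20):
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(payload)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

TARGET_SECONDS = 0.05            # invented SLO for the example
median = benchmark(process_batch, list(range(100_000)))
status = "PASS" if median <= TARGET_SECONDS else "FAIL"
print(f"median={median:.4f}s target={TARGET_SECONDS}s -> {status}")
```

Using the median over repeated runs keeps one noisy measurement from failing the check; percentile targets (p95, p99) follow the same pattern.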