Batch data processing, historically known as ETL, is extremely challenging: it's time-consuming, brittle, and often unrewarding. In this post, we'll explore how applying the functional programming paradigm to data engineering can bring a lot of clarity to the process.
The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms.
Summary: Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data being generated continue to double, requiring further advancements in platform capabilities to keep up. What do you have planned for the future of your academic research?
It often requires a long process that touches many languages and frameworks. They have to integrate these jobs with workflow systems, test them at scale, tune them, and release into production. This is not an interactive process, and often bugs are not found until later. However, this approach has its own challenges.
Moreover, these steps can be combined in different ways, perhaps omitting some or changing the order of others, producing different data processing pipelines tailored to a particular task at hand. The reader is assumed to be somewhat familiar with the DataKinds and TypeFamilies extensions, but we will review some peculiarities.
Data Management: A tutorial on how to use VDK to perform batch data processing. Versatile Data Kit (VDK) is an open-source data ingestion and processing framework designed to simplify data management complexities. The following figure shows a snapshot of the VDK UI.
Semih is a researcher and entrepreneur with a background in distributed systems and databases. He then pursued his doctoral studies at Stanford University, delving into the complexities of database systems. Don't forget to subscribe to my YouTube channel to get the latest on Unapologetically Technical!
Data consistency, feature reliability, processing scalability, and end-to-end observability are key drivers to ensuring business as usual (zero disruptions) and a cohesive customer experience. With our new data processing framework, we were able to observe a multitude of benefits, including 99.9%
The Machine Learning Platform (MLP) team at Netflix provides an entire ecosystem of tools around Metaflow , an open source machine learning infrastructure framework we started, to empower data scientists and machine learning practitioners to build and manage a variety of ML systems. ETL workflows), as well as downstream (e.g.
Introduction: Data engineering is the field of study that deals with the design, construction, deployment, and maintenance of data processing systems. This includes designing and implementing […] The post Most Essential 2023 Interview Questions on Data Engineering appeared first on Analytics Vidhya.
Data lineage is an instrumental part of Meta's Privacy Aware Infrastructure (PAI) initiative, a suite of technologies that efficiently protect user privacy. It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta's systems.
Building efficient data pipelines with DuckDB
4.1. Use DuckDB to process data, not for multiple users to access data
4.2. Cost calculation: DuckDB + Ephemeral VMs = dirt cheap data processing
4.3. Processing data less than 100GB? Use DuckDB
4.4.
WhyLogs is a powerful library for flexibly instrumenting all of your data systems to understand the entire lifecycle of your data, from source to productionized model. You have full control over your data, and its plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms.
This involves getting data from an API and storing it in a PostgreSQL database. Overview: Let's break down the data pipeline process step by step. Data Streaming: Initially, data is streamed from the API into a Kafka topic. You can see some examples and manually query the dataset records using this link.
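The API → Kafka topic → PostgreSQL flow described above can be sketched end to end. To keep the sketch self-contained, an in-memory queue stands in for the Kafka topic and a dict stands in for the PostgreSQL table; the payload fields are hypothetical:

```python
# Sketch of the pipeline shape: produce API records onto a topic,
# then consume them into a keyed store (stand-in for a Postgres table).
import json
import queue

topic = queue.Queue()   # stand-in for a Kafka topic
table = {}              # stand-in for a PostgreSQL table keyed by primary key

def produce(api_payload: str) -> None:
    """Producer side: push each record from the API response onto the topic."""
    for record in json.loads(api_payload):
        topic.put(json.dumps(record))

def consume() -> None:
    """Consumer side: drain the topic and upsert rows by id."""
    while not topic.empty():
        record = json.loads(topic.get())
        table[record["id"]] = record

produce('[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]')
consume()
print(sorted(table))  # [1, 2]
```

In a real deployment the producer and consumer would be separate processes using a Kafka client and a PostgreSQL driver, but the decoupled produce/consume shape is the same.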
This has led to inefficiencies in how data is stored, accessed, and shared across process and system boundaries. The Arrow project is designed to eliminate wasted effort in translating between languages, and Voltron Data was created to help grow and support its technology and community.
Real-time data processing has emerged as a major trend, and the demand for real-time data handling is expected to increase significantly in the coming years. To meet this need, people who work in data engineering will focus on building systems that can handle ongoing data streams with little delay.
However, relying only on structured data for these models can overlook valuable signals present in unstructured sources like images, which influence user engagement. Cortex AI delivers exceptional quality across a wide range of unstructured data processing tasks through models and specialized functions tailored for different tasks.
We’re thrilled to announce that Sync has partnered with Apex Systems, a leading global technology services provider with a presence in more than 70 markets across North America, Europe, and India. Combining Apex Systems' industry experience with our cutting-edge tech will drive innovation that will propel the industry forward.
By enabling advanced analytics and centralized document management, Digityze AI helps pharmaceutical manufacturers eliminate data silos and accelerate data sharing. KAWA Analytics: Digital transformation is an admirable goal, but legacy systems and inefficient processes hold back many companies' efforts.
The Race For Data Quality In A Medallion Architecture: The Medallion architecture pattern is gaining traction among data teams. It is a layered approach to managing and transforming data. By systematically moving data through these layers, the Medallion architecture enhances the data structure in a data lakehouse environment.
One of the most impactful, yet underdiscussed, areas is the potential of autonomous finance, where systems not only automate payments but manage accounts and financial processes with minimal human intervention.
The Critical Role of AI Data Engineers in a Data-Driven World: How does a chatbot seamlessly interpret your questions? The answer lies in unstructured data processing, a field that powers modern artificial intelligence (AI) systems. Adding to this complexity is the sheer volume of data generated daily.
Key Takeaways: The significance of using legacy systems like mainframes in modern AI. How mainframe data helps reduce bias in AI models. The challenges and solutions involved in integrating legacy data with modern AI systems. Data Silos: Mainframe data often exists in a silo, separated from other enterprise data.
Why Future-Proofing Your Data Pipelines Matters: Data has become the backbone of decision-making in businesses across the globe. The ability to harness and analyze data effectively can make or break a company’s competitive edge. Set Up Auto-Scaling: Configure auto-scaling for your data processing and storage resources.
Real-Time Data Processing: CDC enables real-time data processing by capturing changes as they happen. This is crucial for applications that require up-to-date information, such as fraud detection systems or recommendation engines. It also supports highly distributed database setups.
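The change data capture (CDC) idea described above can be illustrated with a minimal log-based sketch: every write to the source appends a change event to a log, and a downstream consumer replays the log to stay in sync instead of re-reading the whole table. The key and event names here are hypothetical:

```python
# Minimal log-based CDC sketch. The change_log stands in for a CDC stream
# (e.g. a database write-ahead log tailed into a Kafka topic).
source = {}
change_log = []

def write(key, value):
    """Every write to the source table also emits a change event."""
    op = "update" if key in source else "insert"
    source[key] = value
    change_log.append({"op": op, "key": key, "value": value})

replica = {}
def apply_changes(log):
    """Downstream consumer: replay change events in order to stay in sync."""
    for event in log:
        replica[event["key"]] = event["value"]

write("user:1", {"plan": "free"})
write("user:1", {"plan": "pro"})
apply_changes(change_log)
print(replica["user:1"])  # {'plan': 'pro'}
```

The replica only ever processes the two change events, not a full table scan, which is why CDC keeps downstream systems such as fraud detectors current with low latency.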
It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile's exposure. This nuanced integration of data and technology empowers us to offer bespoke content recommendations.
We're sharing how Meta built support for data logs, which provide people with additional data about how they use our products. Here we explore initial system designs we considered, an overview of the current architecture, and some important principles Meta takes into account in making data accessible and easy to understand.
While it's very easy to put together a compelling AI demo, it's very difficult to get AI systems into production. I believe it can revolutionize the world and solve critical global problems. What problem does Contextual AI aim to solve? This is especially true for enterprises, which demand high levels of accuracy, auditability and security.
To overcome these hurdles, CTC moved its processing off of managed Spark and onto Snowflake, where it had already built its data foundation. Thanks to the reduction in costs, CTC now maximizes data to further innovate and increase its market-making capabilities.
The learning mostly involves understanding the data's nature, frequency of data processing, and awareness of the computing cost. On a similar line, Uber writes about its comprehensive settlement accounting system designed to handle the immense volume of transactions processed each month efficiently.
Introduction: If you are trying to break into (or land a new) data engineering job, you will inevitably encounter a slew of data engineering tools. Key parts of data systems: requirements, data flow design, data processing design, and data storage design.
From Wikipedia: “In the late 1950s, computer users and manufacturers were becoming concerned about the rising cost of programming. A 1959 survey had found that in any data processing installation, the programming cost US$800,000 on average and that translating programs to run on new hardware would cost $600,000.”
Instead of driving innovation, data engineers often find themselves bogged down with maintenance tasks. On average, engineers spend over half of their time maintaining existing systems rather than developing new solutions. Tool sprawl is another hurdle that data teams must overcome.
Read More: Discover how to build a data pipeline in 6 steps. Data Integration: Data integration involves combining data from different sources into a single, unified view. This technique is vital for ensuring consistency and accuracy across datasets, especially in organizations that rely on multiple data systems.
Automation and AI are pushing organizations forward, but the reality is that the core systems that run our business still exist. While a cloud-first company may not have on-prem legacy systems, most companies are running an IBM Z or IBM i for transactional data processes. What's next?
Understanding this framework offers valuable insights into team efficiency, operational excellence, and data quality. Process-centric data teams focus their energies predominantly on orchestrating and automating workflows. Their primary success metric is whether their processes run smoothly and without errors.
Evals are introduced to evaluate LLM responses through various techniques, including self-evaluation, using another LLM as a judge, or human evaluation to ensure the system's behavior aligns with intentions. It employs a two-tower model approach to learn query and item embeddings from user engagement data.
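The two-tower approach mentioned above learns separate embeddings for queries and items and scores relevance by their similarity. A toy sketch of the scoring step only; the embedding values and item names are hypothetical, and in practice each tower would be a trained neural network:

```python
# Toy sketch of two-tower scoring: each tower maps its input to an embedding,
# and relevance is the dot product between query and item embeddings.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query_embedding = [0.2, 0.9, 0.1]        # output of a hypothetical query tower
item_embeddings = {                      # outputs of a hypothetical item tower
    "item_a": [0.1, 0.8, 0.0],
    "item_b": [0.9, 0.1, 0.2],
}

scores = {item: dot(query_embedding, emb) for item, emb in item_embeddings.items()}
best = max(scores, key=scores.get)
print(best)  # item_a
```

Keeping the towers separate is what makes retrieval fast: item embeddings can be precomputed and indexed, so serving reduces to a nearest-neighbor search against the query embedding.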
Many data engineers coming from traditional batch processing frameworks have questions about real-time data processing systems, like “What kind of data model did you implement for real-time processing?”
Summary: Streaming data processing enables new categories of data products and analytics. Unfortunately, reasoning about stream processing engines is complex and lacks sufficient tooling. How has the lack of visibility into the flow of data in Flink impacted the ways that teams think about where/when/how to apply it?
Authors: Bingfeng Xia and Xinyu Liu Background At LinkedIn, Apache Beam plays a pivotal role in stream processing infrastructures that process over 4 trillion events daily through more than 3,000 pipelines across multiple production data centers.
As data volumes surge and the need for fast, data-driven decisions intensifies, traditional data processing methods no longer suffice. To stay competitive, organizations must embrace technologies that enable them to process data in real time, empowering them to make intelligent, on-the-fly decisions.
Frances Perry is an engineering manager who spent many years as a heads-down coder creating various distributed systems used in Google and Google Cloud. Frances shares her insights from 16 years at Google, including the development of Flume and Cloud Dataflow, and discusses the challenges and rewards of scaling engineering teams.
QuantumBlack: Solving data quality for gen AI applications. Unstructured data processing is a top priority for enterprises that want to harness the power of GenAI. It brings challenges in data processing and quality, but what data quality means in unstructured data is a top question for every organization.