Batch data processing — historically known as ETL — is extremely challenging. In this post, we’ll explore how applying the functional programming paradigm to data engineering can bring a lot of clarity to the process. Late-arriving facts can be problematic with a strict immutable data policy.
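To make the functional idea concrete, here is a minimal sketch (all names hypothetical, not from the post): a "pure" batch task whose output depends only on its inputs, and whose reruns overwrite the same partition instead of appending, keeping backfills idempotent.

```python
import pandas as pd
from pathlib import Path

def build_daily_fact(ds: str, source_dir: Path, target_dir: Path) -> Path:
    """Pure task: the output depends only on (ds, source partition)."""
    src = source_dir / f"ds={ds}" / "events.parquet"
    events = pd.read_parquet(src)
    daily = (
        events.groupby("user_id", as_index=False)
              .agg(orders=("order_id", "nunique"), revenue=("amount", "sum"))
    )
    out = target_dir / f"ds={ds}"
    out.mkdir(parents=True, exist_ok=True)
    # Overwrite the partition rather than append, so reruns are idempotent.
    daily.to_parquet(out / "fact_orders.parquet", index=False)
    return out
```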
Since it takes so long to iterate on workflows, some ML engineers started to perform data processing directly inside training jobs. This is what we commonly refer to as Last Mile Data Processing. Last Mile processing can boost ML engineers’ velocity as they can write code in Python, directly using PyTorch.
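As an illustration of what last-mile processing can look like, here is a minimal sketch assuming a PyTorch training job (dataset and field names hypothetical): the feature transform lives inside the Dataset rather than in an upstream workflow, so iterating on it only means editing Python.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LastMileDataset(Dataset):
    """Applies feature transforms on the fly, inside the training job."""

    def __init__(self, rows):
        self.rows = rows  # e.g. raw records loaded from a feature store

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        # Last-mile transform: done here instead of a separate ETL workflow.
        features = torch.tensor([row["clicks"] / max(row["views"], 1),
                                 float(row["is_premium"])])
        label = torch.tensor(row["converted"], dtype=torch.float32)
        return features, label

loader = DataLoader(
    LastMileDataset(rows=[{"clicks": 3, "views": 10,
                           "is_premium": True, "converted": 1}]),
    batch_size=1,
)
```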
A collaborative and interactive workspace allows users to perform big data processing and machine learning tasks easily. In this blog post, we will take a closer look at Azure Databricks, its key features, […] The post Azure Databricks: A Comprehensive Guide appeared first on Analytics Vidhya.
By Abhinaya Shetty, Bharath Mummadisetty. In the inaugural blog post of this series, we introduced you to the state of our pipelines before Psyberg and the challenges with incremental processing that led us to create the Psyberg framework within Netflix’s Membership and Finance data engineering team.
The typical pharmaceutical organization faces many challenges which slow down the data team: Raw, barely integrated data sets require engineers to perform manual, repetitive, error-prone work to create analyst-ready data sets. Cloud computing has made it much easier to integrate data sets, but that’s only the beginning.
Data consistency, feature reliability, processing scalability, and end-to-end observability are key drivers to ensuring business as usual (zero disruptions) and a cohesive customer experience. With our new data processing framework, we were able to observe a multitude of benefits, including 99.9%
StreamNative, a leading Apache Pulsar-based real-time data platform solutions provider, and Databricks, the Data Intelligence Platform, are thrilled to announce the enhanced Pulsar-Spark.
This belief has led us to develop Privacy Aware Infrastructure (PAI), which offers efficient and reliable first-class privacy constructs embedded in Meta infrastructure to address different privacy requirements, such as purpose limitation, which restricts the purposes for which data can be processed and used.
The blog is an excellent summary of the existing unstructured data landscape. The learning mostly involves understanding the data's nature, frequency of data processing, and awareness of the computing cost. It is exciting to read probably the first blog on building a vector search infrastructure at scale.
Data engineering is the force behind seamless data flow, enabling everything from AI-driven automation to real-time analytics. To stay competitive, businesses need to adapt to new trends and find new ways to deal with ongoing problems by taking advantage of new possibilities in data engineering.
Discover the insights he gained from academia and industry, his perspective on the future of data processing, and the story behind building a next-generation graph database. Semih explains how Kuzu addresses the challenges of large graph analytics, the benefits of embeddability, and its potential for applications in AI and beyond.
The blog took out the last edition’s recommendation on AI and summarized the current state of AI adoption in enterprises. The simplistic model expressed in the blog made it easy for me to reason about the transactional system design. Kafka is probably the most reliable data infrastructure in the modern data era.
I found the blog to be a fresh take on the skills in demand, as revealed by layoff datasets. DeepSeek’s smallpond Takes on Big Data. DeepSeek continues to impact the Data and AI landscape with its recent open-source tools, such as Fire-Flyer File System (3FS) and smallpond. [link] Mehdio: DuckDB goes distributed?
[link] QuantumBlack: Solving data quality for gen AI applications. Unstructured data processing is a top priority for enterprises that want to harness the power of GenAI. It brings challenges in data processing and quality, but what data quality means in unstructured data is a top question for every organization.
For a quick recap: you can find the first blog post here, where I learned which tech is most in demand in Toronto: [link] And the second blog post is here, where I learned which Toronto industries need data engineers the most: [link] The pipeline proposal: I'll be creating several pipelines in this project, but first things first: I need to ingest the data, (..)
I like writing code, and each time there is a data processing job with some business logic to write, I'm very happy. However, with time I've learned to appreciate the open source contributions enhancing my daily work. The Mack library, the topic of this blog post, is one of those projects I discovered recently.
“Big data analytics” is a phrase coined to refer to datasets so large that traditional data processing software simply can’t manage them. For example, big data is used to pick out trends in economics, and those trends and patterns are used to predict what will happen in the future.
However, it's not the only Python-based framework for distributed data processing, and people talk more and more often about alternatives like Dask or Ray. Since both are completely new to me, I'm going to use this blog post to shed some light on them, and why not plan a deeper exploration next year?
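To give a flavor of one alternative, here is a minimal Dask sketch (illustrative only, not from the post): a grouped aggregation you might otherwise express in PySpark, written with dask.dataframe.

```python
import pandas as pd
import dask.dataframe as dd

# Build a small in-memory frame and partition it across workers.
pdf = pd.DataFrame({"user": ["a", "a", "b"], "amount": [10.0, 5.0, 7.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Lazy, distributed groupby; .compute() triggers execution.
totals = ddf.groupby("user")["amount"].sum().compute()
print(totals)
```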
However, implementing AI models requires significant computing power and real-time data processing, which cannot be achieved without modern, scalable data platforms. The post Telco Enterprise Data Platforms: Key Success Factors in Building for an AI Future appeared first on Cloudera Blog.
Liang Mou; Staff Software Engineer, Logging Platform | Elizabeth (Vi) Nguyen; Software Engineer I, Logging Platform. In today’s data-driven world, businesses need to process and analyze data in real time to make informed decisions. What is Change Data Capture? Why is CDC important?
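For readers new to CDC, a minimal illustrative sketch (hypothetical event shape, not Pinterest's actual pipeline): a consumer replays row-level change events from a log to keep a downstream copy in sync.

```python
# Hypothetical CDC event shape: {"op": "insert"|"update"|"delete", "key": ..., "row": ...}
events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "active"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "paused"}},
    {"op": "delete", "key": 1, "row": None},
]

replica = {}  # downstream copy kept in sync by replaying the change log
for ev in events:
    if ev["op"] == "delete":
        replica.pop(ev["key"], None)
    else:
        # Insert and update are both upserts, which makes replays idempotent.
        replica[ev["key"]] = ev["row"]

print(replica)  # {} -- the row was inserted, updated, then deleted
```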
Why Future-Proofing Your Data Pipelines Matters Data has become the backbone of decision-making in businesses across the globe. The ability to harness and analyze data effectively can make or break a company’s competitive edge. In this blog post, we’ll explore key strategies for future-proofing your data pipelines.
It discusses dataset considerations (relevance, annotation quality, size, ethics, data cutoffs, modalities, synthetic data), formats for instruction and preference tuning, synthetic data creation (Self-Instruct), data labeling approaches (human, LLM-assisted, cohort-based, RLHF-based), and data processing architectures using Amazon Web Services.
These scalable models can handle millions of records, enabling you to efficiently build high-performing NLP data pipelines. However, scaling LLM data processing to millions of records can pose data transfer and orchestration challenges, which are easily addressed by the user-friendly SQL functions in Snowflake Cortex.
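As a hedged sketch of what that pattern can look like (table, column, and model names hypothetical; assumes a Snowpark session and the SNOWFLAKE.CORTEX.COMPLETE SQL function):

```python
from snowflake.snowpark import Session

# Placeholder credentials -- fill in for your account.
connection_parameters = {"account": "...", "user": "...", "password": "..."}
session = Session.builder.configs(connection_parameters).create()

# Run an LLM over every row with plain SQL; the data never leaves Snowflake.
df = session.sql("""
    SELECT review_id,
           SNOWFLAKE.CORTEX.COMPLETE(
               'llama3-8b',
               'Classify the sentiment of this review as positive or negative: '
               || review_text
           ) AS sentiment
    FROM product_reviews
""")
df.show()
```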
This is part 2 in this blog series. You can read part 1 here: Digital Transformation is a Data Journey From Edge to Insight. The first blog introduced a mock connected vehicle manufacturing company, The Electric Car Company (ECC), to illustrate the manufacturing data path through the data lifecycle.
In modern data pipelines, handling data in various formats such as CSV, Parquet, and JSON is essential to ensure smooth data processing. However, one of the most common challenges faced by data engineers is the evolution of schemas as new data comes in.
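One common way to cope with evolving schemas, sketched below assuming a Spark-based pipeline (paths hypothetical): Parquet's mergeSchema read option reconciles files written with older and newer schemas.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution-demo").getOrCreate()

# Day 1 files have (id, amount); day 2 files add a new `currency` column.
old = spark.createDataFrame([(1, 9.99)], ["id", "amount"])
new = spark.createDataFrame([(2, 4.50, "EUR")], ["id", "amount", "currency"])
old.write.mode("overwrite").parquet("/tmp/sales/ds=2024-01-01")
new.write.mode("overwrite").parquet("/tmp/sales/ds=2024-01-02")

# mergeSchema reconciles both versions; missing values come back as NULL.
merged = spark.read.option("mergeSchema", "true").parquet("/tmp/sales")
merged.printSchema()
merged.show()
```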
However, due to the absence of a control group in these countries, we adopt a synthetic control framework ( blog post ) to estimate the counterfactual scenario. With some additional data processing, this yields an expected percent of cash spend each day leading up to and beyond the launch date, which we can base our forecasts on.
[link] Netflix: A Recap of the Data Engineering Open Forum at Netflix. Netflix publishes a recap of all the talks from its first Data Engineering Open Forum tech meetup. The blog contains a summary of each talk and a link to the YouTube channel with all the talks. Are there enough use cases?
Automation, AI, DataOps, and strategic alignment are no longer optional; they are essential components of a successful data strategy. As we look towards 2025, it’s clear that data teams must evolve to meet the demands of evolving technology and opportunities. How effective are your current data workflows?
Data transformation helps make sense of the chaos, acting as the bridge between unprocessed data and actionable intelligence. You might even think of effective data transformation like a powerful magnet that draws the needle from the haystack, leaving the hay behind.
This blog post describes the advantages of real-time ETL and how it increases the value gained from Snowflake implementations. With instant elasticity, high-performance, and secure data sharing across multiple clouds , Snowflake has become highly in-demand for its cloud-based data warehouse offering.
This blog captures the current state of Agent adoption, emerging software engineering roles, and the use case category. Generative AI demands the processing of vast amounts of diverse, unstructured data (e.g., meeting recordings and videos), which contrasts with traditional SQL-centric systems for structured data.
Balancing correctness, latency, and cost in unbounded data processing. Google Dataflow is a fully managed data processing service that provides serverless, unified stream and batch data processing. Apache Beam lets users define processing logic based on the Dataflow model.
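A minimal Apache Beam sketch in Python (illustrative, not from the article): the same pipeline code covers batch and streaming, with windowing expressing the Dataflow model's event-time grouping.

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

# (event_time_seconds, word) pairs standing in for an unbounded source.
events = [(0, "a"), (30, "b"), (70, "a")]

with beam.Pipeline() as p:
    (
        p
        | beam.Create(events)
        # Attach event-time timestamps so windowing can group by time.
        | beam.Map(lambda e: TimestampedValue(e[1], e[0]))
        | beam.WindowInto(FixedWindows(60))  # one-minute fixed windows
        | beam.Map(lambda word: (word, 1))
        | beam.CombinePerKey(sum)
        | beam.Map(print)
    )
```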
As you’ll see in this blog, NiFi not only keeps up with Storm; it beats Storm with 4x the throughput. Customers asked, “Can NiFi keep up with the same throughput as Storm?” because they want to store massive amounts of data, process it in real time or batch, and serve it to other applications.
This is super interesting because it details important steps of the generative process. This blog shows how you can use Gen AI to evaluate inputs like translations with added reasons. How we build Slack AI to be secure and private — How Slack uses VPC and Amazon SageMaker with your data secured and private.
[link] Discord: How Discord Uses Open-Source Tools for Scalable Data Orchestration & Transformation. Discord writes about its migration journey from a homegrown orchestration engine to Dagster. Highlights include streaming execution to process a small chunk of data at a time, and intermediate spilling to disk while computing aggregations.
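For readers unfamiliar with Dagster, a minimal sketch of the asset-based style the post refers to (asset names and logic hypothetical):

```python
from dagster import asset, materialize

@asset
def raw_events() -> list[dict]:
    # In a real pipeline this would pull from a source system.
    return [{"user": "a", "amount": 10.0}, {"user": "b", "amount": 7.0}]

@asset
def revenue_by_user(raw_events: list[dict]) -> dict:
    # Dagster wires the dependency from the parameter name.
    totals: dict = {}
    for ev in raw_events:
        totals[ev["user"]] = totals.get(ev["user"], 0.0) + ev["amount"]
    return totals

if __name__ == "__main__":
    result = materialize([raw_events, revenue_by_user])
    print(result.success)
```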
In the second part, we will focus on architectural patterns to implement data quality from a data contract perspective. Why is Data Quality Expensive? I won’t bore you with the importance of data quality in the blog. Let’s talk about the data processing types. Should We Build a New Tool?
[link] Georg Heiler: Upskilling data engineers. “What should I prefer for 2028?” or “How can I break into data engineering?” These are common LinkedIn requests. I honestly don’t have a solid answer, but this blog is an excellent overview of upskilling.
It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile’s exposure. This nuanced integration of data and technology empowers us to offer bespoke content recommendations.
Check out this blog. Snowpark Container Services gives developers the ability to bring any containerized workload to their data that is already secure in Snowflake — ReactJS front-ends, open source large language models (LLMs), distributed data processing pipelines, you name it. First, security.
While Apache NiFi is used successfully by hundreds of our customers to power mission-critical and large-scale data flows, the expectations for enterprise data flow solutions are constantly evolving. In this blog post, I want to share the top three requirements for data flows in 2021 that we hear from our customers.
GDS will likely be looking at its cloud-first policy, and specifically its preference for public cloud, in order to understand if it can enable the Government to successfully navigate complex data processing legislation and uncertain future playing fields. The post appeared first on Cloudera Blog.
In today’s fast-paced digital landscape, businesses face the daunting challenge of extracting valuable insights from large amounts of data. The ETL (Extract, Transform, Load) pipeline is the backbone of data processing and analysis.
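As a minimal illustration of the extract-transform-load shape (file and column names hypothetical):

```python
import csv
import sqlite3

# Extract: read raw rows from a CSV export.
with open("orders.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: normalize types and derive a field.
for row in rows:
    row["amount"] = float(row["amount"])
    row["is_large"] = row["amount"] > 100

# Load: write analyst-ready rows into a warehouse table (SQLite stands in here).
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, amount REAL, is_large INTEGER)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(r["id"], r["amount"], int(r["is_large"])) for r in rows],
)
con.commit()
```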
One particularly powerful feature is the ability to import and use Python files (.py) inside a Snowflake Python stored procedure. This capability enables advanced analytics, custom data processing, and seamless integration of Python libraries. In this blog post, we’ll explore how to create and utilize a .py file inside a Snowflake Python stored procedure.
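A hedged sketch of the mechanism (stage, module, and procedure names hypothetical; assumes an existing Snowpark session): registering a stored procedure whose handler imports helper code from a staged .py file.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.types import IntegerType

# Placeholder credentials -- fill in for your account.
connection_parameters = {"account": "...", "user": "...", "password": "..."}
session = Session.builder.configs(connection_parameters).create()

def double_handler(session: Session, n: int) -> int:
    # helpers.py comes from the staged import declared below.
    from helpers import double
    return double(n)

session.sproc.register(
    func=double_handler,
    name="double_sproc",
    return_type=IntegerType(),
    input_types=[IntegerType()],
    imports=["@my_stage/helpers.py"],  # hypothetical stage path holding helpers.py
    packages=["snowflake-snowpark-python"],
    replace=True,
)

print(session.call("double_sproc", 21))  # -> 42, assuming helpers.double doubles its input
```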