Batch data processing — historically known as ETL — is extremely challenging. In this post, we’ll explore how applying the functional programming paradigm to data engineering can bring a lot of clarity to the process. But how do we model this in a functional data warehouse without mutating data?
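One common answer from functional data engineering is to treat each partition as an immutable unit and rebuild it in full rather than updating rows in place. A minimal sketch, assuming a simple file-per-partition layout (paths and column names here are illustrative, not from the article):

```python
import json
from pathlib import Path


def build_daily_partition(raw_events: list[dict], ds: str, out_dir: Path) -> Path:
    """Pure function: the same inputs always produce the same partition."""
    cleaned = [e for e in raw_events if e.get("user_id") is not None]
    target = out_dir / f"ds={ds}" / "events.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Overwrite the whole partition instead of mutating individual rows.
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(cleaned))
    tmp.replace(target)
    return target
```

Re-running the job for the same ds simply replaces the partition with an identical result, which makes backfills and retries idempotent.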
Since it takes so long to iterate on workflows, some ML engineers started to perform data processing directly inside training jobs. This is what we commonly refer to as Last Mile Data Processing. Last Mile processing can boost ML engineers’ velocity as they can write code in Python, directly using PyTorch.
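In PyTorch, that last-mile step often lives inside the Dataset itself, so each example is transformed on the fly as the training loop pulls it. A minimal sketch, assuming a list of raw records with hypothetical field names:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class LastMileDataset(Dataset):
    """Applies lightweight feature transforms at read time, inside the training job."""

    def __init__(self, records: list[dict]):
        self.records = records

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int):
        rec = self.records[idx]
        # Last-mile processing: normalize and tensorize right before training.
        features = torch.tensor(rec["features"], dtype=torch.float32)
        features = (features - features.mean()) / (features.std() + 1e-8)
        label = torch.tensor(rec["label"], dtype=torch.long)
        return features, label


loader = DataLoader(
    LastMileDataset([{"features": [1.0, 2.0, 3.0], "label": 0}]), batch_size=1
)
```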
A tutorial on how to use VDK to perform batch data processing. Versatile Data Kit (VDK) is an open-source data ingestion and processing framework designed to simplify data management complexities.
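Hedging here since the exact API is best checked against the VDK documentation: data jobs are typically organized as numbered step files, each exposing a run(job_input) function. The sketch below follows that convention, with an illustrative payload and table name:

```python
# A sketch of a VDK data job step (e.g. 10_ingest.py); assumes the
# run(job_input) step convention from the Versatile Data Kit docs.
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    payload = {"station": "Gare du Nord", "lost_objects": 42}
    # Send one record to the configured ingestion target (table name is illustrative).
    job_input.send_object_for_ingestion(
        payload=payload,
        destination_table="lost_objects_daily",
    )
```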
Moreover, these steps can be combined in different ways, perhaps omitting some or changing the order of others, producing different data processing pipelines tailored to the particular task at hand. Dependencies are encoded in the types, allowing compile-time checking and serving as code documentation.
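One way to picture "dependencies encoded in the types" is to give each pipeline stage distinct input and output types, so a static checker such as mypy rejects stages wired in the wrong order. A minimal sketch with hypothetical stage names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawRecords:
    rows: list[dict]


@dataclass(frozen=True)
class CleanRecords:
    rows: list[dict]


@dataclass(frozen=True)
class Features:
    vectors: list[list[float]]


def clean(raw: RawRecords) -> CleanRecords:
    return CleanRecords([r for r in raw.rows if r.get("value") is not None])


def featurize(clean_data: CleanRecords) -> Features:
    return Features([[float(r["value"])] for r in clean_data.rows])


# featurize(RawRecords([...]))  # a type checker flags this: RawRecords is not CleanRecords
pipeline_output = featurize(clean(RawRecords([{"value": 3}])))
```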
The company says: “Devin is the new state-of-the-art on the SWE-Bench coding benchmark, has successfully passed practical engineering interviews from leading AI companies, and has even completed real jobs on Upwork.” years ago, and it became the leading AI coding assistant almost overnight. It’s more a copilot.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. What are the techniques/technologies that teams might use to optimize or scale out their data processing workflows? Can you describe what Bodo is and the story behind it?
In order to build high-quality data lineage, we developed different techniques to collect data flow signals across different technology stacks: static code analysis for different languages, runtime instrumentation, and input and output data matching, covering the range of assets (web endpoints, data tables, AI models) used across Meta.
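As a toy illustration of the runtime-instrumentation idea (not Meta's actual tooling), a decorator can record which datasets a job reads and writes, emitting edges for a lineage graph; the dataset names below are hypothetical:

```python
import functools

LINEAGE_EDGES: list[tuple[str, str]] = []


def track_lineage(inputs: list[str], outputs: list[str]):
    """Record dataset-level data flow every time the wrapped job runs."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            for src in inputs:
                for dst in outputs:
                    LINEAGE_EDGES.append((src, dst))
            return result

        return wrapper

    return decorator


@track_lineage(inputs=["raw.events"], outputs=["analytics.daily_events"])
def build_daily_events():
    # ... the actual transformation would go here ...
    return "ok"


build_daily_events()
print(LINEAGE_EDGES)  # [('raw.events', 'analytics.daily_events')]
```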
If you want to break into the field of data engineering but don't yet have any expertise in the field, compiling a portfolio of data engineering projects may help. These projects should demonstrate data pipeline best practices. Source Code: Stock and Twitter Data Extraction Using Python, Kafka, and Spark.
While Pandas is the library for data processing in Python, it isn't really built for speed. Learn more about the new library, Modin, developed to distribute Pandas' computation to speed up your data prep.
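Modin keeps the pandas API and fans the work out across cores, so in many cases adoption is a one-line import swap. A minimal sketch, assuming a local CSV file name:

```python
# pip install "modin[ray]"  (or modin[dask]) -- Modin needs an execution engine.
import modin.pandas as pd  # drop-in replacement for `import pandas as pd`

df = pd.read_csv("events.csv")          # partitioned and read in parallel
summary = df.groupby("user_id").size()  # same pandas API, distributed execution
print(summary.head())
```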
Snowflake AI & ML Studio for LLMs (private preview): Enable users of all technical levels to utilize AI with no-code development. Using Snowflake data processing infrastructure, the service is kept up-to-date with the latest information by automating continuous refreshes as new documents are generated.
How to improve the code quality of your dbt models with unit tests and TDD: all you need to know to start unit testing your dbt SQL models. If you are a data or analytics engineer, you are probably comfortable writing SQL models and testing for data quality with dbt tests.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. Go to dataengineeringpodcast.com/oxylabs today and use code DEP25 to get your special discount on residential proxies.
It is a famous Scala-coded data processing tool that offers low latency, high throughput, and a unified platform to handle data in real time. Apache Kafka is an open-source publish-subscribe messaging application initially developed by LinkedIn in early 2011.
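To make the publish-subscribe model concrete, here is a minimal sketch using the kafka-python client; the broker address and topic name are assumptions for illustration:

```python
from kafka import KafkaConsumer, KafkaProducer

# Publish a message to a topic (broker address and topic name are illustrative).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", value=b'{"user_id": 7, "page": "/home"}')
producer.flush()

# Any number of independent consumer groups can subscribe to the same topic.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
    break  # stop after the first message for this example
```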
Snowflake customers are already harnessing the power of Python through Snowpark, a set of libraries and code execution environments that run Python and other programming languages next to your data in Snowflake. pandas is the go-to data processing library for millions worldwide, including countless Snowflake users.
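A minimal sketch of the Snowpark pattern the excerpt alludes to; connection parameters, table, and column names are placeholders, and details are best confirmed against the Snowpark documentation:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Connection parameters are placeholders; fill in from your own account config.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# DataFrame-style transformations are pushed down and executed next to the data.
orders = session.table("ORDERS")
large_by_region = orders.filter(col("AMOUNT") > 1000).group_by("REGION").count()
large_by_region.show()
```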
These scalable models can handle millions of records, enabling you to efficiently build high-performing NLP data pipelines. However, scaling LLM data processing to millions of records can pose data transfer and orchestration challenges, easily addressed by the user-friendly SQL functions in Snowflake Cortex.
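As an illustration of that SQL-function approach, Cortex exposes LLM operations as ordinary SQL so they can run where the records already live; the table and column names below are placeholders, and the model name should be checked against the currently supported Cortex models:

```python
# Assumes an existing Snowpark `session` (see the sketch above) and a
# placeholder table REVIEWS with a TEXT column; the Cortex functions are
# invoked as plain SQL, so no data leaves Snowflake.
labeled = session.sql(
    """
    SELECT
        TEXT,
        SNOWFLAKE.CORTEX.SENTIMENT(TEXT) AS sentiment,
        SNOWFLAKE.CORTEX.COMPLETE('mistral-large', 'Summarize: ' || TEXT) AS summary
    FROM REVIEWS
    LIMIT 100
    """
)
labeled.show()
```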
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams.
Metric definitions are often scattered across various databases, documentation sites, and code repositories, making it difficult for analysts and data scientists to find reliable information quickly. Besides providing the end user with an instant answer in a preferred data visualization, LORE instantly learns from the user's feedback.
Summary: Streaming data processing enables new categories of data products and analytics. Unfortunately, reasoning about stream processing engines is complex and lacks sufficient tooling. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe.
Designed for processing large data sets, Spark has been a popular solution, yet it is one that can be challenging to manage, especially for users who are new to big data processing or distributed systems. "It provided us with insights into code compatibility and allowed us to better estimate our migration time."
If you are trying to break into (or land a new) data engineering job, you will inevitably encounter a slew of data engineering tools. Key parts of data systems: data flow design, data processing design, code organization, and data storage design.
Using nested data types effectively: using nested data types in data processing; STRUCT enables a more straightforward data schema and data access; nested data types can be sorted.
The Race for Data Quality in a Medallion Architecture. The Medallion architecture pattern is gaining traction among data teams. It is a layered approach to managing and transforming data. By systematically moving data through these layers, the Medallion architecture enhances the data structure in a data lakehouse environment.
Snowflake has embraced serverless since our founding in 2012, with customers providing their code to load, manage and query data and us taking care of the rest. They can easily access multiple code interfaces, including those for SQL and Python, and the Snowflake AI & ML Studio for no-code development.
Snowflake’s new Python API (GA soon) simplifies data pipelines and is readily available through pip install snowflake. Automate or code, the choice is yours. Finally, Tasks Backfill (PrPr) automates historical data processing within Task Graphs. Interact with Snowflake objects directly in Python.
One of the main reasons this feature exists is, just like with food samples, to give you “a taste” of the production-quality ETL code that you could encounter inside the Netflix data ecosystem. Let’s review the transformation steps below.
I like writing code, and each time there is a data processing job to write with some business logic, I'm very happy. However, with time I've learned to appreciate the open source contributions that enhance my daily work. The Mack library, the topic of this blog post, is one of those projects I discovered recently.
Among the various tools available for data integration, Informatica and Talend stand out as popular choices, each with its strengths and capabilities. However, migrating from one platform to another can be a daunting task, especially when it involves converting existing code. Customizable: Tailors to specific project needs and rules.
Examples include “reduce data processing time by 30%” or “minimize manual data entry errors by 50%.” It aims to streamline and automate data workflows, enhance collaboration and improve the agility of data teams. How effective are your current data workflows?
Jing Ge: Context Matters — The Vision of Data Analytics and Data Science Leveraging MCP and A2A. All aspects of software engineering are rapidly being automated with various coding AI tools, as seen in the AI technology radar. Data engineering is one aspect where I see a few startups starting to disrupt.
Matt Harrison is a Python expert with a long history of working with data who now spends his time on consulting and training. Today’s episode is sponsored by Prophecy.io – the low-code data engineering platform for the cloud. What are some of the utility features that you have found most helpful for data processing?
To access real-time data, organizations are turning to stream processing. There are two main data processing paradigms: batch processing and stream processing. For example, your electricity consumption is collected over a month and then processed and billed at the end of that period.
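The difference is easy to see in code: a batch job waits for the whole month of readings, while a streaming job updates the total as each reading arrives. A tiny sketch with made-up numbers:

```python
from typing import Iterable, Iterator

monthly_readings = [1.2, 0.8, 1.5, 2.0]  # kWh readings, made-up values


def batch_bill(readings: Iterable[float], rate: float = 0.25) -> float:
    """Batch: process the complete month of data at once, at the end of the period."""
    return sum(readings) * rate


def streaming_bill(readings: Iterable[float], rate: float = 0.25) -> Iterator[float]:
    """Stream: emit an updated running total as each reading arrives."""
    total = 0.0
    for kwh in readings:
        total += kwh * rate
        yield total


print(batch_bill(monthly_readings))            # one result at month end
print(list(streaming_bill(monthly_readings)))  # a result after every reading
```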
Streamline code deployment, enhance collaboration, and ensure DevOps best practices with Astro's robust CI/CD capabilities. Automate Airflow deploys with built-in CI/CD. The evaluation process includes over 4,000 automated tests, measuring the percentage of passing unit tests, similarity to known passing states, and token usage.
On the data processing side there is Polars, a DataFrame library that could replace pandas. For example: df = pl.read_csv("lost-objects-stations.csv", separator=";") (after import polars as pl; recent Polars versions use separator rather than sep). Then you can use much of the same code as pandas to select the data (head, ["col"], etc.). With this release you can really mix Python and SQL code.
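A short sketch of that Python/SQL mixing, assuming the same hypothetical CSV with station and lost_objects columns; the SQLContext details may differ slightly across Polars releases:

```python
import polars as pl

df = pl.read_csv("lost-objects-stations.csv", separator=";")

# Python-side DataFrame operations, pandas-like.
top_python = df.sort("lost_objects", descending=True).head(5)

# The same data queried with SQL via Polars' SQL context.
ctx = pl.SQLContext(frames={"stations": df})
top_sql = ctx.execute(
    "SELECT station, lost_objects FROM stations ORDER BY lost_objects DESC LIMIT 5",
    eager=True,
)
print(top_python)
print(top_sql)
```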
This relationship is particularly important in SAP® environments, where data and processes must work together seamlessly at scale. To achieve true transformation, you need an aligned approach where both processes and data management evolve together.
To prevent this issue, we built verification into the post-processing stage to ensure that the user ID column in the data matches the identifier for the user whose logs we are generating. Finally, once content has been reviewed, it can be implemented in code using the renderers we described above.
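In spirit, that verification is a defensive check that every row really belongs to the requesting user before the export is released; a toy sketch with hypothetical column and variable names:

```python
def verify_single_user(rows: list[dict], expected_user_id: str) -> None:
    """Fail the export if any row's user_id differs from the requesting user."""
    mismatched = [r for r in rows if r.get("user_id") != expected_user_id]
    if mismatched:
        raise ValueError(
            f"{len(mismatched)} rows do not belong to user {expected_user_id}; "
            "refusing to generate logs."
        )


# Example: this passes silently, because every row matches the expected user.
verify_single_user([{"user_id": "u123", "event": "login"}], expected_user_id="u123")
```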
Commonly, purpose limitation can rely on “point checking” controls at the point of data processing. This approach involves using simple if statements in code (“code assets”) or access control mechanisms for datasets (“data assets”) in data systems.
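A point check of that kind is essentially a guard clause at the call site; the purpose labels and function names below are illustrative:

```python
ALLOWED_PURPOSES = {"fraud_detection", "billing"}  # illustrative policy


def read_user_location(user_id: str, purpose: str) -> dict:
    """Point-checking control: refuse the read unless the stated purpose is allowed."""
    if purpose not in ALLOWED_PURPOSES:
        raise PermissionError(f"Purpose '{purpose}' is not permitted for location data.")
    # ... fetch and return the data here ...
    return {"user_id": user_id, "city": "Paris"}


read_user_location("u123", purpose="fraud_detection")   # allowed
# read_user_location("u123", purpose="ads_targeting")   # would raise PermissionError
```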
Editor’s Note: Data Council 2025, Apr 22-24, Oakland, CA. Data Council has always been one of my favorite events to connect with and learn from the data engineering community. Data Council 2025 is set for April 22-24 in Oakland, CA. The article highlights Nuage 3.0's
KAWA combines analytics, automation and AI agents to help enterprises build data apps and AI workflows quickly and achieve their digital transformation goals. It connects structured and unstructured databases across sources and uses a no-code UI or Python for advanced and predictive analytics.
Databricks Delta Live Tables (DLT) radically simplifies the development of robust data processing pipelines by decreasing the amount of code that data…
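For flavor, a DLT pipeline is typically declared as decorated functions rather than hand-written orchestration code; this is a minimal sketch of that declarative style (table and path names are placeholders), to be checked against the Databricks DLT docs:

```python
# Runs inside a Databricks DLT pipeline, where the `dlt` module and the
# `spark` session are provided by the runtime.
import dlt
from pyspark.sql import functions as F


@dlt.table(comment="Raw events loaded from cloud storage (path is a placeholder).")
def raw_events():
    return spark.read.format("json").load("/mnt/raw/events/")


@dlt.table(comment="Cleaned events; DLT manages dependencies and materialization.")
def clean_events():
    return (
        dlt.read("raw_events")
        .where(F.col("user_id").isNotNull())
        .withColumn("event_date", F.to_date("timestamp"))
    )
```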
Summary: Real-time data processing has steadily been gaining adoption due to advances in the accessibility of the technologies involved. To bring streaming data within reach of application engineers, Matteo Pelati helped to create Dozer. Hex is a collaborative workspace for data science and analytics.