Batch data processing, historically known as ETL, is extremely challenging. It's time-consuming, brittle, and often unrewarding. In this post, we'll explore how applying the functional programming paradigm to data engineering can bring a lot of clarity to the process.
Introduction; Setup; SQL tips: 1. Handy functions for common data processing scenarios; 1.1. STRUCT data types are sorted based on their keys from left to right; 1.2. Need the first/last row in a partition? Use DISTINCT ON; 1.3. …; 1.4. Need to filter on a WINDOW function without a CTE/subquery? Use QUALIFY.
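A quick sketch of tips 1.2 and 1.4 above; the events table and its columns are hypothetical:

```sql
-- First/last row per partition with DISTINCT ON (PostgreSQL/DuckDB syntax):
-- keeps only the latest event per user, no subquery needed.
SELECT DISTINCT ON (user_id) user_id, event_ts, payload
FROM events
ORDER BY user_id, event_ts DESC;

-- Filter on a window function without a CTE or subquery using QUALIFY
-- (Snowflake/DuckDB syntax):
SELECT user_id, event_ts, payload
FROM events
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) = 1;
```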
Summary: Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data being generated continue to double, requiring further advancements in platform capabilities to keep up.
In this edition, we talk to Richard Meng, co-founder and CEO of ROE AI, a startup that empowers data teams to extract insights from unstructured, multimodal data, including documents, images and web pages, using familiar SQL queries. What's the coolest thing you're doing with data? What inspires you as a founder?
These scalable models can handle millions of records, enabling you to efficiently build high-performing NLP data pipelines. However, scaling LLM data processing to millions of records can pose data transfer and orchestration challenges, which are easily addressed by the user-friendly SQL functions in Snowflake Cortex.
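A minimal sketch of what that looks like in practice, assuming a hypothetical product_reviews table (the model name is an assumption; Cortex offers several):

```sql
-- Row-wise LLM processing pushed down into SQL with Snowflake Cortex functions.
SELECT review_id,
       SNOWFLAKE.CORTEX.SENTIMENT(review_text) AS sentiment_score,
       SNOWFLAKE.CORTEX.COMPLETE(
           'llama3-8b',  -- model choice is an assumption
           'Summarize this review in one sentence: ' || review_text
       ) AS summary
FROM product_reviews;
```

Because the functions execute inside Snowflake, records never leave the platform, which is what sidesteps the data transfer problem.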
Data Management: A tutorial on how to use VDK to perform batch data processing. Versatile Data Kit (VDK) is an open-source data ingestion and processing framework designed to simplify data management complexities.
These stages propagate through various systems, including function-based systems that load, process, and propagate data through stacks of function calls in different programming languages (e.g., Hack, C++, Python). For simplicity, we will demonstrate these for the web, the data warehouse, and AI, per the diagram below.
Using nested data types effectively: 3.1. Using nested data types in data processing; 3.2. …; 3.3.1. STRUCT enables a more straightforward data schema and data access; 3.3.2. Nested data types can be sorted; 3.3.3. Use STRUCT for one-to-one & hierarchical relationships.
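A minimal sketch of points 3.3.1 through 3.3.3 in DuckDB syntax (table and field names hypothetical):

```sql
-- 3.3.3: a one-to-one relationship (user -> address) modeled with STRUCT.
CREATE TABLE users (
  user_id INTEGER,
  address STRUCT(street VARCHAR, city VARCHAR, country VARCHAR)
);

INSERT INTO users VALUES
  (1, {'street': '5th Ave', 'city': 'New York', 'country': 'US'}),
  (2, {'street': 'Main St', 'city': 'Austin',   'country': 'US'});

-- 3.3.1: dot notation keeps the schema and data access straightforward.
-- 3.3.2: STRUCTs compare field by field from left to right, so they sort directly.
SELECT user_id, address.city
FROM users
ORDER BY address;
```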
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. What are the techniques/technologies that teams might use to optimize or scale out their data processing workflows? Can you describe what Bodo is and the story behind it?
The typical pharmaceutical organization faces many challenges which slow down the data team: raw, barely integrated data sets require engineers to perform manual, repetitive, error-prone work to create analyst-ready data sets. Cloud computing has made it much easier to integrate data sets, but that's only the beginning.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. Can you describe what Fugue is and the story behind it?
Event-driven and streaming architectures enable complex processing on market events as they happen, making them a natural fit for financial market applications. Flink SQL is a data processing language that enables rapid prototyping and development of event-driven and streaming applications. You can view the code here.
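A minimal Flink SQL sketch of the idea (topic, broker, and table names are hypothetical): a Kafka stream of trades is aggregated into one-minute average prices as the events arrive.

```sql
-- Declare the Kafka topic as a streaming SQL table with an event-time watermark.
CREATE TABLE trades (
  symbol     STRING,
  price      DOUBLE,
  trade_time TIMESTAMP(3),
  WATERMARK FOR trade_time AS trade_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'trades',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json'
);

-- Continuously emit one-minute tumbling-window average prices per symbol.
SELECT symbol,
       TUMBLE_START(trade_time, INTERVAL '1' MINUTE) AS window_start,
       AVG(price) AS avg_price
FROM trades
GROUP BY symbol, TUMBLE(trade_time, INTERVAL '1' MINUTE);
```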
It employs Snowpark Container Services to build scalable AI/ML models for satellite data processing, and Snowflake AI/ML functions to enable advanced analytics and predictive insights for satellite operators. Sherloq aims to change this by offering a collaborative platform for managing and documenting data analytics workflows.
pandas is the go-to data processing library for millions worldwide, including countless Snowflake users. With Snowpark's existing DataFrame API, users already have access to a robust framework for lazily evaluated, relational operations on data, closely resembling Spark's conventions. Why introduce a distributed pandas API?
Fluss is a compelling new project in the realm of real-time data processing. I spoke with Jark Wu, who leads the Fluss and Flink SQL team at Alibaba Cloud, to understand its origins and potential. Among the 20,000 Flink SQL jobs at Alibaba, only 49% of the columns of Kafka data are read on average.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams.
There are two main data processing paradigms: batch processing and stream processing. With batch, for example, your electric consumption is collected over a month and then processed and billed at the end of that period. To access real-time data, organizations are turning to stream processing.
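The billing example maps naturally onto a plain batch query; a sketch with hypothetical table, columns, and tariff:

```sql
-- Batch: the whole month of readings is processed once, after the period closes.
SELECT customer_id,
       SUM(kwh_used) * 0.15 AS amount_due  -- assumed flat rate per kWh
FROM meter_readings
WHERE reading_ts >= DATE '2024-01-01'
  AND reading_ts <  DATE '2024-02-01'
GROUP BY customer_id;
```

A streaming version of the same logic would instead update each customer's running total as every reading arrives.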
QuantumBlack: Solving data quality for gen AI applications. Unstructured data processing is a top priority for enterprises that want to harness the power of GenAI. It brings challenges in data processing and quality, and what data quality even means for unstructured data is a top question for every organization.
Snowflake offers a secure, streamlined approach to developing across data workloads, reducing costs and reliance on external tools. This means faster development and happier data teams. Explore and experiment with data, visualize results, share insights — all in one place. Let’s dive deeper into what we announced.
Data scientists don't need data engineering skills before they can analyze data processed by data engineers, or before they can work in step with other groups (including data engineers) for the progress of the company. Data scientists should, however, acquire some basic SQL skills.
Snowflake Notebooks aim to provide a convenient, easy-to-use interactive environment that seamlessly blends Python, SQL and Markdown, as well as integrations with key Snowflake offerings, like Snowpark ML, Streamlit, Cortex and Iceberg tables. Discover valuable business insights through exploratory data analysis.
Most organizations find it challenging to manage data from diverse sources efficiently. Amazon Web Services (AWS) enables you to address this challenge with Amazon RDS, a scalable relational database service for Microsoft SQL Server (MS SQL). However, simply storing the data isn’t enough.
In addition to log files, sensors, and messaging systems, Striim continuously ingests real-time data from cloud-based or on-premises data warehouses and databases such as Oracle, Oracle Exadata, Teradata, Netezza, Amazon Redshift, SQL Server, HPE NonStop, MongoDB, and MySQL.
Introduction: Data engineering is the field of study that deals with the design, construction, deployment, and maintenance of data processing systems. The goal of this domain is to collect, store, and process data efficiently and effectively so that it can be used to support business decisions and power data-driven applications.
To add this metric to DJ, they need to provide two pieces of information: the fact table that the metric comes from (`SELECT account_id, country_iso_code, streaming_hours FROM streaming_fact_table`) and the metric expression (`SUM(streaming_hours)`). Then metric consumers throughout the organization can call DJ to request either the SQL or the resulting data.
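For illustration, the kind of SQL DJ might hand back when a consumer requests this metric grouped by country (the exact output shape is an assumption; the table and columns come from the definition above):

```sql
-- Hypothetical DJ-generated query: the metric expression applied to its fact table.
SELECT country_iso_code,
       SUM(streaming_hours) AS total_streaming_hours
FROM streaming_fact_table
GROUP BY country_iso_code;
```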
"Big data analytics" is a phrase coined to refer to datasets so large that traditional data processing software simply can't manage them. For example, big data is used to pick out trends in economics, and those trends and patterns are used to predict what will happen in the future.
Should that be the case, Azure SQL Database might be your best bet. Microsoft SQL Server's functionalities are fully included in Azure SQL Database, a cloud-based database service that also offers greater flexibility and scalability. In this article, I will cover the various aspects of Azure SQL Database.
To achieve this, we’re integrating AI into various aspects of our product, such as natural language data queries, text-to-SQL, and chart suggestions. Focusing on the text-to-SQL use case specifically, GPT-4 is exceptional at writing code and highly proficient in SQL. However, to perform optimally, GPT requires context.
Change Data Capture (CDC) has emerged as an ideal solution for near real-time movement of data from relational databases (like SQL Server or Oracle) to data warehouses, data lakes or other databases. The final step of ETL involves loading data into the target destination.
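On the source side, capturing those changes is often a one-time setup. A sketch for SQL Server (the dbo.orders table is hypothetical):

```sql
-- Enable CDC for the database, then for the table to be tracked.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'orders',
    @role_name     = NULL;

-- Downstream tools read the change rows from the generated CDC function.
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_orders(
    sys.fn_cdc_get_min_lsn('dbo_orders'),
    sys.fn_cdc_get_max_lsn(),
    N'all');
```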
Summary: Real-time data processing has steadily been gaining adoption due to advances in the accessibility of the technologies involved. To bring streaming data within reach of application engineers, Matteo Pelati helped create Dozer. Use SQL, Python, R, no-code and AI to find and share insights across your organization.
Snowflake’s latest ML announcements Develop interactively with SQL and Python in Snowflake Notebooks Snowflake Notebooks, in private preview, is a new development interface that offers an interactive, cell-based programming environment for Python and SQL users to explore, process and experiment with data in Snowpark.
Why Future-Proofing Your Data Pipelines Matters: Data has become the backbone of decision-making in businesses across the globe. The ability to harness and analyze data effectively can make or break a company's competitive edge. Set Up Auto-Scaling: Configure auto-scaling for your data processing and storage resources.
Gradient Flow: Paradigm Shifts in Data Processing for the Generative AI Era. Data processing pipelines haven't kept pace with the rapid advancement of AI models. The article highlights the growing importance of preprocessing pipelines, whose processing techniques have yet to match the demand.
In the age of AI, enterprises are increasingly looking to extract value from their data at scale but often find it difficult to establish a scalable data engineering foundation that can process the large amounts of data required to build or improve models.
Define a Sundeck SQL post-hook that examines current load and time of day to suspend idle warehouses. Implement a Sundeck SQL post-hook that collects query activity and records that in a Snowflake table, which is then consulted in a SQL pre-hook to reject excessive consumers (unless a manager overrides).
Obviously, not all tools are made with the same use case in mind, so we are planning to add more code samples for data processing purposes other than classical batch ETL, e.g., machine learning model building and scoring. See the example below of hooking the table creation SQL file into the main workflow definition.
Iceberg is a high-performance open table format for huge analytic data sets. It allows multiple data processing engines, such as Flink, NiFi, Spark, Hive, and Impala, to access and analyze data in simple, familiar SQL tables. This enables you to maximize utilization of streaming data at scale.
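A sketch of that interoperability (catalog, schema, and table names hypothetical): one engine creates the Iceberg table, and any other engine reads it as an ordinary SQL table.

```sql
-- From Spark SQL, with an Iceberg catalog configured:
CREATE TABLE lake.db.events (
  id       BIGINT,
  payload  STRING,
  event_ts TIMESTAMP
) USING iceberg;

-- From Flink, Hive, Impala, or Trino, the same data is just another table:
SELECT COUNT(*)
FROM lake.db.events
WHERE event_ts > TIMESTAMP '2024-01-01 00:00:00';
```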
The Critical Role of AI Data Engineers in a Data-Driven World: How does a chatbot seamlessly interpret your questions? How does a self-driving car understand a chaotic street scene? The answer lies in unstructured data processing, a field that powers modern artificial intelligence (AI) systems.
Behind the scenes, Snowpark ML parallelizes data processing operations by taking advantage of Snowflake's scalable computing platform. This is a first-class, schema-level Snowflake object that provides a versioned container of ML model artifacts with full role-based access control (RBAC) support, and APIs for Python and SQL.
For instance, a small team of two analytics engineers will now pay $2,400/year just to have a server running their SQL queries and a web IDE that has yet to be perfected. Query your data in Kafka using SQL: this post compares Flink, ksqlDB, Trino, Materialize, RisingWave and Timeplus (the authors) for querying Kafka.
In modern data pipelines, handling data in various formats such as CSV, Parquet, and JSON is essential to ensure smooth data processing. However, one of the most common challenges data engineers face is the evolution of schemas as new data comes in.
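One way Snowflake addresses both concerns, sketched below with hypothetical stage, file-format, and table names: infer the initial schema from the staged files, then let the table grow new columns as later files introduce them.

```sql
-- Build the table from the schema detected in staged Parquet files.
CREATE TABLE raw_events
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(INFER_SCHEMA(LOCATION => '@landing_stage',
                            FILE_FORMAT => 'parquet_fmt'))
  );

-- Allow the table to acquire new columns automatically on future loads.
ALTER TABLE raw_events SET ENABLE_SCHEMA_EVOLUTION = TRUE;

-- Columns are matched by name, so files with extra fields load cleanly.
COPY INTO raw_events
FROM @landing_stage
FILE_FORMAT = (FORMAT_NAME = 'parquet_fmt')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```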
Meanwhile, Google BigQuery ML is a machine learning service provided by Google Cloud, allowing you to create and deploy machine learning models using SQL-like syntax directly within the BigQuery environment, ensuring the consistency and integrity of your data.
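A minimal sketch of that SQL-native workflow (dataset, table, and column names hypothetical):

```sql
-- Train a logistic regression model entirely in SQL.
CREATE OR REPLACE MODEL my_dataset.churn_model
OPTIONS (model_type = 'logistic_reg',
         input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_charges, churned
FROM my_dataset.customers;

-- Score new rows with the trained model, still in SQL.
SELECT *
FROM ML.PREDICT(MODEL my_dataset.churn_model,
    (SELECT tenure_months, monthly_charges FROM my_dataset.new_customers));
```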
Structured generative AI: Oren explains how you can constrain generative algorithms to produce structured outputs (like JSON or SQL, seen as an AST). This is super interesting because it details important steps of the generative process. SQLMesh is bringing fresh ideas to the SQL transformation landscape.