In today's data-driven world, organizations across industries are dealing with massive volumes of data, complex pipelines, and the need for efficient data processing.
I've always considered horizontal scaling the one true scaling policy for elastic data processing pipelines, yet "vertical scaling" has caught my attention a few times recently while reading about cloud updates. Have I been wrong?
Since it takes so long to iterate on workflows, some ML engineers have started performing data processing directly inside their training jobs. This is what we commonly refer to as Last Mile Data Processing. Last mile processing can boost ML engineers' velocity, since they can write the transformations in Python, directly with PyTorch.
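To make the idea concrete, here is a minimal sketch (not any specific team's implementation) of last-mile transforms living inside a PyTorch Dataset, so feature logic ships with the training job; the record fields are hypothetical:

```python
# Last-mile processing sketch: transformations run inside the training job via
# a PyTorch Dataset, rather than in a separate upstream pipeline.
import torch
from torch.utils.data import Dataset, DataLoader

class LastMileDataset(Dataset):
    def __init__(self, records):
        self.records = records  # raw rows, e.g. loaded from files or a feature store

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        raw = self.records[idx]
        # Last-mile transforms live here, in plain Python, next to the model code.
        features = torch.tensor([raw["clicks"] / 100.0, raw["dwell_ms"] / 1e3])
        label = torch.tensor(raw["converted"], dtype=torch.float32)
        return features, label

records = [{"clicks": 12, "dwell_ms": 3400, "converted": 1},
           {"clicks": 3, "dwell_ms": 800, "converted": 0}]
loader = DataLoader(LastMileDataset(records), batch_size=2)
for x, y in loader:
    print(x.shape, y)
```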
Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data being generated continue to double, requiring further advancements in platform capabilities to keep up.
Real-time data processing can satisfy the ever-increasing demand for… The post 5 Real-Time Data Processing and Analytics Technologies – And Where You Can Implement Them appeared first on Seattle Data Guy.
A tutorial on how to use VDK to perform batch data processing. Versatile Data Kit (VDK) is an open-source data ingestion and processing framework designed to simplify data management complexities.
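For flavor, a minimal sketch of a VDK data job step, based on VDK's documented IJobInput interface; treat the exact method names and arguments as assumptions to verify against the docs for your version:

```python
# Sketch of a VDK data job step (e.g. saved as 10_ingest.py inside a data job
# directory). Method signatures are assumptions based on VDK's IJobInput docs.
from vdk.api.job_input import IJobInput

def run(job_input: IJobInput):
    # Run a SQL statement against the job's configured database.
    job_input.execute_query(
        "CREATE TABLE IF NOT EXISTS daily_counts (day DATE, n BIGINT)"
    )
    # Queue a record for ingestion into a destination table.
    job_input.send_object_for_ingestion(
        payload={"day": "2024-01-01", "n": 42},
        destination_table="daily_counts",
    )
```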
Understanding the nature of the late-arriving data and the processing requirements will help you decide which pattern is most appropriate for a use case. Stateful Data Processing: this pattern is useful when the output depends on a sequence of events across one or more input streams.
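A framework-free sketch of the idea, where the emitted output depends on running per-key state (analogous to keyed state in Flink or Spark Structured Streaming); the event fields are hypothetical:

```python
# Stateful processing sketch: each output depends on everything seen so far
# for the event's key, not just on the current event.
from collections import defaultdict

state = defaultdict(lambda: {"count": 0, "total": 0.0})

def process(event):
    s = state[event["user_id"]]          # keyed state lookup
    s["count"] += 1
    s["total"] += event["amount"]
    # Emit an output derived from the accumulated state for this key.
    return {"user_id": event["user_id"], "avg": s["total"] / s["count"]}

for e in [{"user_id": "u1", "amount": 10.0},
          {"user_id": "u1", "amount": 20.0}]:
    print(process(e))
```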
Data consistency, feature reliability, processing scalability, and end-to-end observability are key drivers to ensuring business as usual (zero disruptions) and a cohesive customer experience. With our new data processing framework, we were able to observe a multitude of benefits, including 99.9%
StreamNative, a leading Apache Pulsar-based real-time data platform solutions provider, and Databricks, the Data Intelligence Platform, are thrilled to announce the enhanced Pulsar-Spark.
In this first article, we're exploring Apache Beam, from a simple pipeline to a more complicated one, using GCP Dataflow. Let's learn what…
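As a taste of Beam, here is a minimal word-count-style pipeline that runs locally on the DirectRunner; the same code can target Dataflow by supplying runner options:

```python
# Minimal Apache Beam pipeline: count occurrences of each word.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(["alpha", "beta", "alpha"])
        | "PairWithOne" >> beam.Map(lambda w: (w, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```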
Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform built on top of the Microsoft Azure cloud. Its collaborative and interactive workspace allows users to perform big data processing and machine learning tasks easily.
Semih explains how Kuzu addresses the challenges of large graph analytics, the benefits of embeddability, and its potential for applications in AI and beyond. Discover the insights he gained from academia and industry, his perspective on the future of data processing, and the story behind building a next-generation graph database.
Data engineering is the field of study that deals with the design, construction, deployment, and maintenance of data processing systems. The goal of this domain is to collect, store, and process data efficiently and effectively so that it can be used to support business decisions and power data-driven applications.
Every data scientist needs an efficient and reliable tool to process this ever-growing flood of big data. Today we discuss one such tool, Delta Lake, which data enthusiasts use to make their data processing pipelines more efficient and reliable.
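A minimal sketch using the deltalake (delta-rs) Python package; the table path and data are placeholders, and the point is that every write lands as a versioned, ACID commit:

```python
# Delta Lake sketch with the deltalake (delta-rs) package: writes are atomic
# commits recorded in the table's transaction log.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
write_deltalake("/tmp/events_delta", df, mode="overwrite")

dt = DeltaTable("/tmp/events_delta")
print(dt.version())     # each commit bumps the table version
print(dt.to_pandas())   # read the table back as a DataFrame
```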
Building efficient data pipelines with DuckDB:
- Use DuckDB to process data, not for multiple users to access data
- Cost calculation: DuckDB + ephemeral VMs = dirt-cheap data processing
- Processing data less than 100GB? Use DuckDB
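A minimal illustration of the "use DuckDB to process data" point: a single process running SQL over local data, no cluster required:

```python
# DuckDB sketch: in-process SQL engine for batch processing.
import duckdb

con = duckdb.connect()  # in-memory database
con.execute("CREATE TABLE t AS SELECT range AS i FROM range(10)")
print(con.sql("SELECT count(*) AS n, sum(i) AS total FROM t").fetchall())
# Files can also be queried in place, e.g.:
# con.sql("SELECT count(*) FROM 'data/events/*.parquet'")
```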
Big Data refers to large and complex datasets, generated by various sources and growing exponentially. Such data is so extensive and diverse that traditional data processing methods cannot handle it. The volume, velocity, and variety of Big Data can make it difficult to process and analyze.
Big data processing is crucial today. Big data analytics and learning help corporations foresee client demands, provide useful recommendations, and more. Hadoop, the open-source software framework for scalable and distributed computation over massive data sets, makes it easy.
Apache Kafka is an open-source publish-subscribe messaging application initially developed by LinkedIn in early 2011. It is a famous Scala-coded data processing tool that offers low latency, extensive throughput, and a unified platform to handle data in real time.
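A minimal produce/consume sketch with the kafka-python client; the broker address and topic are hypothetical, and a broker must be running for this to work:

```python
# Kafka publish-subscribe sketch with the kafka-python client.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b'{"user": "u1", "action": "click"}')
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for msg in consumer:
    print(msg.value)
```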
Data pipelines play a critical role in the processing and management of data in modern organizations. A well-designed data pipeline can help organizations extract valuable insights from their data, automate tedious manual processes, and ensure the accuracy of data processing.
pandas is the go-to data processing library for millions worldwide, including countless Snowflake users. With Snowpark's existing DataFrame API, users already have access to a robust framework for lazily evaluated, relational operations on data, closely resembling Spark's conventions. Why introduce a distributed pandas API?
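For context, a sketch of the existing Snowpark DataFrame API the excerpt mentions; connection parameters are placeholders, and nothing executes until an action like show() is called:

```python
# Snowpark DataFrame sketch: lazily evaluated, relational operations that
# compile to SQL and run inside Snowflake.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<wh>", "database": "<db>", "schema": "<schema>",
}).create()

orders = session.table("ORDERS")                     # no data pulled yet
big = orders.filter(col("AMOUNT") > 100).select("ORDER_ID", "AMOUNT")
big.show()   # the query is compiled and executed in Snowflake here
```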
In this edition, we talk to Richard Meng, co-founder and CEO of ROE AI, a startup that empowers data teams to extract insights from unstructured, multimodal data including documents, images and web pages using familiar SQL queries. What inspires you as a founder?
Why Future-Proofing Your Data Pipelines Matters Data has become the backbone of decision-making in businesses across the globe. The ability to harness and analyze data effectively can make or break a company’s competitive edge. Set Up Auto-Scaling: Configure auto-scaling for your data processing and storage resources.
When they deploy Contextual AI as a Snowflake Native App , they get the peace of mind that comes with running the platform and the data processing inside their own Snowflake environment, while Snowflake manages the infrastructure complexity, including server management, scaling and maintenance.
As data volumes surge and the need for fast, data-driven decisions intensifies, traditional data processing methods no longer suffice. To stay competitive, organizations must embrace technologies that enable them to process data in real time, empowering them to make intelligent, on-the-fly decisions.
Explore AI and unstructured data processing use cases with proven ROI: This year, retailers and brands will face intense pressure to demonstrate tangible returns on their AI investments.
Is ChatGPT useful for Python programmers, specifically those of us who use Python for data processing, data cleaning, and building machine learning models? Let's give it a try and find out.
As data increased in volume, velocity, and variety, so, in turn, did the need for tools that could help process and manage those larger data sets coming at us at ever faster speeds.
Understanding this framework offers valuable insights into team efficiency, operational excellence, and data quality. Process-centric data teams focus their energies predominantly on orchestrating and automating workflows. Over the years, we have also been helping data-centric data teams.
To overcome these hurdles, CTC moved its processing off of managed Spark and onto Snowflake, where it had already built its data foundation. Thanks to the reduction in costs, CTC now maximizes data to further innovate and increase its market-making capabilities.
Examples include “reduce data processing time by 30%” or “minimize manual data entry errors by 50%.” It aims to streamline and automate data workflows, enhance collaboration and improve the agility of data teams. How effective are your current data workflows?
Advanced Data Transformation Techniques For data engineers ready to push the boundaries, advanced data transformation techniques offer the tools to tackle complex data challenges and drive innovation. Automated testing and validation steps can also streamline transformation processes, ensuring reliable outcomes.
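As one hedged example of such an automated validation step, a transform can be paired with assertions that fail fast before results are published; the column names and rules below are hypothetical:

```python
# Transformation plus automated validation: the checks run before anything
# is handed downstream, so bad outputs fail loudly.
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    out = df.dropna(subset=["amount"]).copy()
    out["amount_usd"] = out["amount"] * out["fx_rate"]
    return out

def validate(df: pd.DataFrame) -> None:
    assert df["amount_usd"].ge(0).all(), "negative amounts after transform"
    assert df["amount_usd"].notna().all(), "nulls introduced by transform"

raw = pd.DataFrame({"amount": [10.0, None, 5.0], "fx_rate": [1.1, 1.1, 0.9]})
clean = transform(raw)
validate(clean)   # fail fast before publishing downstream
print(clean)
```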
Streaming jobs are supposed to run continuously, but strictly speaking that applies to the data processing logic, not the deployed package. After all, sometimes you need to release a new job package with upgraded dependencies or improved business logic. What happens then?
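One common answer is to make the job safely restartable, e.g. in Spark Structured Streaming, where progress lives in a checkpoint location so a redeployed package resumes where the old one stopped; the paths and topic below are hypothetical:

```python
# Spark Structured Streaming sketch: stop the job, ship a new package, and the
# checkpoint lets the replacement resume from the recorded offsets.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("upgradable-stream").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

query = (events.selectExpr("CAST(value AS STRING) AS body")
         .writeStream
         .format("parquet")
         .option("path", "/data/out")
         .option("checkpointLocation", "/data/checkpoints/events")  # survives restarts
         .start())
query.awaitTermination()
```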
If you are trying to break into (or land a new) data engineering job, you will inevitably encounter a slew of data engineering tools. Key parts of data systems:
- Requirements
- Data flow design
- Data processing design
- Data storage design
Using nested data types effectively:
- Using nested data types in data processing
- STRUCT enables a more straightforward data schema and data access
- Nested data types can be sorted
- Use STRUCT for one-to-one & hierarchical relationships
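A small DuckDB-flavored sketch of two of these points, STRUCT access and STRUCT sorting; the schema is hypothetical:

```python
# Nested STRUCT columns in DuckDB: simpler schemas for one-to-one
# relationships, with direct dotted access to fields.
import duckdb

duckdb.sql("""
    CREATE TABLE users AS
    SELECT 1 AS id, {'street': '1 Main St', 'city': 'Springfield'} AS address
    UNION ALL
    SELECT 2, {'street': '9 Elm St', 'city': 'Shelbyville'}
""")
# Dotted access into the STRUCT, plus sorting: STRUCTs compare by their
# fields from left to right.
print(duckdb.sql("SELECT id, address.city FROM users ORDER BY address").fetchall())
```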
Introduction, Setup, SQL tips:
- Need to filter on a WINDOW function without a CTE/subquery? Use QUALIFY
- Need the first/last row in a partition? Use DISTINCT ON
- STRUCT data types are sorted based on their keys from left to right
- Handy functions for common data processing scenarios
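A runnable DuckDB-flavored sketch of the first two tips (DuckDB supports both QUALIFY and DISTINCT ON); the sample table is hypothetical:

```python
# QUALIFY and DISTINCT ON demonstrated in DuckDB.
import duckdb

duckdb.sql("""
    CREATE TABLE sales AS SELECT * FROM (VALUES
        ('a', 1, 10), ('a', 2, 30), ('b', 1, 20), ('b', 2, 5)
    ) AS t(shop, day, amount)
""")

# Tip: QUALIFY filters on the window function without a CTE/subquery.
print(duckdb.sql("""
    SELECT shop, day, amount
    FROM sales
    QUALIFY row_number() OVER (PARTITION BY shop ORDER BY amount DESC) = 1
""").fetchall())

# Tip: DISTINCT ON keeps the first row per shop given the ORDER BY.
print(duckdb.sql("""
    SELECT DISTINCT ON (shop) shop, day, amount
    FROM sales
    ORDER BY shop, amount DESC
""").fetchall())
```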
In the age of AI, enterprises are increasingly looking to extract value from their data at scale but often find it difficult to establish a scalable data engineering foundation that can process the large amounts of data required to build or improve models.
The Race For Data Quality In A Medallion Architecture The Medallion architecture pattern is gaining traction among data teams. It is a layered approach to managing and transforming data. By systematically moving data through these layers, the Medallion architecture enhances the data structure in a data lakehouse environment.
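As a toy illustration (using the common bronze/silver/gold naming, which the excerpt itself doesn't spell out), data moves from raw to cleaned to business-ready aggregates; columns and rules are hypothetical:

```python
# Medallion layering sketch: bronze (raw) -> silver (clean) -> gold (business).
import pandas as pd

# Bronze: raw, as-landed data (duplicates and bad values included).
bronze = pd.DataFrame({"order_id": [1, 1, 2], "amount": ["10", "10", "oops"]})

# Silver: deduplicated, typed, and filtered.
silver = bronze.drop_duplicates().copy()
silver["amount"] = pd.to_numeric(silver["amount"], errors="coerce")
silver = silver.dropna(subset=["amount"])

# Gold: business-level aggregate ready for consumption.
gold = pd.DataFrame({"total_revenue": [silver["amount"].sum()]})
print(gold)
```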
From Wikipedia: “In the late 1950s, computer users and manufacturers were becoming concerned about the rising cost of programming. A 1959 survey had found that in any data processing installation, the programming cost US$800,000 on average and that translating programs to run on new hardware would cost $600,000.”
Even though modern data processing frameworks and data stores have smart query planners, they don't relieve us of the responsibility to design the job logic correctly.
Finally, Tasks Backfill (private preview) automates historical data processing within Task Graphs. Additionally, Dynamic Tables are a new table type that you can use at every stage of your processing pipeline. Follow this quickstart to test-drive Dynamic Tables yourself. Snowflake integrates with GitHub, GitLab, Azure DevOps and Bitbucket.
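A hedged sketch of creating a Dynamic Table from Python via snowflake-connector-python; account details, the warehouse, and the source table are placeholders:

```python
# Dynamic Table sketch: TARGET_LAG tells Snowflake how stale the table may
# get before it refreshes from the defining query.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="compute_wh", database="<db>", schema="<schema>",
)
conn.cursor().execute("""
    CREATE OR REPLACE DYNAMIC TABLE daily_orders
    TARGET_LAG = '1 hour'
    WAREHOUSE = compute_wh
    AS SELECT order_date, count(*) AS order_count
       FROM raw_orders GROUP BY order_date
""")
```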
Architectural Patterns for Data Quality Now we understand the trade-off between speed & correctness and the difference between data testing and observability. Let’s talk about the data processing types. Two-Phase WAP The Two-Phase WAP (Write-Audit-Publish), as the name suggests, follows two copy processes.
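A framework-free sketch of the basic WAP flow (write to staging, audit, publish atomically); paths and the audit rule are hypothetical, and real implementations typically lean on table formats like Iceberg rather than file moves:

```python
# Write-Audit-Publish sketch: consumers only ever see data that passed audit.
import os
import shutil
import pandas as pd

def write(df, staging="/tmp/wap_staging"):
    os.makedirs(staging, exist_ok=True)
    df.to_parquet(os.path.join(staging, "part-0.parquet"))
    return staging

def audit(staging):
    df = pd.read_parquet(staging)
    assert len(df) > 0, "empty batch"
    assert df["amount"].notna().all(), "null amounts"

def publish(staging, prod="/tmp/wap_prod"):
    shutil.rmtree(prod, ignore_errors=True)
    shutil.move(staging, prod)   # swap in the audited batch

batch = pd.DataFrame({"amount": [1.5, 2.0]})
staged = write(batch)
audit(staged)
publish(staged)
```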