I find the good, the bad, the ugly, and splay them out before you, string ’em up and […] The post Testing DuckDB’s Larger-Than-Memory Processing Capabilities appeared first on Confessions of a Data Guy.
However, much of the data that is being created and will be created comes in some form of unstructured format. In the digital era… Read more The post What is Unstructured Data? A Guide to Storage, Processing, and Analysis appeared first on Seattle Data Guy.
Introduction In today’s data-driven world, organizations across industries are dealing with massive volumes of data, complex pipelines, and the need for efficient data processing.
Unlocking Data Team Success: Are You Process-Centric or Data-Centric? We’ve identified two distinct types of data teams: process-centric and data-centric. Process-centric data teams focus their energies predominantly on orchestrating and automating workflows. They work in and on these pipelines.
Speaker: Andrew Skoog, Founder of MachinistX & President of Hexis Representatives
Smart automation and AI-driven software are revolutionizing decision-making, optimizing processes, and improving efficiency. 🛠 Confident implementation: discover best practices for integrating new technology into your processes without disruption.
Data and process automation used to be seen as a luxury, but those days are gone. Let’s explore the top challenges to data and process automation adoption in more detail. Almost half of respondents (47%) reported a medium level of automation adoption, meaning they currently have a mix of automated and manual SAP processes.
What is Real-Time Stream Processing? There are two main data processing paradigms: batch processing and stream processing. To access real-time data, organizations are turning to stream processing.
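The batch-versus-stream distinction in that excerpt can be sketched in a few lines of Python; the `transform` function and record shape here are hypothetical stand-ins for illustration, not from any particular framework:

```python
def transform(record: dict) -> dict:
    """Hypothetical per-record transformation (doubles a value)."""
    return {**record, "value": record["value"] * 2}

def batch_process(records: list) -> list:
    # Batch paradigm: all records are collected first, then processed in one pass.
    return [transform(r) for r in records]

def stream_process(records):
    # Stream paradigm: each record is processed as it arrives and the
    # result is yielded immediately, keeping end-to-end latency low.
    for r in records:
        yield transform(r)
```

Both paths produce the same results; the difference is when each result becomes available.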
Natural Language Processing (NLP) is transforming the manufacturing industry by enhancing decision-making, enabling intelligent automation, and improving quality control. Let’s learn more about the use cases of NLP in manufacturing and […] The post Natural Language Processing (NLP) in Manufacturing appeared first on WeCloudData.
This exhaustive guide with a foreword from BI analyst Jen Underwood dives deep into the BI buying process and explores how to decide what features you need. And as the number of vendors grows, it gets harder to make sense of it all. Don't go into the fray unarmed.
But automation isn’t just for analytics. RevOps teams want to streamline processes… Read more The post Best Automation Tools In 2025 for Data Pipelines, Integrations, and More appeared first on Seattle Data Guy.
What will data engineering look like in 2025? How will generative AI shape the tools and processes Data Engineers rely on today? As the field evolves, Data Engineers are stepping into a future where innovation and efficiency take center stage.
This belief has led us to develop Privacy Aware Infrastructure (PAI), which offers efficient and reliable first-class privacy constructs embedded in Meta infrastructure to address different privacy requirements, such as purpose limitation, which restricts the purposes for which data can be processed and used.
A data engineering architecture is the structural framework that determines how data flows through an organization – from collection and storage to processing and analysis. And who better to learn from than the tech giants who process more data before breakfast than most companies see in a year?
Assumptions mapping is the process of identifying and testing your riskiest ideas. Watch this webinar with Laura Klein, product manager and author of Build Better Products, to learn how to spot the unconscious assumptions which you’re basing decisions on and guidelines for validating (or invalidating) your ideas.
Other shipped things include DALL·E 3 (image generation), GPT-4 (an advanced model), and the OpenAI API, which developers and companies use to integrate AI into their processes. Each word that spits out of ChatGPT is this same process repeated over and over again many times per second.
For image data, running distributed PyTorch on Snowflake ML also with standard settings resulted in over 10x faster processing for a 50,000-image dataset when compared to the same managed Spark solution. Many enterprises are already using Container Runtime to cost-effectively build advanced ML use cases with easy access to GPUs.
They often have quotas and limits that you, as a data engineer, have to take into account in your daily work. These limits become even more serious in a latency-sensitive context such as stream processing.
Introduction Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform that is built on top of the Microsoft Azure cloud. A collaborative and interactive workspace allows users to perform big data processing and machine learning tasks easily.
Speaker: Jay Allardyce, Deepak Vittal, Terrence Sheflin, and Mahyar Ghasemali
We’ll explore how recent developments are impacting strategic planning and decision-making processes, as well as practical strategies to leverage these trends to the benefit of your organization.
Our deployments were initially manual. Avoiding downtime was nerve-wracking, and the notion of a 'rollback' was as much a relief as a technical process. After this zero-byte file was deployed to prod, the Apache web server processes slowly picked up the empty configuration file. Apache started to log like a maniac.
Introducing sufficient jitter to the flush process can further reduce contention. By creating multiple topic partitions and hashing the counter key to a specific partition, we ensure that the same set of counters are processed by the same set of consumers. This process can also be used to track the provenance of increments.
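The two ideas in that excerpt, hashing a counter key to a stable partition and jittering the flush interval, can be sketched in plain Python; the partition count and base interval below are invented for illustration, and no Kafka client is involved:

```python
import hashlib
import random

NUM_PARTITIONS = 8  # hypothetical number of topic partitions

def partition_for(counter_key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of the counter key, so the same key always lands on the
    same partition and is processed by the same consumer."""
    digest = hashlib.sha256(counter_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def jittered_flush_delay(base_seconds: float = 5.0, jitter_fraction: float = 0.2) -> float:
    """Base flush interval plus random jitter, so many writers do not all
    flush at the same instant and contend for the same resources."""
    return base_seconds * (1 + random.uniform(-jitter_fraction, jitter_fraction))
```

Because `partition_for` is deterministic, repeated increments of one counter are routed consistently, while the jitter spreads flushes out in time.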
But data volumes grow, analytical demands become more complex, and Postgres stops being enough. Therefore, you’ve probably come across terms like OLAP (Online Analytical Processing) systems, data warehouses, and, more recently, real-time analytical databases.
But getting a handle on all the emails, calls and support tickets had historically been a tedious and largely manual process. After migrating all of its historical feedback data to Snowflake, however, Advisor360° created an automated pipeline using Cortex AI to cover that end-to-end process of gauging customer sentiment.
Speaker: Shreya Rajpal, Co-Founder and CEO at Guardrails AI & Travis Addair, Co-Founder and CTO at Predibase
Putting the right LLMOps process in place today will pay dividends tomorrow, enabling you to leverage the part of AI that constitutes your IP – your data – to build a defensible AI strategy for the future.
1. Introduction 2. Project demo 3. Building efficient data pipelines with DuckDB 4.1. Use DuckDB to process data, not for multiple users to access data 4.2. Cost calculation: DuckDB + Ephemeral VMs = dirt cheap data processing 4.3. Processing data less than 100GB? Use DuckDB 4.4.
Each aspect of data science, like data preparation, the importance of big data, and the process of automation, contributes to how data science is the future […] The post 30 Best Data Science Books to Read in 2023 appeared first on Analytics Vidhya.
Here’s how Snowflake Cortex AI and Snowflake ML are accelerating the delivery of trusted AI solutions for the most critical generative AI applications: Natural language processing (NLP) for data pipelines: Large language models (LLMs) have transformative potential, but integrating batch inference into pipelines can be cumbersome.
As I sit down to write this article, I’m filled with a sense of vulnerability and excitement. You see, this is a story that only I can tell. It’s a tale of finding my Pathless Path and discovering who I am in the process. Along the way, I discovered the importance of staying flexible and adaptable.
You’ll learn: Seven graphics libraries developers can use to enhance in-app analytics Easy-to-use wireframe tools to help the design and approval process The importance of modernizing your embedded analytics Download the e-book to learn about the seven-plus graphics libraries to enhance your embedded analytics.
Introduction Data engineering is the field of study that deals with the design, construction, deployment, and maintenance of data processing systems. The goal of this domain is to collect, store, and process data efficiently and effectively so that it can be used to support business decisions and power data-driven applications.
Introduction Every data scientist demands an efficient and reliable tool to process this big unstoppable data. Today we discuss one such tool called Delta Lake, which data enthusiasts use to make their data processing pipelines more efficient and reliable.
Introduction Big Data is a large and complex dataset generated by various sources that grows exponentially. It is so extensive and diverse that traditional data processing methods cannot handle it. The volume, velocity, and variety of Big Data can make it difficult to process and analyze.
Code and raw data repository: Version control: GitHub. Heavily using GitHub Actions for things like getting warehouse data from vendor APIs, starting cloud servers, running benchmarks, processing results, and cleaning up after runs. Internal comms: Chat: Slack. Coordination / project management: Linear.
As an innovative concept, Developer Experience (DX) has gained significant attention in the tech industry, and emphasizes engineers’ efficiency and satisfaction during the product development process.
Data is at the core of everything, from business decisions to machine learning. But processing large-scale data across different systems is often slow. Traditional row-based storage formats struggle to keep up with modern analytics, and constant format conversions add processing time and memory overhead.
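The row-versus-column trade-off that excerpt alludes to can be shown with the same records held both ways in plain Python; this is a toy layout comparison for illustration, not how Arrow or any real columnar engine stores data:

```python
# The same 1,000 records in a row-oriented and a column-oriented layout.
rows = [{"id": i, "amount": float(i)} for i in range(1000)]
columns = {"id": list(range(1000)), "amount": [float(i) for i in range(1000)]}

def total_row_oriented(rows: list) -> float:
    # Row layout: every record must be visited to read a single field.
    return sum(r["amount"] for r in rows)

def total_column_oriented(cols: dict) -> float:
    # Column layout: the needed field is already contiguous, so an
    # aggregate scans one list and skips every other column entirely.
    return sum(cols["amount"])
```

Both return the same total; analytical engines favor the second layout because aggregates read only the columns they need.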
Customer intelligence teams analyze reviews and forum comments to identify sentiment trends, while support teams process tickets to uncover product issues and inform gaps in a product roadmap. As data volumes grow and AI automation expands, cost efficiency in processing with LLMs depends on both system architecture and model flexibility.
Recognize that artificial intelligence is a data governance accelerator and a process that must be governed to monitor ethical considerations and risk. Align people, processes, and technology Successful data governance requires a holistic approach. Tools are important, but they need to complement your strategy.
Define what the output dataset will look like 3.1.3. Define SLAs so stakeholders know what to expect 3.1.4. Define checks to ensure the output dataset is usable 3.2. Identify what tool to use to process data 3.3. Data flow architecture
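The "define checks to ensure the output dataset is usable" step in that outline can be made concrete with a small validator; the column names and null threshold below are illustrative assumptions, not from the post:

```python
def check_output(rows: list, required_columns: list, max_null_fraction: float = 0.0) -> list:
    """Return a list of human-readable failures; an empty list means the
    output dataset passes its usability checks."""
    failures = []
    if not rows:
        failures.append("output dataset is empty")
        return failures
    for col in required_columns:
        missing = sum(1 for r in rows if r.get(col) is None)
        if missing / len(rows) > max_null_fraction:
            failures.append(f"column '{col}' has {missing}/{len(rows)} nulls")
    return failures
```

A pipeline would run such checks on the output dataset before publishing it, failing loudly instead of handing stakeholders unusable data.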
Just by embedding analytics, application owners can charge 24% more for their product. How much value could you add? This framework explains how application enhancements can extend your product offerings. Brought to you by Logi Analytics.
Semih explains how Kuzu addresses the challenges of large graph analytics, the benefits of embeddability, and its potential for applications in AI and beyond. Discover the insights he gained from academia and industry, his perspective on the future of data processing and the story behind building a next-generation graph database.
The end-to-end lineage also automates tasks such as predicting the impact of a process change, analyzing the impact of a broken process, discovering parallel processes performing the same tasks, and performing root cause analysis to uncover the source of reporting errors.
The Medallion architecture is a design pattern that helps data teams organize data processing and storage into three distinct layers, often called Bronze, Silver, and Gold. By methodically processing data through Bronze, Silver, and Gold layers, this approach supports a variety of use cases. Bronze layers should be immutable.
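A toy version of the Bronze/Silver/Gold flow described above, assuming hypothetical order records (the field names are invented for illustration): Bronze keeps raw rows untouched, Silver deduplicates and validates, Gold aggregates for consumption.

```python
from collections import defaultdict

def to_silver(bronze_rows: list) -> list:
    """Silver: clean and deduplicate the immutable Bronze rows."""
    seen = set()
    silver = []
    for row in bronze_rows:
        if row.get("order_id") is None:
            continue  # drop rows that fail validation
        if row["order_id"] in seen:
            continue  # drop duplicate orders
        seen.add(row["order_id"])
        silver.append(row)
    return silver

def to_gold(silver_rows: list) -> dict:
    """Gold: aggregate Silver rows into per-customer totals for reporting."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)
```

Keeping Bronze immutable means Silver and Gold can always be rebuilt from the raw data if cleaning or aggregation logic changes.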
SnowConvert can automate more than 96% of the code and object conversion process, as demonstrated by the many migration projects executed over the years, making it a proven solution for migrations from Oracle, SQL Server and Teradata. And today, we are announcing expanded support for code conversions from Amazon Redshift to Snowflake.
Think your customers will pay more for data visualizations in your application? Five years ago they may have. But today, dashboards and visualizations have become table stakes. Discover which features will differentiate your application and maximize the ROI of your embedded analytics. Brought to you by Logi Analytics.