Considering how rapidly most industries have evolved thanks to technology, upgrading grids has become a top priority for utility companies. The application of Artificial Intelligence (AI) to grid infrastructure is now a game changer for utility managers.
The energy and utilities industry is being transformed by AI, powered by the digital revolution. One of its newest forms, generative AI, is bolstering the reliability, efficiency, and resilience of utility operations. Its place in modern utilities is most evident in real-time fault detection.
How will generative AI shape the tools and processes Data Engineers rely on today? GenAI is already transforming how data is managed, analyzed, and utilized, paving the way for […] The post Top 11 GenAI Powered Data Engineering Tools to Follow in 2025 appeared first on Analytics Vidhya.
This belief has led us to develop Privacy Aware Infrastructure (PAI), which offers efficient and reliable first-class privacy constructs embedded in Meta infrastructure, across the languages used there (Hack, C++, Python, etc.), to address different privacy requirements such as purpose limitation, which restricts the purposes for which data can be processed and used.
What is Real-Time Stream Processing? To access real-time data, organizations are turning to stream processing. There are two main data processing paradigms: batch processing and stream processing.
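A minimal sketch of the difference, using a toy list of readings (none of this comes from the article itself): batch waits for the full dataset before computing once, while streaming updates its result as each event arrives.

```python
from typing import Iterable, Iterator

def batch_average(readings: Iterable[float]) -> float:
    """Batch processing: wait for the full dataset, then compute once."""
    values = list(readings)
    return sum(values) / len(values)

def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Stream processing: update the running result as each event arrives."""
    total, count = 0.0, 0
    for value in readings:
        total += value
        count += 1
        yield total / count  # a fresh result is available per event

data = [10.0, 12.0, 11.5, 13.0]
print("batch result:", batch_average(data))
for running_avg in streaming_average(data):
    print("streaming result so far:", running_avg)
```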
The name comes from the concept of “spare cores”: machines currently unused, which can be reclaimed at any time, and which cloud providers tend to offer at a steep discount to keep server utilization high. Spare Cores attempts to make it easier to compare prices across cloud providers. Source: Spare Cores.
How to Stream and Apply Real-Time Prediction Models on High-Throughput Time-Series Data. Most stream processing libraries are not Python friendly, while the majority of machine learning and data mining libraries are Python based. This design enables the re-reading of old messages.
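As a rough, hypothetical illustration of the pattern (the message source and model below are placeholders, not the stack described in the post), applying a pre-trained model message by message might look like this:

```python
import json
from collections import deque

class ThresholdModel:
    """Stand-in for a pre-trained prediction model (hypothetical)."""
    def predict(self, features: list) -> int:
        return int(sum(features) / len(features) > 0.5)

def consume_messages():
    """Placeholder for a stream consumer (e.g. reading from a message broker)."""
    for payload in ['{"values": [0.2, 0.4]}', '{"values": [0.9, 0.8]}']:
        yield payload

model = ThresholdModel()
window = deque(maxlen=100)  # keep a sliding window of recent events

for raw in consume_messages():
    event = json.loads(raw)
    window.append(event["values"])
    print("prediction:", model.predict(event["values"]))
```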
Strobelight combines several technologies, many open source, into a single service that helps engineers at Meta improve efficiency and utilization across our fleet. Engineers and developers can use this information to identify performance and resource bottlenecks, optimize their code, and improve utilization across code written in many languages (Python, Java, and Erlang among them).
Authors: Bingfeng Xia and Xinyu Liu. At LinkedIn, Apache Beam plays a pivotal role in stream processing infrastructures that process over 4 trillion events daily through more than 3,000 pipelines across multiple production data centers.
Balancing correctness, latency, and cost in unbounded data processing. Google Dataflow is a fully managed data processing service that provides serverless, unified stream and batch data processing. It is the first choice Google would recommend when dealing with a stream processing workload.
Specifically, we have adopted a “shift-left” approach, integrating data schematization and annotations early in the product development process. However, conducting these processes outside of developer workflows presented challenges in terms of accuracy and timeliness.
The change control process is a crucial aspect of project management, intended to manage and regulate changes made to the project plan, schedule, and budget. The steps of the change control process are planning, analysis, approval, testing, implementation, and closure. A change request kickstarts the change control process.
KAWA Analytics: Digital transformation is an admirable goal, but legacy systems and inefficient processes hold back many companies' efforts. By enabling advanced analytics and centralized document management, Digityze AI helps pharmaceutical manufacturers eliminate data silos and accelerate data sharing.
This scenario underscored the need for a new recommender system architecture where member preference learning is centralized, enhancing accessibility and utility across different models. To harness this data effectively, we employ a process of interaction tokenization, ensuring meaningful events are identified and redundancies are minimized.
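A toy sketch of what interaction tokenization could look like; the vocabulary, event fields, and deduplication rule here are illustrative assumptions, not the actual implementation:

```python
from itertools import groupby

# Hypothetical vocabulary mapping (action, surface) pairs to token ids.
VOCAB = {("click", "feed"): 1, ("like", "feed"): 2, ("apply", "jobs"): 3}

def tokenize_interactions(events):
    """Map raw interaction events to tokens and collapse consecutive duplicates."""
    tokens = [VOCAB[(e["action"], e["surface"])]
              for e in events if (e["action"], e["surface"]) in VOCAB]
    # Drop immediate repeats to minimize redundancy in the sequence.
    return [token for token, _ in groupby(tokens)]

events = [
    {"action": "click", "surface": "feed"},
    {"action": "click", "surface": "feed"},   # redundant repeat
    {"action": "like", "surface": "feed"},
]
print(tokenize_interactions(events))  # [1, 2]
```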
The product development process is just as vital as product management; both seem similar but have subtle variances. Product development focuses on the creation of a product, whereas product management oversees the entire process. What Is the Product Development Process? It involves seven steps.
Introducing sufficient jitter to the flush process can further reduce contention. By creating multiple topic partitions and hashing the counter key to a specific partition, we ensure that the same set of counters is processed by the same set of consumers. This process can also be used to track the provenance of increments.
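A small illustrative sketch of those two ideas, key-to-partition hashing and flush jitter; the partition count, hashing scheme, and intervals are assumptions, not the production design:

```python
import hashlib
import random
import time

NUM_PARTITIONS = 16  # illustrative partition count

def partition_for(counter_key: str) -> int:
    """Hash the counter key to a stable partition so the same counters
    always land on the same consumers."""
    digest = hashlib.md5(counter_key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def flush_with_jitter(flush_fn, base_interval_s: float = 5.0, jitter_s: float = 1.0):
    """Add random jitter to the flush interval so writers don't all
    contend for the store at the same instant."""
    time.sleep(base_interval_s + random.uniform(0, jitter_s))
    flush_fn()

print(partition_for("profile:123:views"))  # same key -> same partition every time
```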
If country_iso_code doesn't already exist in the fact table, the metric owner only needs to tell DJ that account_id is the foreign key to a `users_dimension_table` (we call this process dimension linking). DJ can then perform the joins to bring in any requested dimensions from `users_dimension_table`.
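DJ performs that join for you; purely for intuition, here is a toy pandas version of the same dimension-linking join (the tiny DataFrames below are illustrative, not real schemas):

```python
import pandas as pd

# Toy stand-ins for the fact table and users_dimension_table described above.
fact = pd.DataFrame({"account_id": [1, 2], "revenue": [120.0, 80.0]})
users_dimension_table = pd.DataFrame(
    {"account_id": [1, 2], "country_iso_code": ["US", "BR"]}
)

# Once account_id is declared as the foreign key, the requested dimension can
# be brought in with a join roughly like this one.
linked = fact.merge(users_dimension_table, on="account_id", how="left")
print(linked.groupby("country_iso_code")["revenue"].sum())
```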
The company utilizes Google Cloud to some extent, and my understanding from talking with Twitter engineers is that this was for machine learning (ML) use cases. It surely helped the Threads team that they could utilize the infrastructure of Instagram. Twitter runs its infrastructure on three of its own data centers.
Our deployments were initially manual. Avoiding downtime was nerve-wracking, and the notion of a 'rollback' was as much a relief as a technical process. After this zero-byte file was deployed to prod, the Apache web server processes slowly picked up the empty configuration file, and Apache started to log like a maniac.
Managing and utilizing data effectively is crucial for organizational success in today's fast-paced technological landscape. Manual processes can be time-consuming and error-prone. Agentic AI automates these processes, helping ensure data integrity and offering real-time insights.
But when data processes fail to match the increased demand for insights, organizations face bottlenecks and missed opportunities. This elasticity allows data pipelines to scale up or down as needed, optimizing resource utilization and cost efficiency. Regularly review usage patterns and adjust cloud resource allocation as needed.
… for the simulation engine, Go on the backend, PostgreSQL for the data layer, React and TypeScript on the frontend, and Prometheus and Grafana for monitoring and observability. And if you were wondering how all of this was built, Juraj documented his process in an incredible, 34-part blog series. You can read it here.
Here’s how Snowflake Cortex AI and Snowflake ML are accelerating the delivery of trusted AI solutions for the most critical generative AI applications: Natural language processing (NLP) for data pipelines: Large language models (LLMs) have transformative potential, but integrating batch inference into pipelines can be cumbersome.
Fluss is a compelling new project in the realm of real-time data processing. It works with stream processing engines like Flink and lakehouse formats like Iceberg and Paimon. Fluss focuses on storing streaming data and does not offer stream processing capabilities itself. It excels in event-driven architectures and data pipelines.
It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile's exposure. In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily.
We are still working on processing the backlog of asynchronous Lambda invocations that accumulated during the event, including invocations from other AWS services (such as SQS and EventBridge). As of 3:37 PM PDT, the backlog was fully processed. We are continuing to work to fully recover all services.
Leap second smearing: a solution past its time. Leap second smearing, the process of adjusting clock speeds to absorb the correction, has been a common method for handling leap seconds. The service continues to use TAI timestamps but can return UTC timestamps to clients via the API.
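As a back-of-the-envelope sketch of a linear smear (the window length and dates are arbitrary assumptions, not any provider's actual policy), the fraction of a leap second already absorbed at a given moment can be computed like this:

```python
from datetime import datetime, timedelta, timezone

def smeared_fraction(now: datetime, leap_end: datetime,
                     window: timedelta = timedelta(hours=24)) -> float:
    """Fraction of the leap second already applied, smeared linearly over
    the window that ends at the leap second boundary."""
    start = leap_end - window
    if now <= start:
        return 0.0
    if now >= leap_end:
        return 1.0
    return (now - start) / window

leap_end = datetime(2017, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
now = datetime(2016, 12, 31, 18, 0, 0, tzinfo=timezone.utc)
# 18 hours into a 24-hour window: 0.75 of the leap second has been absorbed.
print(smeared_fraction(now, leap_end))
```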
Additionally, RxJS in Angular offers a full set of tools made to easily handle asynchronous processes and reactive programming. This blog post will discuss how RxJS in Angular helps developers handle data streams, manage complicated asynchronous processes, and build responsive, useful apps.
Before we start: plenty of people who subscribe to the newsletter utilize a learning and development budget their company provides. The period from April to mid-May was challenging: I found myself in hiring freezes and canceled processes. How did you find the interview processes?
DeepSeek's development involves a unique training recipe that generates a large dataset of long chain-of-thought reasoning examples, utilizes an interim high-quality reasoning model, and employs large-scale reinforcement learning (RL). It employs a two-tower model approach to learn query and item embeddings from user engagement data.
In the realm of modern analytics platforms, where rapid and efficient processing of large datasets is essential, swift metadata access and management are critical for optimal system performance. Optimizing the server initialization process for Atlas is vital for maintaining the high availability and performance of the ThoughtSpot system.
Understanding AI Operations (AIOps) in IT Environments: What is AIOps? AIOps, or artificial intelligence for IT operations, combines AI technologies like machine learning, natural language processing, and predictive analytics with traditional IT operations. This empowers you to take a more proactive approach to protecting your systems.
Snowflake customers now have a unified platform for processing and retrieval of both structured and unstructured data with high accuracy out of the box. Planning: Applications often switch between processing data from structured and unstructured sources. The workflow involves four key components.
This article will cover the following topics: the performance improvement process, strategies for profiling streaming pipelines, common performance problems, and general guidelines for improving performance. The performance improvement of any software system is not an independent and isolated task, but an iterative process.
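A generic, hedged example of the profiling step (the pipeline stage below is a stand-in, not the article's workload): capture a profile of one stage and look at where time goes before deciding what to optimize.

```python
import cProfile
import pstats

def transform(record: dict) -> dict:
    """Stand-in for one stage of a streaming pipeline."""
    return {**record, "value": record["value"] * 2}

def run_pipeline(records):
    return [transform(r) for r in records]

records = [{"value": i} for i in range(100_000)]

profiler = cProfile.Profile()
profiler.enable()
run_pipeline(records)
profiler.disable()

# Inspect the hottest functions before changing any code.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```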
Liang Mou, Staff Software Engineer, Logging Platform; Elizabeth (Vi) Nguyen, Software Engineer I, Logging Platform. In today’s data-driven world, businesses need to process and analyze data in real time to make informed decisions. Real-Time Data Processing: CDC enables real-time data processing by capturing changes as they happen.
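A minimal, hypothetical sketch of consuming CDC-style change events and applying them to a downstream replica (the event shape is an assumption, not the platform's actual format):

```python
# Hypothetical change events as a CDC pipeline might emit them.
change_events = [
    {"op": "insert", "id": 1, "row": {"id": 1, "status": "new"}},
    {"op": "update", "id": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "delete", "id": 1, "row": None},
]

replica = {}  # downstream materialized view keyed by primary key

def apply_change(event, table):
    """Apply a single change event to keep the replica in sync."""
    if event["op"] in ("insert", "update"):
        table[event["id"]] = event["row"]
    elif event["op"] == "delete":
        table.pop(event["id"], None)

for event in change_events:
    apply_change(event, replica)
    print(event["op"], "->", replica)
```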
Over a week, participants build their own game utilizing AI in some form. Each project typically takes several years to create, with shifting hardware specifications and emerging competitors and trends to anticipate and react to during the process. Learn more or sign up if you’re interested. Prototype vs final version.
These processes were costly and time-consuming, and also introduced governance and security risks, since once data is moved, customers lose all control. As a result, data often went underutilized. We take care of planning, executing and verifying upgrades, and we do so using a rolling process without downtime.
Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.
As this is rolled out, security-conscious users who utilize the verify security code page will notice that this verification process occurs quickly and automatically. The private key is what you utilize to decrypt messages sent from another party, and it never leaves your device. This needs to be redone for all participants.
Astasia Myers: The three components of the unstructured data stack. LLMs and vector databases have significantly improved the ability to process and understand unstructured data. The learning mostly involves understanding the data's nature, the frequency of data processing, and awareness of the computing cost.
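A toy sketch of the embedding-plus-vector-search idea behind that stack (the hashing "embedding" below is a stand-in for a real embedding model, and the in-memory array stands in for a vector database):

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy bag-of-words hashing embedding; a real system would call an
    embedding model here instead."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

docs = ["quarterly revenue report", "customer support transcript", "engineering design doc"]
index = np.stack([embed(d) for d in docs])  # stand-in for a vector database

query = embed("revenue numbers for q3")
scores = index @ query  # cosine similarity, since all vectors are unit-length
print(docs[int(np.argmax(scores))])  # nearest document by embedding similarity
```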
In April 2024, Snowflake customers ran approximately 55 million queries in Snowpark on average each day for a spectrum of large-scale data processing tasks in data engineering and data science. pandas is the go-to data processing library for millions worldwide, including countless Snowflake users.
“This transition process will roll out gradually and is expected to take several months.” Why would a company like Google not utilize its status page to display the real status of the service? Google Domains will work with Squarespace to make the transition as seamless as possible for you.
The author emphasizes the importance of mastering state management, understanding "local first" data processing (prioritizing single-node solutions before distributed systems), and leveraging an asset graph approach for data pipelines. I honestly don’t have a solid answer, but this blog is an excellent overview of upskilling.
What is Data Enrichment? Data enrichment is the process of augmenting your organization's internal data with trusted, curated third-party datasets. And yet, there are inherent struggles with this process, especially when integrating data from multiple providers. Well, those processes are now streamlined like never before.