By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. In our previous blog post, we introduced Netflix's TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we're excited to present the Distributed Counter Abstraction.
The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. In this blog, we will discuss: What is the Open Table Format (OTF)? Why should we use it? These formats are transforming how organizations manage large datasets.
In this blog, we will delve into an early stage in PAI implementation: data lineage. It is a critical and powerful tool for the scalable discovery of relevant data and data flows, which supports privacy controls across Meta's systems. Data lineage enables us to efficiently navigate these assets and protect user data.
Datasets are repositories of the information required to solve a particular type of problem, and they are at the heart of all machine learning models. Because datasets are often tied to a particular type of problem, machine learning models can be built to solve those problems by learning from the data.
Fine Tuning Studio enables users to track the location of all datasets, models, and model adapters for training and evaluation. The goal is to fine-tune a small, cost-effective model that, based on customer input, can extract an appropriate "action" (think API call) that the downstream system should take for the customer.
Over the past four weeks, I took a break from blogging and LinkedIn to focus on building nao. AI companies are aiming for the moon—AGI—promising it will arrive once OpenAI develops a system capable of generating at least $100 billion in profits. Meanwhile, the AI landscape remains unpredictable.
Whether you are working on a personal project, learning the concepts, or working with datasets for your company, the primary focus is data acquisition and data understanding. In this article, we will look at 31 different places to find free datasets for data science projects. What is a Data Science Dataset?
Summary: The majority of blog posts and presentations about data engineering and analytics assume that the consumers of those efforts are internal business users accessing an environment controlled by the business. The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it.
The vast tapestry of data types spanning structured, semi-structured, and unstructured data means data professionals need to be proficient with various data formats such as ORC, Parquet, Avro, CSV, and Apache Iceberg tables to cover the ever-growing spectrum of datasets, be they images, videos, sensor data, or other types of media content.
Data quality is of paramount importance at Uber, powering critical decisions and features. In this blog, learn how we automated column-level drift detection in batch datasets at Uber scale, reducing the median time to detect issues in critical datasets by 5X.
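The post's teaser does not include code, but the core idea can be sketched as comparing a column's current-batch distribution against a trailing baseline with a two-sample Kolmogorov-Smirnov test; the threshold, data, and function names below are illustrative placeholders, not Uber's actual implementation:

```python
# Minimal sketch of column-level drift detection: compare the current
# batch of a numeric column against a baseline window with a two-sample
# Kolmogorov-Smirnov test. Threshold and data are illustrative.
from scipy.stats import ks_2samp

def detect_column_drift(baseline_values, current_values, alpha=0.01):
    """Return (drifted?, KS statistic) for one column."""
    statistic, p_value = ks_2samp(baseline_values, current_values)
    return p_value < alpha, statistic

# Example: column values sampled from yesterday's vs. today's partitions
drifted, stat = detect_column_drift([10.2, 11.0, 9.8, 10.5] * 250,
                                    [14.1, 13.8, 14.5, 13.9] * 250)
print(f"drifted={drifted}, ks_statistic={stat:.3f}")
```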
Building a large-scale unsupervised model anomaly detection system, Part 1: Distributed Profiling of Model Inference Logs. By Anindya Saha, Han Wang, Rajeev Prabhakar. LyftLearn is Lyft's ML platform. In a previous blog post, we explored the architecture and challenges of the platform.
I found the blog to be a fresh take on the skills in demand, as surfaced by layoff datasets. DeepSeek continues to impact the Data and AI landscape with its recent open-source tools, such as the Fire-Flyer File System (3FS) and smallpond. The blog provides an excellent analysis of smallpond compared to Spark and Daft.
A little over a year ago, we shared a blog post about our journey to enhance customers' meal-planning experience with personalized recipe recommendations. We explained how a system that learns from your tastes and habits could solve this issue, ultimately making the daily task of choosing meals both effortless and inspiring.
In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily. It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile's exposure.
In this blog, we will go through the technical design and share some offline and online results for our LLM-based search relevance pipeline. Pin Text Representations: Pins on Pinterest are rich multimedia entities that feature images, videos, and other content, often linked to external webpages or blogs.
The blog highlights how moving from a 6-character base-64 prefix to a 20-digit base-2 prefix spreads files more evenly across S3 and reduces request failures. Another blog is a good summary of how to use Snowflake QUERY_TAG to measure and monitor query performance. A third post made me curious to understand DataFusion's internals.
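The key-naming idea can be sketched as follows: hash each logical file name into a fixed-length binary (base-2) prefix so objects spread evenly across S3's prefix-based partitions. The layout below is a hypothetical illustration, not the blog's exact scheme:

```python
# Minimal sketch: derive a 20-digit base-2 prefix from a hash of the
# logical file name so keys distribute evenly across S3 partitions.
import hashlib

def binary_prefix(name: str, bits: int = 20) -> str:
    digest = hashlib.sha256(name.encode()).digest()
    value = int.from_bytes(digest[:4], "big") % (1 << bits)
    return format(value, f"0{bits}b")  # e.g. '01101001110001010011'

def s3_key(name: str) -> str:
    # Hypothetical key layout: <binary-prefix>/<original-name>
    return f"{binary_prefix(name)}/{name}"

print(s3_key("events/2024/part-00042.parquet"))
```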
The Medallion architecture is a framework that allows data engineers to build organized and analysis-ready datasets in a lakehouse environment. Its foundational layer is a repository for various data types, from transaction logs and sensor data to social media feeds and system logs. How do you ensure data quality in every layer?
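As a rough illustration of that layered flow (not tied to any particular lakehouse engine), here is a minimal pandas sketch with hypothetical table and column names:

```python
# Minimal Medallion sketch: Bronze keeps raw records, Silver applies
# cleaning and validation, Gold aggregates for analytics.
import pandas as pd

bronze = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": [120.0, -5.0, -5.0, 80.0],   # raw: bad value + duplicate
    "country": ["US", "US", "US", "DE"],
})

# Silver: deduplicate and enforce a simple quality rule for this layer
silver = bronze.drop_duplicates("order_id").query("amount > 0")

# Gold: analysis-ready aggregate
gold = silver.groupby("country", as_index=False)["amount"].sum()
print(gold)
```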
We believe that privacy drives product innovation. Meta's vast and diverse systems make it particularly challenging to comprehend their structure, meaning, and context at scale. We discovered that a flexible and incremental approach was necessary to onboard the wide variety of systems and languages used in building Meta's products.
In recent years, while managing Pinterest's EC2 infrastructure, particularly for our essential online storage systems, we identified a significant challenge: the lack of clear insight into EC2's network performance and its direct impact on our applications' reliability and performance.
The article summarizes recent macro trends in AI and data engineering, focusing on vibe coding, human-in-the-loop system design, and the rapid simplification of developer tooling. The Grab blog delights me, since I have tried to do this many times. Kudos to the Grab team for building a docs-as-code system.
What is Change Data Capture? In this blog post, we'll explore what CDC is, why it's important, and our journey of implementing generic CDC solutions for all online databases at Pinterest. CDC is crucial for applications that require up-to-date information, such as fraud detection systems or recommendation engines.
In this blog, we'll explore building an ETL pipeline with Snowpark by simulating a scenario where commerce data flows through distinct data layers: RAW, SILVER, and GOLDEN. These tables form the foundation for insightful analytics and robust business intelligence, with clean, enriched datasets built in the SILVER layer.
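A minimal sketch of what such a Snowpark pipeline might look like; the connection parameters, table names, and quality rules here are hypothetical stand-ins, not the blog's actual transformations:

```python
# Minimal Snowpark sketch of the RAW -> SILVER -> GOLDEN flow.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<wh>", "database": "COMMERCE", "schema": "RAW",
}).create()

# SILVER: cleaned, deduplicated records derived from RAW
raw = session.table("RAW.ORDERS")
silver = raw.filter(col("AMOUNT") > 0).drop_duplicates("ORDER_ID")
silver.write.mode("overwrite").save_as_table("SILVER.ORDERS")

# GOLDEN: analytics-ready aggregate for BI
golden = silver.group_by("COUNTRY").agg(sum_(col("AMOUNT")).alias("REVENUE"))
golden.write.mode("overwrite").save_as_table("GOLDEN.REVENUE_BY_COUNTRY")
```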
Intuit: Revolutionizing Knowledge Discovery with GenAI to Transform Document Management. Intuit writes about using a dual-loop system to build a GenAI-powered pipeline to improve knowledge discovery in its technical documentation, reportedly saving years of manual effort. Check out the full report for more insights and DataOps trends!
Building a large-scale unsupervised model anomaly detection system, Part 2: Building ML Models with Observability at Scale. By Rajeev Prabhakar, Han Wang, Anindya Saha. In our previous blog, we discussed the different challenges we faced with model monitoring and our strategy for addressing some of these problems.
A “Knowledge Management System” (KMS) allows businesses to collate this information in one place, but not necessarily to search through it accurately. Ultimately, Ollama and Cloudera provide enterprise-grade access to localized LLM models, to scale GenAI applications and build robust Knowledge Management systems.
In this blog post, we’ll explore fundamental concepts, intermediate strategies, and cutting-edge techniques that are shaping the future of data engineering. Filling in missing values could involve leveraging other company data sources or even third-party datasets.
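As a small illustration of that imputation idea, the sketch below fills a missing column by joining a hypothetical third-party reference dataset (all names and data are invented):

```python
# Minimal sketch: fill missing values in one dataset by joining a
# secondary (e.g. third-party) reference dataset on a shared key.
import pandas as pd

orders = pd.DataFrame({"zip": ["94103", "10001"], "region": [None, "NE"]})
reference = pd.DataFrame({"zip": ["94103"], "region": ["West"]})

filled = orders.merge(reference, on="zip", how="left", suffixes=("", "_ref"))
filled["region"] = filled["region"].fillna(filled["region_ref"])
filled = filled.drop(columns="region_ref")
print(filled)
```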
At Pinterest, we operate a large-scale online machine learning inference system, where feature caching plays a critical role in achieving optimal efficiency. Background: Recommender systems are fundamental to Pinterest's mission to inspire users to create a life they love.
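Pinterest's actual cache design is surely more sophisticated, but the basic mechanism can be sketched as a TTL'd memoization layer in front of the feature store; the key format and fetch callback below are hypothetical:

```python
# Minimal sketch of inference-time feature caching: memoize feature
# fetches per entity with a TTL so hot items skip the feature store.
import time

class FeatureCache:
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, dict]] = {}

    def get(self, entity_id: str, fetch_fn):
        entry = self._store.get(entity_id)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]                       # cache hit
        features = fetch_fn(entity_id)            # miss: go to the store
        self._store[entity_id] = (time.monotonic(), features)
        return features

cache = FeatureCache(ttl_seconds=30)
features = cache.get("pin:123", lambda pid: {"ctr_7d": 0.042})
```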
MoEs necessitate less compute for pre-training than dense models, facilitating the scaling of model and dataset size within similar computational budgets. The product blog from QuantumBlack gives a good view of data quality in unstructured data. The system design is an excellent reminder to think from a user's perspective.
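The MoE compute claim is easier to see with a toy router: under top-k gating, each token runs only k of N expert networks, so per-token FLOPs scale with k rather than N. A minimal numpy sketch (shapes and values are illustrative, not any particular model):

```python
# Toy top-k expert routing: only k of n_experts matmuls run per token.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2
token = rng.normal(size=d_model)
router = rng.normal(size=(d_model, n_experts))
experts = rng.normal(size=(n_experts, d_model, d_model))

logits = token @ router
top = np.argsort(logits)[-top_k:]                  # pick top-k experts
weights = np.exp(logits[top]) / np.exp(logits[top]).sum()

# Only k expert matmuls execute, not all n_experts
output = sum(w * (token @ experts[i]) for w, i in zip(weights, top))
print(output.shape)  # (16,)
```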
What are small language models? By learning the details of smaller datasets, they better balance task-specific performance and resource efficiency. Meta's model is seamlessly integrated across its platforms, increasing user access to AI insights, and leverages a larger dataset to enhance its capacity to handle complex tasks.
However, due to the absence of a control group in these countries, we adopt a synthetic control framework (blog post) to estimate the counterfactual scenario. Data quality plays a huge role in this work: before starting any math, we need to ensure a high-quality historical dataset.
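For intuition: a synthetic control is typically fit by finding nonnegative weights, summing to one, over control units that best reproduce the treated unit's pre-period series; the weighted combination then serves as the counterfactual. A minimal sketch with invented synthetic data, not the authors' actual pipeline:

```python
# Minimal synthetic control fit: weight control-country series to match
# the treated country pre-launch, then project the counterfactual.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
controls_pre = rng.normal(100, 5, size=(3, 30))   # 3 controls x 30 weeks
treated_pre = (0.5 * controls_pre[0] + 0.5 * controls_pre[2]
               + rng.normal(0, 1, 30))

def loss(w):
    return np.mean((treated_pre - w @ controls_pre) ** 2)

n = controls_pre.shape[0]
result = minimize(
    loss, x0=np.full(n, 1 / n),
    bounds=[(0, 1)] * n,
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1},
)
weights = result.x
controls_post = rng.normal(100, 5, size=(3, 10))
counterfactual = weights @ controls_post          # estimated no-launch path
```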
In this blog post, we will ingest a real-world dataset into Ozone, create a Hive table on top of it, and analyze the data to study the correlation between new vaccinations and new cases per country using a Spark ML Jupyter notebook in CML. Learn more about the impacts of global data sharing in the blog The Ethics of Data Exchange.
The blog highlights the advantages of GNNs over traditional machine learning models, which struggle to discern relationships between entities, such as users and restaurants, and edges, such as orders. Canva writes about the evolution of its experimentation system design and the current state of the function.
CDC Evaluation Guide Google Sheet: [link]. CDC Evaluation Guide GitHub: [link]. Change Data Capture (CDC) is a powerful technique in data engineering that allows for continuously capturing changes (inserts, updates, and deletes) made to source systems. One trade-off called out in the guide: the query-based approach impacts source system performance during query execution.
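To make that trade-off concrete, here is a minimal sketch of query-based CDC against a hypothetical orders table with an updated_at column: the poller repeatedly queries the source by watermark, which is exactly the load the source database must absorb (and deletes go unseen, unlike with log-based CDC):

```python
# Minimal query-based CDC sketch: poll a source table by a watermark
# column. Captures inserts/updates; deletes are invisible to this method.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 42.0, '2024-01-01T10:00:00')")

def poll_changes(conn, watermark):
    """Fetch rows changed since the last watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? "
        "ORDER BY updated_at", (watermark,)
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

watermark = "1970-01-01T00:00:00"
changes, watermark = poll_changes(conn, watermark)
print(changes)  # [(1, 42.0, '2024-01-01T10:00:00')]
```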
In the mid-2000s, Hadoop emerged as a groundbreaking solution for processing massive datasets. This blog post expands on that insightful conversation, offering a critical look at Iceberg's potential and the hurdles organizations face when adopting it, around cost (reducing storage and processing expenses) and speed (accelerating data insights).
In this blog post, we’ll explore key strategies for future-proofing your data pipelines. These platforms enable scalable and distributed data processing, allowing data teams to efficiently handle massive datasets. APIs facilitate communication between different systems, allowing data to flow seamlessly across platforms.
Change management: given that useful datasets become widely used and derived in ways that result in large and complex directed acyclic graphs (DAGs) of dependencies, altering logic or source data tends to break and/or invalidate downstream constructs. This leads to systemic, stupid errors that waste hours.
The blog is an excellent summary of the common patterns emerging in GenAI platforms. Switching from Apache Spark to Ray enables compacting 12X larger datasets, improves cost efficiency by 91%, and processes 13X more data per hour. Swiggy recently wrote about its internal platform, Hermes, a text-to-SQL solution.
Besides simply looking for email addresses associated with spam, these systems notice slight indications of spam emails, like bad grammar and spelling, urgency, financial language, and so on. Other tasks include language detection and question answering; such dialog systems are the hardest to pull off and are considered an unsolved problem in NLP.
As model architecture building blocks (e.g., transformers) became standardized, ML engineers started to show a growing appetite to iterate on datasets. While such dataset iterations can yield significant gains, we observed that only a handful of such experiments were conducted and productionized in the last six months.
In this blog post, we’ll explore key strategies that data teams should adopt to prepare for the year ahead. On average, engineers spend over half of their time maintaining existing systems rather than developing new solutions. Leveraging cloud-based platforms and distributed computing can help handle large datasets efficiently.
In 2020, anticipating the growing needs of the business and to simplify our storage offerings, we decided to consolidate our different key-value systems in the company into a single unified service called KVStore. Maintaining these disparate systems and building common functionality among them was adding a huge overhead to the teams.
Hypothesis testing is a part of inferential statistics, which uses data from a sample to draw conclusions about a whole dataset or population. Based on this information from the sample, the defect or abnormality rate for the whole dataset is then estimated.
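As a toy example of that sample-to-population inference, suppose 9 defects turn up in a sample of 200 items and we want to know whether the lot's defect rate exceeds an acceptable 2%; a one-sided binomial test with these illustrative numbers looks like:

```python
# Minimal hypothesis-test sketch: infer the population defect rate from
# a sample. H0: defect rate <= 2%; reject if p-value < 0.05.
from scipy.stats import binomtest

result = binomtest(k=9, n=200, p=0.02, alternative="greater")
print(f"p-value = {result.pvalue:.4f}")
```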
The new DeepSeek AI research paper goes into great detail about the system's architecture, how it is trained, how it is optimized, and how it can be used in the real world. This blog will break down the paper's key aspects, helping you understand how DeepSeek AI works and why it stands out in the AI landscape.
The data architecture layer is one area where growing datasets have pushed the limits of scalability and performance. The data explosion has to be met with new solutions; that's why we are excited to introduce the next-generation table format for large-scale analytic datasets within Cloudera Data Platform (CDP): Apache Iceberg.