Blog and Datasets - Data Engineering Digest

How to JOIN datasets in Polars … compared to Pandas.

Confessions of a Data Guy

APRIL 6, 2024

It’s been a while since I wrote about Polars on this blog; I’ve been remiss.

Datasets

Datasets IT Data Data Engineering

Netflix’s Distributed Counter Abstraction

Netflix Tech

NOVEMBER 12, 2024

By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan. Introduction: In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. For more information regarding this, refer to our previous blog.

Datasets

Datasets Computer Science Systems Kafka

Enhancing Neural Network Training at Yelp: Achieving 1,400x Speedup with WideAndDeep

Yelp Engineering

JANUARY 21, 2025

These models handle large tabular datasets with small parameter spaces, requiring innovative data solutions. This blog post delves into our journey of optimizing training time using TensorFlow and Horovod, along with the development of ArrowStreamServer, our in-house library for low-latency data streaming and serving.

Datasets

Datasets Architecture Data Solutions Data

How to get datasets for Machine Learning?

Knowledge Hut

APRIL 26, 2024

Datasets are the repository of information that is required to solve a particular type of problem. Datasets play a crucial role and are at the heart of all Machine Learning models. Datasets are often related to a particular type of problem and machine learning models can be built to solve those problems by learning from the data.

Machine Learning

Machine Learning Datasets Deep Learning Finance

Introducing Cloudera Fine Tuning Studio for Training, Evaluating, and Deploying LLMs with Cloudera AI

Cloudera

NOVEMBER 13, 2024

Fine Tuning Studio enables users to track the location of all datasets, models, and model adapters for training and evaluation. We can import this dataset on the Import Datasets page. The goal is to train an adapter for this base model that gives it better predictive capabilities for our specific dataset. Model Selection.

Datasets

Datasets Machine Learning Coding Data Preparation

30+ Free Datasets for Your Data Science Projects in 2023

Knowledge Hut

NOVEMBER 28, 2023

Whether you are working on a personal project, learning the concepts, or working with datasets for your company, the primary focus is data acquisition and data understanding. In this article, we will look at 31 different places to find free datasets for data science projects. What is a Data Science Dataset?

Datasets

Datasets Data Science Project Machine Learning

Open-Sourcing AvroTensorDataset: A Performant TensorFlow Dataset For Processing Avro Data

LinkedIn Engineering

JUNE 15, 2023

To remove this bottleneck, we built AvroTensorDataset, a TensorFlow dataset for reading, parsing, and processing Avro data. Today, we’re excited to open source this tool so that other Avro and TensorFlow users can use this dataset in their machine learning pipelines to get a large performance boost to their training workloads.

Datasets

Datasets Bytes Process Data Ingestion

Shutterstock's Content Datasets Now on Databricks Marketplace

databricks

JUNE 5, 2024

Image datasets are crucial in. In today's data-driven world, the fusion of visual assets and analytical capabilities unlocks a realm of untapped potential.

Datasets

Datasets Data

How Meta discovers data flows via lineage at scale

Engineering at Meta

JANUARY 22, 2025

In this blog, we will delve into an early stage in PAI implementation: data lineage. This took Meta multiple years to complete across our millions of disparate data assets, and we’ll cover each of these more deeply in future blog posts: Inventorying involves collecting various code and data assets (e.g.,

Data Warehouse

Data Warehouse SQL Programming Language Data

Data News — Week 25.02

Christophe Blefari

JANUARY 11, 2025

Over the past four weeks, I took a break from blogging and LinkedIn to focus on building nao. A large international collaboration of scientists released The Well: 2 massive datasets from physics simulation (15TB) to astronomical scientific data (100TB). They aim to produce the same innovation as ImageNet produced for image recognition.

Data

Data Data Warehouse Coding Programming Language

Data Engineering Weekly #210

Data Engineering Weekly

MARCH 2, 2025

I found the blog to be a fresh take on the skill in demand by layoff datasets. The blog provides an excellent analysis of smallpond compared to Spark and Daft. Whether you use Datasets already or want to get started, we've got you covered!

Data Engineering

Data Engineering Data Engineer Engineering Datasets

Spotter: Your AI Analyst

ThoughtSpot

APRIL 22, 2025

Level 2: Understanding your dataset. To find connected insights in your business data, you need to first understand what data is contained in the dataset. Spotter quickly translates your datasets into business-friendly terminology so business users can confidently explore their data through natural language conversations.

BI

BI Datasets Business Intelligence Raw Data

7 Techniques to Handle Imbalanced Data

KDnuggets

AUGUST 24, 2022

This blog post introduces seven techniques that are commonly applied in domains like intrusion detection or real-time bidding, because the datasets are often extremely imbalanced.

Datasets

Datasets Data Machine Learning

Data Engineering Weekly #198

Data Engineering Weekly

NOVEMBER 24, 2024

The blog highlights how moving from 6-character base-64 to 20-digit base-2 file distribution brings more distribution in S3 and reduces request failures. The blog is a good summary of how to use Snowflake QUERY_TAG to measure and monitor query performance. The blog post made me curious to understand DataFusion's internals.

Data Engineering

Data Engineering Data Engineer Engineering Insurance

The Top 5 Alternatives to GitHub for Data Science Projects

KDnuggets

NOVEMBER 30, 2023

The blog discusses five platforms designed for data scientists with specialized capabilities in managing large datasets, models, workflows, and collaboration beyond what GitHub offers.

Data Science

Data Science Project Datasets Data

How Uber Achieves Operational Excellence in the Data Quality Experience

Uber Engineering

AUGUST 5, 2021

Uber delivers efficient and reliable transportation across the global marketplace, which is powered by hundreds of services, machine learning models, and tens of thousands of datasets.

Transportation

Transportation Datasets Machine Learning Data

Improving Pinterest Search Relevance Using Large Language Models

Pinterest Engineering

APRIL 4, 2025

In this blog, we will go through the technical design and share some offline and online results for our LLM-based search relevance pipeline. Pin Text Representations: Pins on Pinterest are rich multimedia entities that feature images, videos, and other contents, often linked to external webpages or blogs.

Machine Learning

Machine Learning Metadata Architecture Datasets

Data Engineering Weekly #212

Data Engineering Weekly

MARCH 16, 2025

[link] AWS: An introduction to preparing your own dataset for LLM training. Everything in AI eventually comes down to the quality and completeness of your internal data. The blog narrates how Apache Arrow offers better data serialization efficiency and avoids design pitfalls from the past. years of manual effort!!!

Data Engineering

Data Engineering Data Engineer Engineering Amazon Web Services

NVIDIA RAPIDS in Cloudera Machine Learning

Cloudera

MAY 19, 2021

In the previous blog post in this series, we walked through the steps for leveraging Deep Learning in your Cloudera Machine Learning (CML) projects. To try and predict this, an extensive dataset including anonymised details on the individual loanee and their historical credit history is included. Get the Dataset. Introduction.

Machine Learning

Machine Learning Data Science Datasets Raw Data

Solving the weekly menu puzzle pt.2: recommendations at Picnic

Picnic Engineering

APRIL 7, 2025

A little over a year ago, we shared a blog post about our journey to enhance customers’ meal planning experience with personalized recipe recommendations. The approach struggled with scalability, making it difficult to handle large datasets efficiently. These connections form a dataset of interactions.

Datasets

Datasets Systems Architecture Machine Learning

Simplistic Ways to Find Interesting Data Sets

Team Data Science

MARCH 15, 2020

I am taking you through my recent experience to find a dataset for my project. I defined a few sources in my earlier blog post, which will give a sneak peek of techniques to extract industries. Criteria Define a simple layout to your dataset with elements like size, type of columns, format.

Insurance

Insurance Datasets Banking Finance

Top 10 Data Engineering & AI Trends for 2025

Monte Carlo

NOVEMBER 26, 2024

Synthetic data works by leveraging models to create artificial datasets that reflect what someone might find organically (in some alternate reality where more data actually exists), and then using that new data to train their own models. But is synthetic data a long-term solution? Probably not.

Data Engineering

Data Engineering Data Engineer Engineering Unstructured Data

Introducing Impressions at Netflix

Netflix Tech

FEBRUARY 14, 2025

In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily. Architecture Overview: The first pivotal step in managing impressions begins with the creation of a Source-of-Truth (SOT) dataset.

Kafka

Kafka Datasets Metadata Utilities

The Race For Data Quality in a Medallion Architecture

DataKitchen

NOVEMBER 5, 2024

The Medallion architecture is a framework that allows data engineers to build organized and analysis-ready datasets in a lakehouse environment. For instance, suppose a new dataset from an IoT device is meant to be ingested daily into the Bronze layer. How do you ensure data quality in every layer?

Architecture

Architecture Raw Data Pipeline-centric Data Ingestion

How to Speed Up Python Pandas by Over 300x

KDnuggets

JULY 5, 2024

In this blog, we will define Pandas and provide an example of how you can vectorize your Python code to optimize dataset analysis using Pandas and make your code over 300x faster.

Python

Python Datasets Coding

A Data Scientist in Engineering Wonderland

Team Data Science

NOVEMBER 28, 2020

During our first week, Andreas helped us navigate the industry that interests each of us, and this is how we picked a dataset to work on for the following weeks. I ended up picking the Yelp dataset, which aligns with my interests in analyzing user behaviors. Stay tuned! Cheers, Liuna. Say hello on LinkedIn: [link]

Engineering

Engineering Datasets Data Data Engineering

SUMX in Power BI: Comprehensive Guide to DAX Calculations

Edureka

JANUARY 2, 2025

This blog will walk you through the SUMX Power BI function, one of these traditional and significant functions. Additionally, it manages sizable datasets without causing Power BI to crash or slow down. The purpose of the Power BI SUMX function is to perform calculations across a table or dataset row by row.

BI

BI Datasets Business Intelligence Data Analysis
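
As a rough illustration of the row-by-row behaviour described in the SUMX entry above, here is a minimal Python sketch. It is not DAX and not Edureka's code: the `sumx` helper, the `sales` table, and its column names are all hypothetical, chosen only to show the idea of evaluating an expression once per row and then summing the results.

```python
from typing import Callable, Iterable, Mapping

Row = Mapping[str, float]


def sumx(table: Iterable[Row], expression: Callable[[Row], float]) -> float:
    """Evaluate `expression` once per row, then add up the per-row results."""
    return sum(expression(row) for row in table)


# Hypothetical sales table; the column names are illustrative only.
sales = [
    {"quantity": 2, "unit_price": 10.0},
    {"quantity": 1, "unit_price": 25.0},
    {"quantity": 4, "unit_price": 2.5},
]

# The expression is computed for each row first (quantity * unit_price),
# and only afterwards are the row results summed.
total_revenue = sumx(sales, lambda row: row["quantity"] * row["unit_price"])
print(total_revenue)
```

The point of the sketch is the order of operations: multiplying the column totals (7 units times 37.5) would give a different and wrong figure, which is why an iterator-style function is used for this kind of calculation.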

Building ETL Pipeline with Snowpark

Cloudyard

DECEMBER 24, 2024

In this blog, we’ll explore Building an ETL Pipeline with Snowpark by simulating a scenario where commerce data flows through distinct data layers: RAW, SILVER, and GOLDEN. These tables form the foundation for insightful analytics and robust business intelligence. Built clean, enriched datasets in the SILVER layer.

Building

Building Raw Data Scala Business Intelligence

Handling Network Throttling with AWS EC2 at Pinterest

Pinterest Engineering

APRIL 7, 2025

In this blog post, we’ll discuss our experiences in identifying the challenges associated with EC2 network throttling. For these use cases, typically datasets are generated offline in batch jobs and get bulk uploaded from S3 to the database running on EC2. 4xl with up to 12.5

AWS

AWS Bytes Database Data Ingestion

Textual Data Wrangling with Python: A Step-by-Step Guide

WeCloudData

FEBRUARY 17, 2025

In the first blog of the data wrangling series, we introduced the basics of data wrangling using Python. We work on handling missing values, removing special characters, and dropping unnecessary columns to prepare our dataset for further analysis. Welcome back to our Data Wrangling with Python series!

Python

Python Datasets Data Data Engineering

Data Engineering Weekly #215

Data Engineering Weekly

APRIL 6, 2025

The Grab blog delights me since I have tried to do this many times. [link] Duolingo: How we built a robust ecosystem for dataset development. Duolingo shares how it reimagined data modeling through the lens of software engineering, treating modeled datasets like APIs to enhance consistency, reliability, and developer experience.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Change Data Capture at Pinterest

Pinterest Engineering

NOVEMBER 18, 2024

In this blog post, we’ll explore what CDC is, why it’s important, and our journey of implementing Generic CDC solutions for all online databases at Pinterest. Change Data Capture (CDC) is a crucial technology that enables organizations to efficiently track and capture changes in their databases. What is Change Data Capture?

Kafka

Kafka MySQL Database Software Engineer

Complete Guide to Data Transformation: Basics to Advanced

Ascend.io

OCTOBER 28, 2024

In this blog post, we’ll explore fundamental concepts, intermediate strategies, and cutting-edge techniques that are shaping the future of data engineering. Filling in missing values could involve leveraging other company data sources or even third-party datasets.

Raw Data

Raw Data Datasets Aggregated Data Data Pipeline

Apache Ozone Powers Data Science in CDP Private Cloud

Cloudera

AUGUST 26, 2021

In this blog post, we will ingest a real world dataset into Ozone, create a Hive table on top of it and analyze the data to study the correlation between new vaccinations and new cases per country using a Spark ML Jupyter notebook in CML. Learn more about the impacts of global data sharing in this blog, The Ethics of Data Exchange.

Data Science

Data Science Cloud Hadoop Metadata

Announcing Open Source DataOps Data Quality TestGen 3.0

DataKitchen

FEBRUARY 20, 2025

Now With Actionable, Automatic, Data Quality Dashboards Imagine a tool that can point at any dataset, learn from your data, screen for typical data quality issues, and then automatically generate and perform powerful tests, analyzing and scoring your data to pinpoint issues before they snowball. Announcing DataOps Data Quality TestGen 3.0:

Datasets

Datasets Metadata Data Government

Cost Effective and Secure Data Sharing: The Advantages of Leveraging Data Partitions for Sharing Large Datasets

databricks

MARCH 26, 2023

In today's business landscape, secure and cost-effective data sharing is more critical than ever for organizations looking to optimize their internal and external.

Datasets

Datasets Data

AI Powered BI for Games

databricks

SEPTEMBER 24, 2024

This blog post explores how to create a Genie space using a World of Warcraft dataset, enabling users to interactively query data and gain insights like a data analyst. Unlock the potential of your data with Databricks' AI/BI Genie spaces!

BI

BI Datasets Data Analysis Entertainment

Why Open Table Format Architecture is Essential for Modern Data Systems

phData: Data Engineering

NOVEMBER 8, 2024

The world we live in today presents larger datasets, more complex data, and diverse needs, all of which call for efficient, scalable data systems. In this blog, we will discuss: What is the Open Table format (OTF)? These formats are transforming how organizations manage large datasets. Why should we use it?

Architecture

Architecture Systems Data Lake Google Cloud

Data Engineering Weekly #207

Data Engineering Weekly

FEBRUARY 9, 2025

MoEs necessitate less compute for pre-training compared to dense models, facilitating the scaling of model and dataset size within similar computational budgets. I found the product blog from QuantumBlack gives a view of data quality in unstructured data.

Data Engineering

Data Engineering Data Engineer Engineering Unstructured Data

Last Mile Data Processing with Ray

Pinterest Engineering

SEPTEMBER 12, 2023

As model architecture building blocks (e.g. transformers) became standardized, ML engineers started to show a growing appetite to iterate on datasets. While such dataset iterations can yield significant gains, we observed that only a handful of such experiments were conducted and productionized in the last six months.

Data Process

Data Process Process Datasets Software Engineer

How Meta understands data at scale

Engineering at Meta

APRIL 28, 2025

Machine learning models: trained on labeled datasets using supervised learning and improved through unsupervised learning to identify patterns and anomalies in unlabeled data. For example, in the data warehouse, it’s represented as a Dataset – an in-code Python class capturing the asset’s schema and metadata.

Metadata

Metadata Data Utilities Data Warehouse

D3: An Automated System to Detect Data Drifts

Uber Engineering

FEBRUARY 23, 2023

In this blog, learn how we automated column-level drift detection in batch datasets at Uber scale, reducing the median time to detect issues in critical datasets by 5X. Data quality is of paramount importance at Uber, powering critical decisions and features.

Systems

Systems Datasets Data

Introducing Apache Iceberg in Cloudera Data Platform

Cloudera

FEBRUARY 22, 2022

The data architecture layer is one such area where growing datasets have pushed the limits of scalability and performance. The data explosion has to be met with new solutions; that’s why we are excited to introduce the next generation table format for large scale analytic datasets within Cloudera Data Platform (CDP) – Apache Iceberg.

Metadata

Metadata Datasets BI SQL

Data Engineering Weekly #179

Data Engineering Weekly

JULY 7, 2024

The blog highlights the advantages of GNN over traditional machine learning models, which struggle to discern relationships between various entities, such as users and restaurants, and edges, such as order. The blog gives an overview of statistical techniques used in drift detection.

Data Engineering

Data Engineering Data Engineer Engineering Data Lake
