Machine learning algorithms rely heavily on the data we feed them, and the quality of the data we feed to the algorithms […] The post Practicing Machine Learning with Imbalanced Dataset appeared first on Analytics Vidhya.
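As a quick, hedged illustration of the kind of technique such a post covers (my own sketch, not code from the article), two common ways to handle class imbalance are class weighting and naive random oversampling of the minority class:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic 95/5 imbalanced binary dataset.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Option 1: reweight the minority class inside the model.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# Option 2: naive random oversampling of the minority class before training.
minority = np.where(y_train == 1)[0]
rng = np.random.default_rng(42)
extra = rng.choice(minority, size=len(y_train) - 2 * len(minority), replace=True)
X_bal, y_bal = np.vstack([X_train, X_train[extra]]), np.concatenate([y_train, y_train[extra]])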
Source: dataedo.com. GCP BigQuery is designed to handle big data and is ideal for […] Its importance lies in its ability to handle big data and provide insights that can inform business decisions. The post Best Practices For Loading and Querying Large Datasets in GCP BigQuery appeared first on Analytics Vidhya.
In our first weekly roundup of data science nuggets from around the web, check out a list of curated articles on Kaggle datasets, Python debugging tools, what it is that data scientists actually do, an overview of YOLO, 2-dimensional PyTorch tensors, and the secrets of machine learning deployment.
In this eBook you will learn everything you need to know to get started, including:
- Key Airflow terms and concepts
- How to write and schedule your first DAG
- How to connect Airflow to other tools in your data ecosystem
- How to get started with two key Airflow features: Datasets and Dynamic Task Mapping
- A list of resources to continue your Airflow journey (..)
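To give a flavor of the first two features on that list, here is a minimal sketch (my own illustrative names and URIs, assuming the Airflow 2.x TaskFlow API): one DAG publishes to a Dataset on a daily schedule, and a second DAG is scheduled on updates to that Dataset.

from datetime import datetime
from airflow.datasets import Dataset
from airflow.decorators import dag, task

orders = Dataset("s3://example-bucket/orders.parquet")  # hypothetical URI

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def produce_orders():
    @task(outlets=[orders])
    def extract():
        print("writing orders snapshot")
    extract()

@dag(schedule=[orders], start_date=datetime(2024, 1, 1), catchup=False)
def consume_orders():
    @task
    def transform():
        print("orders dataset updated, transforming")
    transform()

produce_orders()
consume_orders()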
Data enrichment is one of the most common data engineering tasks. It's relatively easy to implement with static datasets because the data is readily available. However, this apparently easy task can become a nightmare if approached with inappropriate technologies.
A dataset is a repository of the information required to solve a particular type of problem, and datasets are at the heart of all Machine Learning models. Datasets are often tied to a particular type of problem, and machine learning models can be built to solve those problems by learning from the data.
Some time ago I wrote a very simple comparison of switching from Pandas to Polars. I didn't put much real effort into it, yet it was popular, so this is my attempt to expand on that topic a […] The post How to JOIN datasets in Polars … compared to Pandas. appeared first on Confessions of a Data Guy.
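For readers who just want the gist, here is a side-by-side sketch (my own toy example, not the post's code) of the same inner join in Pandas and Polars:

import pandas as pd
import polars as pl

users_pd = pd.DataFrame({"user_id": [1, 2, 3], "name": ["ana", "bo", "cy"]})
orders_pd = pd.DataFrame({"user_id": [1, 1, 3], "amount": [10.0, 5.0, 7.5]})
users_pl = pl.DataFrame({"user_id": [1, 2, 3], "name": ["ana", "bo", "cy"]})
orders_pl = pl.DataFrame({"user_id": [1, 1, 3], "amount": [10.0, 5.0, 7.5]})

# Pandas: merge() with an explicit key and join type.
joined_pd = users_pd.merge(orders_pd, on="user_id", how="inner")

# Polars: join() has the same shape; the lazy API (.lazy()/.collect()) can optimize longer pipelines.
joined_pl = users_pl.join(orders_pl, on="user_id", how="inner")

print(joined_pd)
print(joined_pl)

Much of Polars' performance advantage on larger datasets comes from that lazy API, which builds and optimizes a query plan before executing it.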
In this era of Generative AI, data generation is at its peak. Building an accurate machine learning or AI model requires a high-quality dataset.
In my recent experiments with the MNIST dataset, that's exactly what happened. Data Pruning Results: the plot referenced here (best runs for furthest-from-centroid selection compared to the full dataset; image by author) shows the model's accuracy against training dataset size when using the most effective pruning method I tested.
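To make the idea concrete, here is a rough sketch of furthest-from-centroid selection (my own illustration, assuming flattened image features; not the article's code):

import numpy as np

def furthest_from_centroid(X: np.ndarray, y: np.ndarray, keep_frac: float = 0.5):
    """Return indices of the keep_frac of samples per class furthest from that class centroid."""
    keep = []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        centroid = X[idx].mean(axis=0)
        dists = np.linalg.norm(X[idx] - centroid, axis=1)
        n_keep = max(1, int(len(idx) * keep_frac))
        keep.extend(idx[np.argsort(dists)[-n_keep:]])  # keep the largest distances
    return np.sort(np.array(keep))

# Example with random stand-in "MNIST-like" features (784 dims, 10 classes).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 784)), rng.integers(0, 10, size=1000)
subset = furthest_from_centroid(X, y, keep_frac=0.3)
print(len(subset), "of", len(y), "samples kept")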
Fine Tuning Studio enables users to track the location of all datasets, models, and model adapters for training and evaluation. We can import the dataset on the Import Datasets page. The goal is to train an adapter for the base model that gives it better predictive capabilities on our specific dataset.
1. Introduction 2. Parts of data engineering 3.1. Requirements 3.1.1. Understand input datasets available 3.1.2. Define what the output dataset will look like 3.1.3. Define checks to ensure the output dataset is usable 3.1.4. Define SLAs so stakeholders know what to expect 3.2. Identify what tool to use to process data 3.3.
Large Language Models (LLMs) have given us a way to generate text, extract information, and identify patterns in industries from healthcare to.
The Counter Abstraction API resembles Java's AtomicInteger interface: AddCount/AddAndGetCount adjusts the count for the specified counter by the given delta value within a dataset; the delta value can be positive or negative. We create one such Rollup table per dataset and use Cassandra as our persistent store.
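To illustrate the shape of that interface, here is a hypothetical in-memory Python sketch (the real system sits behind a service with per-dataset Rollup tables in Cassandra; this toy version just uses a dict):

from collections import defaultdict

class CounterStore:
    def __init__(self) -> None:
        # {dataset: {counter_name: count}}
        self._counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))

    def add_count(self, dataset: str, counter: str, delta: int) -> None:
        """Adjust the counter by delta (delta may be negative)."""
        self._counts[dataset][counter] += delta

    def add_and_get_count(self, dataset: str, counter: str, delta: int) -> int:
        """Adjust the counter by delta and return the new value."""
        self._counts[dataset][counter] += delta
        return self._counts[dataset][counter]

store = CounterStore()
store.add_count("impressions", "homepage_views", 5)
print(store.add_and_get_count("impressions", "homepage_views", -2))  # -> 3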
TL;DR 4. How the great_expectations library works 4.1. great_expectations quick setup 5. From an implementation perspective, there are four types of tests 5.1. Running checks on one dataset 5.2. Checks involving the current dataset and its historical data 5.3. Checks involving comparing datasets
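As a taste of the first category (running checks on one dataset), here is a minimal sketch using great_expectations' older pandas-dataset style API; newer releases route this through a Data Context and Validator instead, so treat it as illustrative only:

import pandas as pd
import great_expectations as ge

df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 5.0, 7.5]})
gdf = ge.from_pandas(df)

# Each expectation returns a validation result with a success flag.
result_not_null = gdf.expect_column_values_to_not_be_null("order_id")
result_range = gdf.expect_column_values_to_be_between("amount", min_value=0, max_value=1000)
print(result_not_null.success, result_range.success)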
At Yelp, we encountered challenges that prompted us to enhance the training time of our ad-revenue generating models, which use a Wide and Deep Neural Network architecture for predicting ad click-through rates (pCTR). These models handle large tabular datasets with small parameter spaces, requiring innovative data solutions.
Big Data refers to large and complex datasets generated by various sources and growing exponentially. It is so extensive and diverse that traditional data processing methods cannot handle it. The volume, velocity, and variety of Big Data can make it difficult to process and analyze.
Matthaus lays out the dlt vision: creating the foundation for developers to create sources in a wink, building a large, easily maintainable ecosystem of API datasets. This is Croissant. Starting today it will be supported by 3 major platforms: Kaggle, HuggingFace and OpenML.
Meet Tajinder, a seasoned Senior Data Scientist and ML Engineer who has excelled in the rapidly evolving field of data science. His passion for unraveling hidden patterns in complex datasets has driven impactful outcomes, transforming raw data into actionable intelligence.
Images and Videos: Computer vision algorithms must analyze visual content and deal with noisy, blurry, or mislabeled datasets. They are responsible for designing, implementing, and maintaining robust, scalable data pipelines that transform raw unstructured data—text, images, videos, and more—into high-quality, AI-ready datasets.
What is data enrichment? Data enrichment is the process of augmenting your organization's internal data with trusted, curated third-party datasets. The Multiple Data Provider Challenge: if you rely on data from multiple vendors, you've probably run into a major challenge: the datasets are not standardized across providers.
In this technical era, Big Data has proven revolutionary and is growing at an unexpected pace. Big data is simply a vast volume of datasets measured in terabytes, petabytes, or even more. According to survey reports, around 90% of today's data was generated in just the past two years.
The Medallion architecture is a framework that allows data engineers to build organized and analysis-ready datasets in a lakehouse environment. For instance, suppose a new dataset from an IoT device is meant to be ingested daily into the Bronze layer. How do you ensure data quality in every layer?
Exploring Apache Hudi 1.0: this hybrid approach empowers enterprises to efficiently handle massive datasets while maintaining flexibility and reducing operational overhead. These advancements address enterprises' real-world challenges, such as maintaining fresh, up-to-date datasets and optimizing for high-throughput scenarios.
HDFS (Hadoop Distributed File System) is not a traditional database but a distributed file system designed to store and process big data. It is a core component of the Apache Hadoop ecosystem and allows for storing and processing large datasets across multiple commodity servers.
After my (admittedly lengthy) explanation of what I do as the EVP and GM of our Enrich business, she summarized it in a very succinct but new way: "Oh, you manage the appending datasets." That got me thinking. Matching accuracy: matching records between datasets is complex.
A large international scientific collaboration released The Well: two massive datasets ranging from physics simulation (15TB) to astronomical scientific data (100TB). It's easily readable (mildly large, ~10 pages) and gives metrics about the performance plateau that we start to see at scale.
The approach struggled with scalability, making it difficult to handle large datasets efficiently. In practical terms, this means we now work with a tabular dataset of interactions instead of a matrix. Figure 1: a stylized representation of a tabular dataset of interactions in the use case of a recipe recommendation system.
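A toy illustration of that reshaping (my own example with made-up recipe names, not the article's data): going from a wide user-item matrix to a long, tabular table of interactions.

import pandas as pd

# Wide user-item matrix: rows are users, columns are recipes, values are ratings.
matrix = pd.DataFrame(
    {"pasta": [5, None, 3], "salad": [None, 4, None]},
    index=pd.Index(["u1", "u2", "u3"], name="user_id"),
)

# Long, tabular interactions: one row per (user, recipe, rating) that actually exists.
interactions = (
    matrix.reset_index()
    .melt(id_vars="user_id", var_name="recipe_id", value_name="rating")
    .dropna(subset=["rating"])
)
print(interactions)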
Announcing DataOps Data Quality TestGen 3.0: now with actionable, automatic data quality dashboards. Imagine a tool that can point at any dataset, learn from your data, screen for typical data quality issues, and then automatically generate and perform powerful tests, analyzing and scoring your data to pinpoint issues before they snowball.
Architecture Overview: the first pivotal step in managing impressions begins with the creation of a Source-of-Truth (SOT) dataset. This foundational dataset is essential, as it supports various downstream workflows and enables a multitude of use cases.
Each project, from beginner tasks like Image Classification to advanced ones like Anomaly Detection, includes a link to the dataset and source code for easy access and implementation.
I found the blog to be a fresh take on the skills in demand, viewed through layoff datasets. Our internal benchmark on the NYC dataset shows a 48% performance gain of smallpond over Spark! Whether you use Datasets already or want to get started, we've got you covered! [link] Mehdio: DuckDB goes distributed?
[link] AWS: An introduction to preparing your own dataset for LLM training. Everything in AI eventually comes down to the quality and completeness of your internal data. Tecton writes one such case study of how HomeToGo evolved its architecture from batch inference to real-time.
The following diagram illustrates how the lineage graph has expanded. Collecting data flow signals for the AI system: for our AI systems, we collect lineage signals by tracking relationships between various assets, such as input datasets, features, models, workflows, and inferences.
Synthetic data works by leveraging models to create artificial datasets that reflect what someone might find organically (in some alternate reality where more data actually exists), and then using that new data to train their own models. But is synthetic data a long-term solution? Probably not.
Disparate repository source systems frequently use different schemas, naming standards, and data definitions, which can lead to datasets that are incompatible or conflicting. To guarantee uniformity among datasets and enable precise integration, consistent data models and terminology must be established.
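One lightweight way to make that concrete (a sketch with entirely made-up provider and column names, assuming pandas): map each source system's columns onto a single canonical schema before integrating.

import pandas as pd

CANONICAL_COLUMNS = {"customer_id", "email", "country"}

# Hypothetical per-provider column mappings onto the shared data model.
PROVIDER_MAPPINGS = {
    "provider_a": {"custID": "customer_id", "emailAddr": "email", "cntry": "country"},
    "provider_b": {"customer_number": "customer_id", "e_mail": "email", "nation": "country"},
}

def standardize(df: pd.DataFrame, provider: str) -> pd.DataFrame:
    """Rename a provider's columns to the canonical schema and keep only shared fields."""
    renamed = df.rename(columns=PROVIDER_MAPPINGS[provider])
    missing = CANONICAL_COLUMNS - set(renamed.columns)
    if missing:
        raise ValueError(f"{provider} is missing canonical fields: {missing}")
    return renamed[sorted(CANONICAL_COLUMNS)]

a = pd.DataFrame({"custID": [1], "emailAddr": ["x@example.com"], "cntry": ["US"]})
b = pd.DataFrame({"customer_number": [2], "e_mail": ["y@example.com"], "nation": ["DE"]})
combined = pd.concat([standardize(a, "provider_a"), standardize(b, "provider_b")], ignore_index=True)
print(combined)

In practice the mapping usually lives in a governed data model or transformation layer rather than in ad-hoc dictionaries, but the principle is the same.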