Since it takes so long to iterate on workflows, some ML engineers started to perform data processing directly inside training jobs. This is what we commonly refer to as Last Mile Data Processing. Last Mile processing can boost ML engineers’ velocity as they can write code in Python, directly using PyTorch.
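The "last mile" idea above can be sketched in a few lines: transforms run lazily inside the data loader the training job reads from, so changing them is an edit to Python code rather than a workflow redeploy. This is a minimal, dependency-free sketch in the style of a PyTorch Dataset; the class and field names are illustrative, not from the original post.

```python
# A minimal sketch of "last mile" data processing: transforms are applied
# lazily inside the training-data loader rather than in an upstream
# workflow. Plain Python stands in for a PyTorch-style Dataset here.

class LastMileDataset:
    """Applies per-record transforms at read time, inside the training job."""

    def __init__(self, records, transform):
        self.records = records
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        # The transform runs here, on demand, so changing it only requires
        # editing Python code -- no upstream pipeline redeploy.
        return self.transform(self.records[idx])


# Example: normalize raw feature values on the fly.
raw = [{"x": 10.0}, {"x": 20.0}, {"x": 30.0}]
dataset = LastMileDataset(raw, transform=lambda r: {"x": r["x"] / 10.0})
batch = [dataset[i] for i in range(len(dataset))]
```

A real PyTorch `Dataset` follows the same `__len__`/`__getitem__` contract, which is why per-record processing slots so naturally into the training loop.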
The Race For Data Quality In A Medallion Architecture The Medallion architecture pattern is gaining traction among data teams. It is a layered approach to managing and transforming data. By systematically moving data through these layers, the Medallion architecture enhances the data structure in a data lakehouse environment.
I can now begin drafting my data ingestion/streaming pipeline without being overwhelmed. With careful consideration and learning about your market, the choices you need to make become narrower and clearer. I'll use Python and Spark because they are the top two requested skills in Toronto.
The blog took out the last edition’s recommendation on AI and summarized the current state of AI adoption in enterprises. The simplistic model expressed in the blog made it easy for me to reason about the transactional system design. Kafka is probably the most reliable data infrastructure in the modern data era.
Complete Guide to Data Ingestion: Types, Process, and Best Practices Helen Soloveichik July 19, 2023 What Is Data Ingestion? Data ingestion is the process of obtaining, importing, and processing data for later use or storage in a database.
When you deconstruct the core database architecture, deep in the heart of it you will find a single component that is performing two distinct competing functions: real-time data ingestion and query serving. When data ingestion has a flash flood moment, your queries will slow down or time out, making your application flaky.
[link] Georg Heiler: Upskilling data engineers What should I prefer for 2028, or how can I break into data engineering? I honestly don’t have a solid answer, but this blog is an excellent overview of upskilling. These are common LinkedIn requests.
In addition to big data workloads, Ozone is also fully integrated with authorization and data governance providers, namely Apache Ranger and Apache Atlas, in the CDP stack. While we walk through the steps one by one from data ingestion to analysis, we will also demonstrate how Ozone can serve as an ‘S3’-compatible object store.
For organizations that are considering moving from a legacy data warehouse to Snowflake, are looking to learn more about how the AI Data Cloud can support legacy Hadoop use cases, or are struggling with a cloud data warehouse that just isn’t scaling anymore, it often helps to see how others have done it.
Data transformation helps make sense of the chaos, acting as the bridge between unprocessed data and actionable intelligence. You might even think of effective data transformation like a powerful magnet that draws the needle from the haystack, leaving the hay behind.
This is part 2 in this blog series. You can read part 1 here: Digital Transformation is a Data Journey From Edge to Insight. The first blog introduced a mock connected vehicle manufacturing company, The Electric Car Company (ECC), to illustrate the manufacturing data path through the data lifecycle.
The missing chapter is not about point solutions or the maturity journey of use cases. The missing chapter is about the data; it’s always been about the data and, most importantly, the journey data weaves from edge to artificial intelligence insight.
In the second part, we will focus on architectural patterns to implement data quality from a data contract perspective. Why is Data Quality Expensive? I won’t bore you with the importance of data quality in the blog. Let’s talk about the data processing types.
The blog posts How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka and Using Apache Kafka to Drive Cutting-Edge Machine Learning describe the benefits of leveraging the Apache Kafka® ecosystem as a central, scalable and mission-critical nervous system. For now, we’ll focus on Kafka.
Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms. In this blog, we will discuss what the Open Table Format (OTF) is, why we should use it, a brief history of OTF, and a comparative study of the major OTFs.
Data Collection/Ingestion The next component in the data pipeline is the ingestion layer, which is responsible for collecting and bringing data into the pipeline. By efficiently handling data ingestion, this component sets the stage for effective data processing and analysis.
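The ingestion layer described above can be sketched as a small function that pulls records from several heterogeneous sources and funnels them, tagged with their origin, into one queue for downstream processing. The source names and record shapes below are illustrative assumptions, not from the original post.

```python
# A minimal sketch of an ingestion layer: collect records from each named
# source into a single pipeline queue, tagging each record with its origin
# so downstream stages can route or audit it.

from collections import deque

def ingest(sources):
    """Collect records from each named source into one pipeline queue."""
    queue = deque()
    for name, fetch in sources.items():
        for record in fetch():
            queue.append({"source": name, "payload": record})
    return queue

# Hypothetical sources: each is a zero-argument callable returning records.
sources = {
    "api": lambda: [{"id": 1}, {"id": 2}],
    "file": lambda: [{"id": 3}],
}
pipeline_queue = ingest(sources)
```

In a production pipeline the callables would wrap real connectors (HTTP clients, file readers, message consumers), but the funnel-into-one-queue shape stays the same.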
It calls out that Cloudera DataFlow “includes streaming flow and streaming data processing unified with Cloudera Data Platform”. The post Cloudera named a Strong Performer in The Forrester Wave™: Streaming Analytics, Q2 2021 appeared first on Cloudera Blog.
While Cloudera Flow Management has been eagerly awaited by our Cloudera customers for use on their existing Cloudera platform clusters, Cloudera Edge Management has generated equal buzz across the industry for the possibilities that it brings to enterprises in their IoT initiatives around edge management and edge data collection.
Here’s what implementing an open data lakehouse with Cloudera delivers: Integration of Data Lake and Data Warehouse: An open data lakehouse brings together the best of both worlds by integrating the storage flexibility of a data lake with the query performance and structured querying capabilities of a data warehouse.
The Rise of Data Observability Data observability has become increasingly critical as companies seek greater visibility into their data processes. This growing demand has found a natural synergy with the rise of the data lake.
Data integration and ingestion: With robust data integration capabilities, a modern data architecture makes real-time dataingestion from various sources—including structured, unstructured, and streaming data, as well as external data feeds—a reality.
Some years ago he wrote three articles defining the data engineering field. Some concepts: when doing data engineering you can touch a lot of different concepts. The main difference between the two is that your computation resides in your warehouse with SQL, rather than outside it with a programming language loading data in memory.
In this blog post, we will discuss the AvroTensorDataset API, techniques we used to improve data processing speeds by up to 162x over existing solutions (thereby decreasing overall training time by up to 66%), and performance results from benchmarks and production.
Here’s What You Need to Know About PySpark This blog will take you through the basics of PySpark, the PySpark architecture, and a few popular PySpark libraries, among other things. Finally, you'll find a list of PySpark projects to help you gain hands-on experience and land an ideal job in Data Science or Big Data.
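PySpark's core model is worth a tiny illustration: computation is expressed as transformations (map, filter, flatMap) and actions (reduce, collect) over distributed collections. The sketch below mimics the classic word-count flow with plain Python builtins on a local list, so it runs without a Spark cluster; in real PySpark the same steps would execute on an RDD or DataFrame across executors.

```python
# A dependency-free sketch of the map/reduce flow that PySpark parallelizes.
# Input lines are made up for illustration.

lines = ["spark makes big data simple", "big data needs spark"]

# flatMap: split lines into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum counts per word
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n
```

In PySpark proper this would read roughly as `rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)`; the local version just makes each stage visible.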
Authors: Bingfeng Xia and Xinyu Liu Background At LinkedIn, Apache Beam plays a pivotal role in stream processing infrastructures that process over 4 trillion events daily through more than 3,000 pipelines across multiple production data centers.
The fact that ETL tools evolved to expose graphical interfaces seems like a detour in the history of data processing, and would certainly make for an interesting blog post of its own. Sure, there’s a need to abstract the complexity of data processing, computation and storage.
DataOps is a collaborative approach to data management that combines the agility of DevOps with the power of data analytics. It aims to streamline data ingestion, processing, and analytics by automating and integrating various data workflows.
In our previous blog post we introduced Edgar, our troubleshooting tool for streaming sessions. This flexibility allows tracer libraries to record 100% of traces in our mission-critical streaming microservices while collecting minimal traces from auxiliary systems like offline batch data processing.
DataOps, short for data operations, is an emerging discipline that focuses on improving the collaboration, integration, and automation of data processes across an organization. Accelerated Data Analytics DataOps tools help automate and streamline various data processes, leading to faster and more efficient data analytics.
Use cases like fraud detection, network threat analysis, manufacturing intelligence, commerce optimization, real-time offers, instantaneous loan approvals, and more are now possible by moving the data processing components up the stream to address these real-time needs. Faster data ingestion: streaming ingestion pipelines.
Azure Data Engineering is a rapidly growing field that involves designing, building, and maintaining data processing systems using Microsoft Azure technologies. Contents: What is the role of an Azure Data Engineer? Azure data engineers are essential in the design, implementation, and upkeep of cloud-based data solutions.
The table below summarizes Hive and Druid key features and strengths (Cloudera Data Warehouse; efficient batch data processing; complex data transformations; native streaming ingestion support from Kafka and Kinesis) and suggests how combining the feature sets can provide the best of both worlds for data analytics.
As Snowflake streams define an offset to track change data capture (CDC) changes on underlying tables and views, Tasks can be used to schedule the consumption of that data. We covered this in depth in a previous blog post. Today’s Snowflake Dynamic Tables do not support append-only data processing.
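The stream-plus-offset mechanism is simple to sketch: the stream records a position in the table's change history, and each scheduled consumption (a Task, in Snowflake terms) reads only changes past that offset and then advances it. This is plain Python illustrating the idea, not the Snowflake API; the change shapes are made up.

```python
# A sketch of the offset idea behind Snowflake streams: consume only the
# changes recorded since the last consumption, then advance the offset.

class ChangeStream:
    def __init__(self):
        self.changes = []   # append-only change log of the underlying table
        self.offset = 0     # position of the last consumed change

    def record(self, change):
        self.changes.append(change)

    def consume(self):
        """Return changes since the stored offset, then advance it."""
        new = self.changes[self.offset:]
        self.offset = len(self.changes)
        return new

stream = ChangeStream()
stream.record({"op": "INSERT", "id": 1})
stream.record({"op": "UPDATE", "id": 1})
first_batch = stream.consume()    # both recorded changes
stream.record({"op": "INSERT", "id": 2})
second_batch = stream.consume()   # only the change made after the first read
```

Consuming the stream advancing the offset is exactly why scheduled Tasks pair so naturally with streams: each run picks up precisely the delta since the previous run.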
Conversely, high latency can hinder your organization’s data integration and streaming efforts. As data-driven decision-making becomes increasingly vital, the importance of minimizing latency has never been clearer. You can do so by harnessing real-time data processing over batch processing methodologies.
Since we announced the general availability of Apache Iceberg in Cloudera Data Platform (CDP), Cloudera customers, such as Teranet , have built open lakehouses to future-proof their data platforms for all their analytical workloads. Cloudera partners are also benefiting from Apache Iceberg in CDP. ORC open file format support.
In the early days, many companies simply used Apache Kafka® for data ingestion into Hadoop or another data lake. Go, and Python SDKs where an application can use SQL to query raw data coming from Kafka through an API (but that is a topic for another blog). However, Apache Kafka is more than just messaging.
The Five Use Cases in Data Observability: Mastering Data Production (#3) Introduction Managing the production phase of data analytics is a daunting challenge. Overseeing multi-tool, multi-dataset, and multi-hop data processes ensures high-quality outputs.
Merge As the data lands into the data warehouse through real-time data ingestion systems, it comes in different sizes. Merging those numerous smaller files into a handful of larger files can make query processing faster and reduce storage space. We will publish a follow-up blog post about AutoAnalyze in the future.
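The small-file merge described above can be sketched as a greedy bin-packing pass: many small ingested files are combined into fewer files near a target size, which cuts per-file overhead at query time. The file names, sizes (in arbitrary units), and threshold below are illustrative assumptions.

```python
# A sketch of small-file compaction: greedily merge files into bins no
# larger than target_size, preserving arrival order.

def compact(files, target_size):
    """Greedily merge (name, size) files into bins no larger than target_size."""
    merged, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > target_size:
            merged.append(current)  # close the current bin and start a new one
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        merged.append(current)
    return merged

small_files = [("f1", 3), ("f2", 4), ("f3", 2), ("f4", 6), ("f5", 1)]
bins = compact(small_files, target_size=8)
```

Real warehouse compaction services apply the same idea to file byte sizes, typically targeting the storage layer's optimal block or scan size.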
Data Engineering Weekly Is Brought to You by RudderStack RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. The blog narrates the key concepts of the Kimball model and a modern outlook on the concepts.
Most scenarios require a reliable, scalable, and secure end-to-end integration that enables bidirectional communication and data processing in real time. MQTT Proxy for data ingestion without an MQTT broker. But that doesn’t move much.
[link] The short YouTube video gives a nice overview of the Data Cards. We often think of AI/ML as a complex data processing problem, but it isn’t of much use until it is exposed to an end user or an application. The blog narrates one such application that uses video quality with neural networks.
Aim to automate processes: Automation is a key aspect of both DataOps and MLOps as it helps streamline workflows, reduce errors, increase efficiency, and ensure consistency across projects. Better data observability equals better data quality.
The AWS training will prepare you to become a master of the cloud: storing, processing, and developing applications for cloud data. Amazon AWS Kinesis makes it possible to process and analyze data from multiple sources in real time. What can I do with Kinesis Data Streams? How Amazon Kinesis Works?
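One detail of how Kinesis works is easy to demonstrate: each record is routed to a shard by taking the MD5 hash of its partition key and mapping it into a shard's hash-key range. The sketch below reproduces that routing idea locally with evenly split ranges; the key names are made up, and real producer calls would go through boto3 rather than this function.

```python
# A sketch of Kinesis-style shard routing: MD5-hash the partition key and
# map the digest into one of num_shards evenly split hash ranges.

import hashlib

def shard_for(partition_key, num_shards):
    """Map a partition key to a shard index via its MD5 hash."""
    digest = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    space = 2 ** 128                     # MD5 output space
    return digest * num_shards // space  # even split of the hash range

# Records with the same partition key always land on the same shard,
# which is what preserves per-key ordering in Kinesis.
same = shard_for("user-42", 4) == shard_for("user-42", 4)
```

This is why choosing a high-cardinality partition key matters: it spreads records evenly across shards while keeping each key's records ordered.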
The Challenge: High Stakes in the Age of Personalized Data Observability The primary challenge stems from the requirement of Data Consumers for personalized monitoring and alerts based on their unique data processing needs. Data Observability platforms often need to deliver this level of customization.