Blog, Data Ingestion and Metadata - Data Engineering Digest

Level Up Your Data Platform With Active Metadata

Data Engineering Podcast

JUNE 19, 2022

Summary Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. In order to level up their value a new trend of active metadata is being implemented, allowing use cases like keeping BI reports up to date, auto-scaling your warehouses, and automated data governance.

Metadata

Metadata MongoDB MySQL Scala

Is Apache Iceberg the New Hadoop? Navigating the Complexities of Modern Data Lakehouses

Data Engineering Weekly

MARCH 5, 2025

This blog post expands on that insightful conversation, offering a critical look at Iceberg's potential and the hurdles organizations face when adopting it. This ecosystem includes: Catalogs: Services that manage metadata about Iceberg tables (e.g., Compute Engines: Tools that query and process data stored in Iceberg tables (e.g.,

Hadoop

Hadoop Metadata Data Ingestion Data Governance

Data Engineering Weekly #217

Data Engineering Weekly

APRIL 20, 2025

The blog took out the last edition’s recommendation on AI and summarized the current state of AI adoption in enterprises. The simplistic model expressed in the blog made it easy for me to reason about the transactional system design. Kafka is probably the most reliable data infrastructure in the modern data era.

Data Engineering

Data Engineering Data Engineer Engineering Kafka

Webinars

Agent Tooling: Connecting AI to Your Tools, Systems & Data

How to Modernize Manufacturing Without Losing Control

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

Apache Ozone Powers Data Science in CDP Private Cloud

Cloudera

AUGUST 26, 2021

In addition to big data workloads, Ozone is also fully integrated with authorization and data governance providers namely Apache Ranger & Apache Atlas in the CDP stack. While we walk through the steps one by one from data ingestion to analysis, we will also demonstrate how Ozone can serve as an ‘S3’ compatible object store.

Data Science

Data Science Cloud Hadoop Metadata

Data Engineering Weekly #213

Data Engineering Weekly

MARCH 23, 2025

[link] Georg Heiler: Upskilling data engineers What should I prefer for 2028, or how can I break into data engineering? I honestly don’t have a solid answer, but this blog is an excellent overview of upskilling. These are common LinkedIn requests.

Data Engineering

Data Engineering Data Engineer Engineering Data

Why Open Table Format Architecture is Essential for Modern Data Systems

phData: Data Engineering

NOVEMBER 8, 2024

Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms. In this blog, we will discuss: What is the Open Table format (OTF)? First, we create an Iceberg table in Snowflake and then insert some data.

Architecture

Architecture Systems Data Lake Google Cloud

Data Engineering Zoomcamp – Data Ingestion (Week 2)

Hepta Analytics

FEBRUARY 14, 2022

DE Zoomcamp 2.2.1 – Introduction to Workflow Orchestration Following last weeks blog , we move to data ingestion. We already had a script that downloaded a csv file, processed the data and pushed the data to postgres database. This week, we got to think about our data ingestion design.

Data Ingestion

Data Ingestion Data Engineering Data Engineer Engineering

NVIDIA RAPIDS in Cloudera Machine Learning

Cloudera

MAY 19, 2021

In the previous blog post in this series, we walked through the steps for leveraging Deep Learning in your Cloudera Machine Learning (CML) projects. Data Ingestion. The raw data is in a series of CSV files. We will firstly convert this to parquet format as most data lakes exist as object stores full of parquet files.

Machine Learning

Machine Learning Data Science Datasets Raw Data

Scalable Annotation Service?—?Marken

Netflix Tech

JANUARY 25, 2023

Scalable Annotation Service — Marken by Varun Sekhri , Meenakshi Jindal Introduction At Netflix, we have hundreds of micro services each with its own data models or entities. For example, we have a service that stores a movie entity’s metadata or a service that stores metadata about images. In this case it is BOUNDING_BOX.

Algorithm

Algorithm Media Metadata Data Ingestion

Data Engineering Weekly #179

Data Engineering Weekly

JULY 7, 2024

Experience Enterprise-Grade Apache Airflow Astro augments Airflow with enterprise-grade features to enhance productivity, meet scalability and availability demands across your data pipelines, and more. Hudi seems to be a de facto choice for CDC data lake features. Notion migrated the insert heavy workload from Snowflake to Hudi.

Data Engineering

Data Engineering Data Engineer Engineering Data Lake

Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and…

Netflix Tech

MARCH 25, 2019

We adopted the following mission statement to guide our investments: “Provide a complete and accurate data lineage system enabling decision-makers to win moments of truth.” Therefore, the ingestion approach for data lineage is designed to work with many disparate data sources. push or pull.

Building

Building Metadata Transportation Data Ingestion

Snowflake’s Best-in-Class Enterprise Data Foundation Unlocks Interoperability with Open Data and Internal Collaboration

Snowflake

JUNE 4, 2024

Governed internal collaboration with better discoverability and AI-powered object metadata Snowflake is introducing an entirely new way for data teams to easily discover, curate and share data, apps and now also models (private preview soon). Getting data ingested now only takes a few clicks, and the data is encrypted.

Government

Government Data Ingestion Data PostgreSQL

Running Unified PubSub Client in Production at Pinterest

Pinterest Engineering

NOVEMBER 7, 2023

Jeff Xiang | Software Engineer, Logging Platform Vahid Hashemian | Software Engineer, Logging Platform Jesus Zuniga | Software Engineer, Logging Platform At Pinterest, data is ingested and transported at petabyte scale every day, bringing inspiration for our users to create a life they love.

Kafka

Kafka Java Software Engineer Software Engineering

Next Stop – Building a Data Pipeline from Edge to Insight

Cloudera

FEBRUARY 8, 2021

This is part 2 in this blog series. You can read part 1, here: Digital Transformation is a Data Journey From Edge to Insight. The first blog introduced a mock connected vehicle manufacturing company, The Electric Car Company (ECC), to illustrate the manufacturing data path through the data lifecycle.

Data Pipeline

Data Pipeline Building Manufacturing Data Warehouse

The Rise of the Data Engineer

Maxime Beauchemin

JANUARY 20, 2017

The fact that ETL tools evolved to expose graphical interfaces seems like a detour in the history of data processing, and would certainly make for an interesting blog post of its own. Sure, there’s a need to abstract the complexity of data processing, computation and storage.

Data Engineering

Data Engineering Data Engineer Engineering ETL Tools

The Modern Data Lakehouse: An Architectural Innovation

Cloudera

SEPTEMBER 9, 2022

With this in mind, it’s clear that no “one size fits all” architecture will work here; we need a diverse set of data services, fit for each workload and purpose, backed by optimized compute engines and tools. . Data changes in numerous ways: the shape and form of the data changes; the volume, variety, and velocity changes.

Architecture

Architecture Metadata Machine Learning Unstructured Data

Recognizing Organizations Leading the Way in Data Security & Governance

Cloudera

DECEMBER 20, 2021

In the past year, the Bank of the West has begun using the Cloudera platform to establish a data governance and security framework to manage and protect its customers’ sensitive information. The platform is centralizing the data, data management & governance, and building custom controls for data ingestion into the system.

Government

Government Data Security Banking Metadata

Apache Ozone and Dense Data Nodes

Cloudera

APRIL 22, 2021

Collects and aggregates metadata from components and present cluster state. Metadata in cluster is disjoint across components. Cloudera will publish separate blog posts with results of performance benchmarks. Cisco Data Intelligence Platform. The post Apache Ozone and Dense Data Nodes appeared first on Cloudera Blog.

Pipeline-centric

Pipeline-centric Data Lake Hadoop Big Data

Accelerate Your Data Mesh in the Cloud with Cloudera Data Engineering and Modak NabuTM

Cloudera

OCTOBER 11, 2021

The platform converges data cataloging, data ingestion, data profiling, data tagging, data discovery, and data exploration into a unified platform, driven by metadata. Modak Nabu automates repetitive tasks in the data preparation process and thus accelerates the data preparation by 4x.

Data Engineering

Data Engineering Data Engineer Cloud Engineering

Cloudera Data Platform extends Hybrid Cloud vision support by supporting Google Cloud

Cloudera

MARCH 31, 2021

Customers who have chosen Google Cloud as their cloud platform can now use CDP Public Cloud to create secure governed data lakes in their own cloud accounts and deliver security, compliance and metadata management across multiple compute clusters. Data Preparation (Apache Spark and Apache Hive) .

Google Cloud

Google Cloud Cloud Amazon Web Services Cloud Storage

Accelerate Analytics for All

Cloudera

AUGUST 17, 2022

Zero Ops functionality that enables fast and easy self-service analytics on any type of data without the need for specialized ops or cloud expertise. Only data platform with built-in capability to ingest data from on-prem to the cloud. Readily Accessible Data Ingestion and Analytics.

Cloud Computing

Cloud Computing Cloud Storage Data Science Government

Upgrade Journey: The Path from CDH to CDP Private Cloud

Cloudera

SEPTEMBER 28, 2020

This customer’s workloads leverage batch processing of data from 100+ backend database sources like Oracle, SQL Server, and traditional Mainframes using Syncsort. Data Science and machine learning workloads using CDSW. The customer is a heavy user of Kafka for data ingestion. on roadmap). Instead use Ranger REST API.

Cloud

Cloud Kafka Professional Services Metadata

Open-Sourcing AvroTensorDataset: A Performant TensorFlow Dataset For Processing Avro Data

LinkedIn Engineering

JUNE 15, 2023

In this blog post, we will discuss the AvroTensorDataset API, techniques we used to improve data processing speeds by up to 162x over existing solutions (thereby decreasing overall training time by up to 66%), and performance results from benchmarks and production. The bytes are decoded based on the provided features metadata (i.e.

Datasets

Datasets Bytes Process Data Ingestion

Implement a Multi-Cloud Open Lakehouse with Apache Iceberg in Cloudera Data Platform

Cloudera

DECEMBER 15, 2022

Since we announced the general availability of Apache Iceberg in Cloudera Data Platform (CDP), Cloudera customers, such as Teranet , have built open lakehouses to future-proof their data platforms for all their analytical workloads. Only metadata will be regenerated. Data quality using table rollback.

Cloud

Cloud Metadata Data Warehouse Google Cloud

How to learn data engineering

Christophe Blefari

JANUARY 20, 2024

The main difference between both is the fact that your computation resides in your warehouse with SQL rather than outside with a programming language loading data in memory. In this category I recommend also to have a look at data ingestion (Airbyte, Fivetran, etc.), workflows (Airflow, Prefect, Dagster, etc.)

Data Engineering

Data Engineering Data Engineer Engineering Hadoop

An Engineering Guide to Data Quality - A Data Contract Perspective - Part 2

Data Engineering Weekly

MAY 16, 2023

In the second part, we will focus on architectural patterns to implement data quality from a data contract perspective. Why is Data Quality Expensive? I won’t bore you with the importance of data quality in the blog. In the 'Write' stage, we capture the computed data in a log or a staging area.

Engineering

Engineering Kafka Data Pipeline Data Warehouse

Turning Streams Into Data Products

Cloudera

JUNE 16, 2022

CSP was recently recognized as a leader in the 2022 GigaOm Radar for Streaming Data Platforms report. Faster data ingestion: streaming ingestion pipelines. The DevOps/app dev team wants to know how data flows between such entities and understand the key performance metrics (KPMs) of these entities. Not to worry.

Kafka

Kafka Manufacturing Data Lake SQL

Building Netflix’s Distributed Tracing Infrastructure

Netflix Tech

OCTOBER 19, 2020

In our previous blog post we introduced Edgar, our troubleshooting tool for streaming sessions. We could also get contextual information about the streaming session by joining relevant traces with account metadata and service logs. The high data ingestion rate eventually degraded both read and write operations.

Building

Building Transportation Java Metadata

Supercharge Your Data Lakehouse with Apache Iceberg in Cloudera Data Platform

Cloudera

JUNE 30, 2022

With Cloudera’s vision of hybrid data , enterprises adopting an open data lakehouse can easily get application interoperability and portability to and from on premises environments and any public cloud without worrying about data scaling. Why integrate Apache Iceberg with Cloudera Data Platform?

Data Lake

Data Lake Business Intelligence Metadata Data Warehouse

Optimizing data warehouse storage

Netflix Tech

DECEMBER 21, 2020

We built AutoOptimize to efficiently and transparently optimize the data and metadata storage layout while maximizing their cost and performance benefits. Sometimes Data Engineers write downstream ETLs on ingested data to optimize the data/metadata layouts to make other ETL processes cheaper and faster.

Data Warehouse

Data Warehouse Metadata Algorithm Data

Implementing the Netflix Media Database

Netflix Tech

DECEMBER 14, 2018

In the previous blog posts in this series, we introduced the N etflix M edia D ata B ase ( NMDB ) and its salient “Media Document” data model. A fundamental requirement for any lasting data system is that it should scale along with the growth of the business applications it wishes to serve.

Media

Media Database Metadata Data Schemas

AWS Glue-Unleashing the Power of Serverless ETL Effortlessly

ProjectPro

FEBRUARY 8, 2023

Do ETL and data integration activities seem complex to you? Read this blog to understand everything about AWS Glue that makes it one of the most popular data integration solutions in the industry. Did you know the global big data market will likely reach $268.4 Businesses are leveraging big data now more than ever.

AWS

AWS Scala Metadata Data Lake

A Cost-Effective Data Warehouse Solution in CDP Public Cloud – Part1

Cloudera

FEBRUARY 9, 2021

Today’s customers have a growing need for a faster end to end data ingestion to meet the expected speed of insights and overall business demand. This ‘need for speed’ drives a rethink on building a more modern data warehouse solution, one that balances speed with platform cost management, performance, and reliability.

Data Warehouse

Data Warehouse Cloud Kafka Cloud Storage

Accenture’s Smart Data Transition Toolkit Now Available for Cloudera Data Platform

Cloudera

AUGUST 31, 2021

Running on CDW is fully integrated with streaming, data engineering, and machine learning analytics. It has a consistent framework that secures and provides governance for all data and metadata on private clouds, multiple public clouds, or hybrid clouds. Consideration of both data & metadata in the migration.

Data Warehouse

Data Warehouse Database-centric Metadata Cloud

Data Cloud Deployment Framework: Architecture

Cloudyard

MARCH 4, 2023

DCDW Architecture Above all, Architecture was divided into three Business layers: Firstly,Agile Data ingestion : Heterogeneous Source System fed the data into Cloud. Respective Cloud would consume/Store the data in bucket or containers. Load the data AS-IS into Snowflake called RAW layer.

Architecture

Architecture Cloud Metadata Data Ingestion

DataOps Architecture: 5 Key Components and How to Get Started

Databand.ai

AUGUST 30, 2023

DataOps is a collaborative approach to data management that combines the agility of DevOps with the power of data analytics. It aims to streamline data ingestion, processing, and analytics by automating and integrating various data workflows.

Architecture

Architecture Data Ingestion Data Governance Data Cleanse

Of Muffins and Machine Learning Models

Cloudera

FEBRUARY 16, 2022

Weak model lineage can result in reduced model performance, a lack of confidence in model predictions and potentially violation of company, industry or legal regulations on how data is used. . Within the CML data service, model lineage is managed and tracked at a project level by the SDX. Figure 03: lineage.yaml.

Machine Learning

Machine Learning Algorithm Government Metadata

Privacy Preserving Single Post Analytics

LinkedIn Engineering

DECEMBER 12, 2023

Pinot is a columnar OLAP store that serves analytics queries on data ingested from realtime streams. PEDAL also consists of a metadata store that holds various algorithmic parameters, including the scale of noise that we introduce and whether we should use one-shot or continual observation algorithms.

Algorithm

Algorithm Metadata SQL Utilities

Data Engineering Weekly #105

Data Engineering Weekly

OCTOBER 30, 2022

I found the blog helpful in understanding the generative model’s historical development and the path forward. link] Sponsored- [New eBook] The Ultimate Data Observability Platform Evaluation Guide Considering investing in a data quality solution? The author explains how to dump the history of blockchains into S3.

Data Engineering

Data Engineering Data Engineer Engineering Data Ingestion

What Are the Best Data Modeling Methodologies & Processes for My Data Lake?

phData: Data Engineering

SEPTEMBER 19, 2023

By employing robust data modeling techniques, businesses can unlock the true value of their data lake and transform it into a strategic asset. With many data modeling methodologies and processes available, choosing the right approach can be daunting. Want to learn more about data governance?

Data Lake

Data Lake Process Metadata Data Warehouse

Data Vault on Snowflake: Feature Engineering and Business Vault

Snowflake

MARCH 30, 2023

Data Vault as a practice does not stipulate how you transform your data, only that you follow the same standards to populate business vault link and satellite tables as you would to populate raw vault link and satellite tables. Feature engineering: Data is transformed to support ML model training. ML workflow, ubr.to/3EJHjvm

Engineering

Engineering Raw Data Data Science Machine Learning

Unleash the Power of SCD2 with Finalizer Tasks

Cloudyard

JULY 16, 2024

Read Time: 3 Minute, 11 Second This blog post showcases a real-time data pipeline built in Snowflake that leverages Slowly Changing Dimensions (SCD 2) and Finalizer Tasks to ensure your customer data is always fresh, accurate, and reflects historical changes.

Data Ingestion

Data Ingestion Data Pipeline Metadata Utilities

Accelerate your Data Migration to Snowflake

RandomTrees

SEPTEMBER 6, 2020

The architecture is three layered: Database Storage: Snowflake has a mechanism to reorganize the data into its internal optimized, compressed and columnar format and stores this optimized data in cloud storage. This stage handles all the aspects of data storage like organization, file size, structure, compression, metadata, statistics.

Cloud Storage

Cloud Storage Data Ingestion Data Cleanse Data Warehouse

DataOps Tools: Key Capabilities & 5 Tools You Must Know About

Databand.ai

AUGUST 30, 2023

DataOps , short for data operations, is an emerging discipline that focuses on improving the collaboration, integration, and automation of data processes across an organization. Accelerated Data Analytics DataOps tools help automate and streamline various data processes, leading to faster and more efficient data analytics.

Data Cleanse

Data Cleanse Data Pipeline Data Ingestion Data Validation

Level Up Your Data Platform With Active Metadata

Is Apache Iceberg the New Hadoop? Navigating the Complexities of Modern Data Lakehouses

Webinars

Trending Sources

Data Engineering Weekly #217

Webinars

Apache Ozone Powers Data Science in CDP Private Cloud

Data Engineering Weekly #213

Why Open Table Format Architecture is Essential for Modern Data Systems

Data Engineering Zoomcamp – Data Ingestion (Week 2)

NVIDIA RAPIDS in Cloudera Machine Learning

Scalable Annotation Service?—?Marken

Data Engineering Weekly #179

Building and Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability, and…

Snowflake’s Best-in-Class Enterprise Data Foundation Unlocks Interoperability with Open Data and Internal Collaboration

Running Unified PubSub Client in Production at Pinterest

Next Stop – Building a Data Pipeline from Edge to Insight

The Rise of the Data Engineer

The Modern Data Lakehouse: An Architectural Innovation

Recognizing Organizations Leading the Way in Data Security & Governance

Apache Ozone and Dense Data Nodes

Accelerate Your Data Mesh in the Cloud with Cloudera Data Engineering and Modak NabuTM

Cloudera Data Platform extends Hybrid Cloud vision support by supporting Google Cloud

Accelerate Analytics for All

Upgrade Journey: The Path from CDH to CDP Private Cloud

Open-Sourcing AvroTensorDataset: A Performant TensorFlow Dataset For Processing Avro Data

Implement a Multi-Cloud Open Lakehouse with Apache Iceberg in Cloudera Data Platform

How to learn data engineering

An Engineering Guide to Data Quality - A Data Contract Perspective - Part 2

Turning Streams Into Data Products

Building Netflix’s Distributed Tracing Infrastructure

Supercharge Your Data Lakehouse with Apache Iceberg in Cloudera Data Platform

Optimizing data warehouse storage

Implementing the Netflix Media Database

AWS Glue-Unleashing the Power of Serverless ETL Effortlessly

A Cost-Effective Data Warehouse Solution in CDP Public Cloud – Part1

Accenture’s Smart Data Transition Toolkit Now Available for Cloudera Data Platform

Data Cloud Deployment Framework: Architecture

DataOps Architecture: 5 Key Components and How to Get Started

Of Muffins and Machine Learning Models

Privacy Preserving Single Post Analytics

Data Engineering Weekly #105

What Are the Best Data Modeling Methodologies & Processes for My Data Lake?

Data Vault on Snowflake: Feature Engineering and Business Vault

Unleash the Power of SCD2 with Finalizer Tasks

Accelerate your Data Migration to Snowflake

DataOps Tools: Key Capabilities & 5 Tools You Must Know About

Stay Connected