This ecosystem includes: Catalogs: services that manage metadata about Iceberg tables; Compute Engines: tools that query and process data stored in Iceberg tables (e.g., Trino, Spark, Snowflake, DuckDB); and Maintenance Processes: operations that optimize Iceberg tables, such as compacting small files and managing metadata.
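As a sketch of what such a maintenance process can look like in practice, the PySpark snippet below calls Iceberg's built-in compaction and snapshot-expiration procedures; the catalog name (demo), warehouse path, and table name (db.events) are assumptions for illustration, not details from the article.

```python
# Minimal sketch of an Iceberg maintenance job in PySpark. The catalog name
# ("demo"), warehouse path, and table name ("db.events") are assumptions, and
# the Iceberg Spark runtime JAR is assumed to be on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-maintenance")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Compact small data files into fewer, larger ones.
spark.sql("CALL demo.system.rewrite_data_files(table => 'db.events')")

# Expire old snapshots so table metadata does not grow without bound.
spark.sql("CALL demo.system.expire_snapshots(table => 'db.events', retain_last => 5)")
```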
With Hybrid Tables’ fast, high-concurrency point operations, you can store application and workflow state directly in Snowflake, serve data without reverse ETL and build lightweight transactional apps while maintaining a single governance and security model for both transactional and analytical data — all on one platform.
First, we create an Iceberg table in Snowflake and insert some data. Then, we add another column called HASHKEY, insert more data, and locate the S3 file containing metadata for the Iceberg table. In the screenshot below, we can see that the metadata file for the Iceberg table retains the snapshot history.
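A minimal sketch of those steps, assuming the snowflake-connector-python client and hypothetical credentials, external volume, and table names (none of which come from the original walkthrough), might look like this:

```python
# Rough sketch of the steps described above: create an Iceberg table in
# Snowflake, insert data, then add a HASHKEY column and insert more rows.
# All account details, the external volume, and the table name are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",  # placeholders
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# Create a Snowflake-managed Iceberg table (external volume name is an assumption).
cur.execute("""
    CREATE ICEBERG TABLE customers (id INT, name STRING)
      CATALOG = 'SNOWFLAKE'
      EXTERNAL_VOLUME = 'my_ext_volume'
      BASE_LOCATION = 'customers/'
""")
cur.execute("INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace')")

# Evolve the schema, then load more data; each change produces a new snapshot
# that remains visible in the table's metadata files on S3.
cur.execute("ALTER ICEBERG TABLE customers ADD COLUMN hashkey STRING")
cur.execute("INSERT INTO customers VALUES (3, 'Alan', 'a1b2c3')")

conn.close()
```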
Accessing data from the manufacturing shop floor is one of the key topics of interest for the majority of cloud platform vendors due to the pace of Industry 4.0. Working with our partners, this architecture includes MQTT-based data ingestion into Snowflake. Stay tuned for more insights on Industry 4.0.
[link] LinkedIn: Journey of next-generation control plane for data systems. LinkedIn writes about the evolution of Nuage, its internal control plane framework for managing data infrastructure resources. Initially a self-service platform (Nuage 1.0), it transitioned to a decentralized model (Nuage 2.0) and then to Nuage 3.0.
In medicine, lower sequencing costs and improved clinical access to NGS technology have been shown to increase diagnostic yield for a range of diseases, from relatively well-understood Mendelian disorders, including muscular dystrophy and epilepsy, to rare diseases such as Alagille syndrome.
While we walk through the steps one by one from data ingestion to analysis, we will also demonstrate how Ozone can serve as an ‘S3’-compatible object store. Learn more about the impacts of global data sharing in this blog, The Ethics of Data Exchange. Data ingestion through ‘S3’. Ozone Namespace Overview.
DE Zoomcamp 2.2.1 – Introduction to Workflow Orchestration. Following last week's blog, we move to data ingestion. We already had a script that downloaded a CSV file, processed the data, and pushed the data to a Postgres database. This week, we got to think about our data ingestion design.
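For context, a minimal version of that kind of ingestion script, assuming pandas and SQLAlchemy with a psycopg2 driver and hypothetical URL, column, and table names, could look like the following:

```python
# Minimal sketch of a CSV-to-Postgres ingestion script. The CSV URL, the
# "pickup_datetime" column, the table name, and the connection string are all
# placeholders, not details from the course material.
import pandas as pd
from sqlalchemy import create_engine

CSV_URL = "https://example.com/trips.csv"  # placeholder URL
engine = create_engine("postgresql://user:password@localhost:5432/demo_db")

# Read the file in chunks so large files do not exhaust memory.
for chunk in pd.read_csv(CSV_URL, chunksize=100_000):
    # Light processing: parse timestamps before loading.
    chunk["pickup_datetime"] = pd.to_datetime(chunk["pickup_datetime"])
    chunk.to_sql("trips", engine, if_exists="append", index=False)
```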
Iceberg tables (now generally available), when combined with the capabilities of the Snowflake platform, allow you to build various open architectures, including a data lakehouse and data mesh. Parquet Direct (private preview) allows you to use Iceberg without rewriting or duplicating Parquet files — even as new Parquet files arrive.
More and more customers are dramatically accelerating their time to value with Databricks data pipelines by leveraging Ascend automation. Instead, it is a Sankey diagram driven by the same dynamic metadata that runs the Ascend control plane. Improved performance by upgrading our ingestion engine from Spark 3.2.0
Snowpark Updates: Model management with the Snowpark Model Registry (public preview). Snowpark Model Registry is an integrated solution to register, manage and use models and their metadata natively in Snowflake. Learn more here. This improves manageability, troubleshooting and auditability for security admins. Learn more here.
Therefore, the ingestion approach for data lineage is designed to work with many disparate data sources. Our data ingestion approach, in a nutshell, is classified broadly into two buckets: push and pull. We leverage Metacat data, our internal metadata store and service, to enrich lineage data with additional table metadata.
What if you could access all your data and execute all your analytics in one workflow, quickly, with only a small IT team? CDP One is a new service from Cloudera that is the first data lakehouse SaaS offering with cloud compute, cloud storage, machine learning (ML), streaming analytics, and enterprise-grade security built in.
The promise of a modern data lakehouse architecture. Imagine having self-service access to all business data, anywhere it may be, and being able to explore it all at once. Imagine answering burning business questions nearly instantly, without waiting for data to be found, shared, and ingested.
Attribute-based access control and SparkSQL fine-grained access control. Lineage and chain of custody, advanced data discovery, and business glossary. Store and access schemas across clusters and rebalance clusters with Cruise Control. Data science and machine learning workloads using CDSW. Ranger 2.0.
The right set of tools helps businesses utilize data to drive insights and value. But balancing a strong layer of security and governance with easy access to data for all users is no easy task. Winner of the Data Impact Awards 2021: Security & Governance Leadership. You can become a data hero too.
The platform converges data cataloging, data ingestion, data profiling, data tagging, data discovery, and data exploration into a unified platform, driven by metadata. Modak Nabu automates repetitive tasks in the data preparation process and thus accelerates data preparation by 4x.
Since we announced the general availability of Apache Iceberg in Cloudera Data Platform (CDP), Cloudera customers, such as Teranet, have built open lakehouses to future-proof their data platforms for all their analytical workloads. Only metadata will be regenerated.
In retrospect, complex SCD modeling techniques are not intuitive and reduce accessibility. It also becomes the role of the data engineering team to be a “center of excellence” through the definitions of standards, best practices and certification processes for data objects.
Vector search has seen an explosion in popularity due to improvements in accuracy and broadened accessibility to the models used to generate embeddings. Rockset offers a number of benefits along with vector search support to create relevant experiences: Real-Time Data: Ingest and index incoming data in real-time with support for updates.
Customers who have chosen Google Cloud as their cloud platform can now use CDP Public Cloud to create secure governed data lakes in their own cloud accounts and deliver security, compliance and metadata management across multiple compute clusters. Data Preparation (Apache Spark and Apache Hive).
The data volume we will deal with is small, so we will not go overboard with data partitioning, time travel, Snowpark, and other advanced Snowflake capabilities. However, we will pay particular attention to Access Control (this will be used for dbt access). There are two ways to access dbt: dbt Cloud and dbt Core.
When Glue receives a trigger, it collects the data, transforms it using code that Glue generates automatically, and then loads it into Amazon S3 or Amazon Redshift. Then, Glue writes the job's metadata into the embedded AWS Glue Data Catalog. A classifier returns a certainty between 0.0 and 1.0, with 1.0 meaning the data exactly matches the classifier and 0.0 meaning there is no match. Why Use AWS Glue?
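As a rough illustration of that flow, the sketch below shows the general shape of a Glue job script that reads from the Data Catalog, applies a mapping, and writes Parquet to S3; the database, table, column, and bucket names are invented for the example.

```python
# Sketch of a Glue ETL job. Database ("sales_db"), table ("raw_orders"),
# columns, and the S3 path are hypothetical placeholders.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source table registered in the AWS Glue Data Catalog.
source = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Simple transform: keep two columns and cast one of them.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "string", "amount", "double")])

# Write the result to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet")

job.commit()
```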
ECC will enrich the data collected and make it available for use in analysis and model creation later in the data lifecycle. Below is the entire set of steps in the data lifecycle, and each step will be supported by a dedicated blog post (see the figure below).
Indeed, why would we build a data connector from scratch if it already exists and is being managed in the cloud? Very often such a service is priced per row and might become quite expensive at an enterprise level of data ingestion, i.e., big data pipelines. Dataform’s dependency graph and metadata. Image by author.
With Cloudera’s vision of hybrid data, enterprises adopting an open data lakehouse can easily get application interoperability and portability to and from on-premises environments and any public cloud without worrying about data scaling. Why integrate Apache Iceberg with Cloudera Data Platform?
A fundamental requirement for any lasting data system is that it should scale along with the growth of the business applications it wishes to serve. NMDB is built to be a highly scalable, multi-tenant, media metadata system that can serve a high volume of write/read throughput as well as support near real-time queries.
Faster data ingestion: streaming ingestion pipelines. The DevOps/app dev team wants to know how data flows between such entities and understand the key performance metrics (KPMs) of these entities. Moving beyond traditional data-at-rest analytics: next generation stream processing with Apache Flink.
With on-demand pricing, you will generally have access to up to 2,000 concurrent slots, shared among all queries in a single project, which is more than enough in most cases. Choosing the right model depends on your data access patterns and compression capabilities. Data can easily be uploaded and stored at low cost.
Today’s customers have a growing need for faster end-to-end data ingestion to meet the expected speed of insights and overall business demand. This ‘need for speed’ drives a rethink of how to build a more modern data warehouse solution, one that balances speed with platform cost management, performance, and reliability.
The APIs support emitting unstructured log lines and typed metadata key-value pairs (per line). Ingestion clusters read objects from queues and support additional parsing based on user-defined regex extraction rules. The extracted key-value pairs are written to the line’s metadata.
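A small, self-contained sketch of the idea (not the platform's actual code) is shown below: a user-defined regex rule with named groups extracts typed key-value pairs from an unstructured line and merges them into that line's metadata.

```python
# Illustration of a user-defined regex extraction rule. The rule, the log line,
# and the metadata keys are all made up for the example.
import re

# Hypothetical rule: named groups become metadata keys.
RULE = re.compile(r"status=(?P<status>\d+)\s+latency_ms=(?P<latency_ms>\d+)")

def extract_metadata(line: str) -> dict:
    match = RULE.search(line)
    if not match:
        return {}
    # Cast the extracted values to their declared types.
    return {"status": int(match["status"]), "latency_ms": int(match["latency_ms"])}

record = {
    "line": "GET /api/v1/items status=200 latency_ms=37",
    "metadata": {"service": "catalog"},  # typed pairs emitted with the line
}
record["metadata"].update(extract_metadata(record["line"]))
print(record["metadata"])  # {'service': 'catalog', 'status': 200, 'latency_ms': 37}
```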
Within the CML data service, model lineage is managed and tracked at a project level by the SDX. SDX provides open metadata management and governance across each deployed environment by allowing organisations to catalogue, classify, control access to, and manage all data assets. Figure 03: lineage.yaml.
A true enterprise-grade integration solution calls for source and target connectors that can accommodate VSAM files, COBOL copybooks, open standards like JSON, and modern platforms like Amazon Web Services (AWS), Confluent, Databricks, or Snowflake. Questions to ask each vendor: Which enterprise data sources and targets do you support?
ML pipeline operations begin with data ingestion and validation, followed by transformation. A model is then trained on the transformed data and deployed. You can access it from here. This process also creates a SQLite database for storing the metadata of the pipeline process.
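As a toy illustration of that last point, the sketch below (using Python's standard sqlite3 module, with made-up stage names) records each pipeline stage's run metadata in a local SQLite database:

```python
# Toy sketch: each pipeline stage records its run metadata in SQLite.
# Stage names and the database file are placeholders, not the framework's own.
import sqlite3
import time

conn = sqlite3.connect("pipeline_metadata.db")
conn.execute("""CREATE TABLE IF NOT EXISTS runs
                (stage TEXT, status TEXT, started_at REAL, finished_at REAL)""")

def run_stage(name, fn):
    start = time.time()
    fn()  # do the actual work for this stage
    conn.execute("INSERT INTO runs VALUES (?, ?, ?, ?)",
                 (name, "success", start, time.time()))
    conn.commit()

for stage in ["ingest", "validate", "transform", "train", "deploy"]:
    run_stage(stage, lambda: None)  # placeholder work

print(conn.execute("SELECT stage, status FROM runs").fetchall())
```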
Running on CDW, it is fully integrated with streaming, data engineering, and machine learning analytics. It has a consistent framework that secures and provides governance for all data and metadata on private clouds, multiple public clouds, or hybrid clouds. Both data and metadata must be considered in the migration.
The architecture is three-layered: Database Storage: Snowflake has a mechanism to reorganize data into its internal optimized, compressed, columnar format, and it stores this optimized data in cloud storage. This layer handles all aspects of data storage, such as organization, file size, structure, compression, metadata, and statistics.
This database then hosts the workloads for the Data Flows in that Data Service. As a result, you can distribute workloads across more infrastructure, and align your Ascend teams with their access to different Snowflake resources. Now we add Data Plane configuration for Databricks and BigQuery.
This architecture format consists of several key layers that are essential to helping an organization run fast analytics on structured and unstructured data.
Databricks announced that Delta table metadata will also be compatible with the Iceberg format, and Snowflake has also been moving aggressively to integrate with Iceberg. I think it’s safe to say it’s getting pretty cold in here. How Apache Iceberg tables structure metadata (image courtesy of Dremio). So, is Iceberg right for you?
Data Governance and Security: By defining data models, organizations can establish policies, access controls, and security measures to protect sensitive data. Data models can also facilitate compliance with regulations and ensure proper data handling and protection. Want to learn more about data governance?
Having a bigger and more specialized data team can help, but it can hurt if those team members don’t coordinate. More people accessing the data and running their own pipelines and their own transformations causes errors and impacts data stability. is a unified data observability platform built for data engineers.
Once a business need is defined and a minimum viable product (MVP) is scoped, the data management phase begins with: Data ingestion: Data is acquired, cleansed, and curated before it is transformed. Feature engineering: Data is transformed to support ML model training. ML workflow, ubr.to/3EJHjvm