Summary: Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. To level up its value, a new trend of active metadata is being adopted, enabling use cases like keeping BI reports up to date, auto-scaling your warehouses, and automating data governance.
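To make the idea concrete, here is a minimal, hypothetical sketch of what "active" metadata can look like in practice: a small loop that polls a metadata store for schema-change events and flags downstream BI reports as stale. The `fetch_schema_events`, `downstream_reports`, and `mark_report_stale` helpers are illustrative stand-ins, not any particular vendor's API.

```python
import time

def fetch_schema_events(since_ts):
    """Hypothetical: return schema-change events from a metadata store,
    e.g. [{"table": "sales.orders", "change": "column_added", "ts": 1700000000.0}]."""
    return []

def downstream_reports(table):
    """Hypothetical: use lineage metadata to find BI reports reading this table."""
    return []

def mark_report_stale(report):
    """Hypothetical: tag the report so its owners know it may need a refresh."""
    print(f"flagged stale: {report}")

last_seen = 0.0
while True:
    for event in fetch_schema_events(last_seen):
        # Active metadata: the event triggers an action instead of just being logged.
        for report in downstream_reports(event["table"]):
            mark_report_stale(report)
        last_seen = max(last_seen, event["ts"])
    time.sleep(60)  # poll once a minute
```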
Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. These patterns include both centralized storage patterns, like the data warehouse, data lake, and data lakehouse, and distributed patterns such as data mesh.
In August, we wrote about how, in a future where distributed data architectures are inevitable, unifying and managing operational and business metadata is critical to maximizing the value of data, analytics, and AI.
Open Table Format (OTF) architecture now provides a solution for efficient data storage, management, and processing while ensuring compatibility across different platforms. In this blog, we will discuss: what is the Open Table Format (OTF)? First, we create an Iceberg table in Snowflake and then insert some data.
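As a rough sketch of that first step, the snippet below creates a Snowflake-managed Iceberg table and inserts a row using the snowflake-connector-python package. The connection parameters and the `my_ext_volume` external volume are placeholders you would replace with your own configuration.

```python
import snowflake.connector

# Assumed placeholder credentials; an external volume must already be configured.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# Snowflake-managed Iceberg table; data lands on the external volume.
cur.execute("""
    CREATE ICEBERG TABLE IF NOT EXISTS customers (
        id INT,
        name STRING
    )
    CATALOG = 'SNOWFLAKE'
    EXTERNAL_VOLUME = 'my_ext_volume'
    BASE_LOCATION = 'customers'
""")

cur.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.close()
```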
Alireza Sadeghi: Open Source Data Engineering Landscape 2025. This article provides a comprehensive overview of the 2025 open-source data engineering landscape, highlighting key trends, active projects, and emerging technologies. I found the blog to be a comprehensive roadmap for data engineering in 2025.
Apache Iceberg’s ecosystem of diverse adopters, contributors, and commercial support continues to grow, establishing it as the industry-standard table format for an open data lakehouse architecture. Snowflake’s support for Iceberg Tables is now in public preview, helping customers build and integrate Snowflake into their lake architecture.
The Grab blog delights me, since I have tried to do this many times. A cross-encoder teacher model, fine-tuned on human-labeled data and enriched Pin metadata, was distilled into a lightweight student model using semi-supervised learning over billions of impressions. Kudos to the Grab team for building a docs-as-code system.
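For readers unfamiliar with that distillation setup, here is a minimal PyTorch-style sketch of the general teacher-student recipe, not the team's actual code: the student learns to match the frozen teacher's softened scores on unlabeled impressions, so expensive cross-encoder quality can be served by a cheap model.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, batch, temperature=2.0):
    """One semi-supervised distillation step: the frozen cross-encoder
    teacher scores query/item pairs, and the lightweight student is
    trained to reproduce those soft scores on unlabeled impressions."""
    with torch.no_grad():
        teacher_logits = teacher(batch)   # expensive, offline-quality model
    student_logits = student(batch)       # cheap model served online

    # Soften both distributions and minimize their KL divergence.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```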
Notion: Building and Scaling Notion's Data Lake. Notion writes about scaling its data lake by bringing critical data ingestion operations in-house. Hudi seems to be a de facto choice for CDC data lake features; Notion migrated its insert-heavy workload from Snowflake to Hudi.
While data warehouses are still in use, they are limited in their use cases, as they only support structured data. Data lakes add support for semi-structured and unstructured data, and data lakehouses add further flexibility, with better governance, in a true hybrid solution built from the ground up.
With the amount of data companies use growing to unprecedented levels, organizations are grappling with the challenge of efficiently managing and deriving insights from vast volumes of structured and unstructured data. What is a data lake? Consistency of data throughout the data lake.
In an effort to better understand where data governance is heading, we spoke with top executives from IT, healthcare, and finance to hear their thoughts on the biggest trends, the key challenges, and the recommendations they would offer. This blog is a collection of those insights, but for the full trendbook, we recommend downloading the PDF.
Apache Ozone is one of the major innovations introduced in CDP. It provides the next-generation storage architecture for big data applications, where data blocks are organized in storage containers for larger scale and for handling small objects. It collects and aggregates metadata from components and presents cluster state.
Catalog Integration: Our newly developed Catalog Integration feature allows you to seamlessly plug Snowflake into other Iceberg catalogs tracking table metadata. In this blog post, we’ll dive into the details of these features and the benefits for customers. In addition to Iceberg External Tables, we introduced Native Iceberg Tables.
Over the past few years, data lakes have emerged as a must-have for the modern data stack. But while the technologies powering our access and analysis of data have matured, the mechanics behind understanding this data in a distributed environment have lagged behind. Data discovery tools and platforms can help.
In the previous blog post in this series, we walked through the steps for leveraging deep learning in your Cloudera Machine Learning (CML) projects. Data ingestion: the raw data is in a series of CSV files. We will first convert this to Parquet format, as most data lakes exist as object stores full of Parquet files.
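A minimal sketch of that conversion step, assuming pandas and pyarrow are installed and the raw CSVs live in a local `data/` directory (both assumptions, not the original post's exact layout):

```python
from pathlib import Path
import pandas as pd

# Assumed layout: raw CSVs under data/, Parquet output under parquet/.
src, dst = Path("data"), Path("parquet")
dst.mkdir(exist_ok=True)

for csv_file in src.glob("*.csv"):
    df = pd.read_csv(csv_file)
    # Columnar Parquet is smaller and much faster to scan than row-oriented CSV.
    df.to_parquet(dst / f"{csv_file.stem}.parquet", index=False)
```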
In this blog, I will demonstrate the value of Cloudera DataFlow (CDF), the edge-to-cloud streaming data platform available on the Cloudera Data Platform (CDP), as a data integration and democratization fabric. Data and metadata: the data inputs and data outputs produced based on the application logic.
Cloudera customers run some of the biggest data lakes on Earth. These lakes power mission-critical, large-scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes.
This blog post outlines detailed step-by-step instructions for performing Hive replication from an on-prem CDH cluster to a CDP Public Cloud Data Lake. CDP Data Lake cluster version: CM 7.4.0. Pre-check: Data Lake cluster.
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. RudderStack helps you build a customer data platform on your warehouse or data lake.
With Cloudera’s vision of hybrid data, enterprises adopting an open data lakehouse can easily get application interoperability and portability to and from on-premises environments and any public cloud, without worrying about data scaling. Why integrate Apache Iceberg with Cloudera Data Platform?
Change Data Capture (CDC) has emerged as an ideal solution for near-real-time movement of data from relational databases (like SQL Server or Oracle) to data warehouses, data lakes, or other databases. Data can be extracted using database queries (batch-based) or Change Data Capture (near-real-time).
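To illustrate the batch-based side of that comparison, here is a minimal watermark-style extraction sketch; the table, column names, and database are assumptions for illustration. True CDC would instead tail the database's transaction log, for example via a tool like Debezium.

```python
import sqlite3  # stand-in for any DB-API driver (SQL Server, Oracle, ...)

def extract_changes(conn, last_watermark):
    """Batch-based extraction: pull rows modified since the last run,
    using an updated_at column as the watermark (assumed schema)."""
    cur = conn.execute(
        "SELECT id, name, updated_at FROM orders WHERE updated_at > ? "
        "ORDER BY updated_at",
        (last_watermark,),
    )
    rows = cur.fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

conn = sqlite3.connect("example.db")
rows, watermark = extract_changes(conn, "1970-01-01 00:00:00")
# Trade-off: polling on a schedule adds latency and misses hard deletes,
# which is exactly what log-based CDC addresses.
```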
With CDW, as an integrated service of CDP, your line of business gets immediate resources needed for faster application launches and expedited data access, all while protecting the company’s multi-year investment in centralized data management, security, and governance. Proprietary file formats mean no one else is invited in!
When you register an environment in CDP, a Data Lake is automatically deployed for that environment. Data Lake security and governance is managed by a shared set of services running within a Data Lake cluster. Apache Atlas — metadata management and governance: lineage, analytics, attributes.
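As a taste of what Atlas exposes, the sketch below queries the Atlas v2 lineage REST endpoint for a single entity. The host, credentials, and `GUID` value are placeholders; in a CDP Data Lake you would point this at the cluster's Atlas endpoint with your own authentication.

```python
import requests

ATLAS_URL = "http://atlas-host:21000"  # assumed host; 21000 is Atlas's default port
GUID = "replace-with-entity-guid"      # e.g. the GUID of a Hive table entity

resp = requests.get(
    f"{ATLAS_URL}/api/atlas/v2/lineage/{GUID}",
    params={"direction": "BOTH", "depth": 3},
    auth=("admin", "admin"),           # assumed basic-auth credentials
)
resp.raise_for_status()

# Each relation is an edge in the lineage graph around the entity.
for rel in resp.json().get("relations", []):
    print(rel["fromEntityId"], "->", rel["toEntityId"])
```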
You'll be seen as the most technical person on a data team, and you'll need to help your team with the "low-level" stuff. You'll also be asked to put a data infrastructure in place, meaning a data warehouse, a data lake, or other concepts starting with "data". Is it really modern?
Do ETL and data integration activities seem complex to you? Read this blog to understand everything about AWS Glue that makes it one of the most popular data integration solutions in the industry. Did you know the global big data market will likely reach $268.4 billion? Businesses are leveraging big data now more than ever.
Analyst Sumit Pal, in “Exploring Lakehouse Architecture and Use Cases,” published January 11, 2022: “Data lakehouses integrate and unify the capabilities of data warehouses and data lakes, aiming to support AI, BI, ML, and data engineering on a single platform.” Iceberg handles massive data born in the cloud.
ELT – keep all source tables and use dbt to convert the relevant tables into star/snowflake/data vault/wide tables. What are your thoughts on the viability of a data lake as the destination system, e.g., for APIs and third-party data sources? How can we integrate CDC into metadata/lineage tooling?
The solution: CDP Private Cloud brings a next-generation hybrid architecture with cloud-native benefits to HBL's data platform. HBL started its data journey in 2019, when a data lake initiative was launched to consolidate complex data sources and enable the bank to use a single version of truth for decision-making.
Access audits are mastered centrally in Apache Ranger, which provides a comprehensive, non-repudiable audit log for every access event to every resource, with rich access-event metadata such as IP. Both fine-grained access control of database objects and access to metadata are provided. Sensitive data identification.
Data lakehouse: data lakehouses integrate and unify the capabilities of data warehouses and data lakes, aiming to support artificial intelligence, business intelligence, machine learning, and data engineering use cases on a single platform (Forrester).
The platform converges data cataloging, data ingestion, data profiling, data tagging, data discovery, and data exploration into a unified platform, driven by metadata. Modak Nabu automates repetitive tasks in the data preparation process, accelerating data preparation by 4x.
CSP was recently recognized as a leader in the 2022 GigaOm Radar for Streaming Data Platforms report. The DevOps/app dev team wants to know how data flows between such entities and understand the key performance metrics (KPMs) of these entities. She is a smart data analyst and former DBA working at a planet-scale manufacturing company.
Customers who have chosen Google Cloud as their cloud platform can now use CDP Public Cloud to create secure, governed data lakes in their own cloud accounts and deliver security, compliance, and metadata management across multiple compute clusters. You can get started with CDP Public Cloud by requesting a trial account here.
First generation – expensive, proprietary enterprise data warehouse and business intelligence platforms maintained by a specialized team drowning in technical debt. Second generation – gigantic, complex data lake maintained by a specialized team drowning in technical debt.
Overview: This blog post describes support for materialized views for the Iceberg table format. Iceberg brings the reliability and simplicity of SQL tables to big data while enabling engines like Hive, Impala, Spark, Trino, Flink, and Presto to work with the same tables at the same time. These tables are created as Iceberg tables.
While this approach provides isolation, it creates another significant challenge: duplication of data, metadata, and security policies, or a 'split-brain' data lake. Now the admins need to synchronize multiple copies of the data and metadata and ensure that users across the many clusters are not viewing stale information.
Open source data lakehouse deployments are built on the foundations of compute engines (like Apache Spark, Trino, Apache Flink), distributed storage (HDFS, cloud blob stores), and metadata catalogs / table formats (like Apache Iceberg, Delta, Hudi, Apache Hive Metastore). Tables are governed per agreed-upon company standards.
With in-place table migration, you can rapidly convert to Iceberg tables, since there is no need to regenerate data files. Only metadata will be regenerated. Newly generated metadata will then point to the source data files, as illustrated in the diagram below. Data quality using table rollback. Metadata management.
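For a sense of what in-place migration and table rollback look like in practice, here is a PySpark sketch using Iceberg's built-in Spark procedures. The catalog name, table name, and snapshot id are placeholders, and the session is assumed to already be configured with the Iceberg runtime.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime and a catalog named `spark_catalog`
# are already configured on the session.
spark = SparkSession.builder.appName("iceberg-migrate").getOrCreate()

# In-place migration: only metadata is rewritten; the existing data
# files are referenced as-is, not regenerated.
spark.sql("CALL spark_catalog.system.migrate('db.events')")

# Table rollback for data quality: revert the table to a known-good
# snapshot (the snapshot id here is a placeholder).
spark.sql(
    "CALL spark_catalog.system.rollback_to_snapshot('db.events', 1234567890)"
)
```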
Cloudera has long had the capabilities of a data lakehouse, if not the label. Cloudera enables an open data lakehouse architecture that combines all the flexibility of the data lake with the performance of the data warehouse, so enterprises can use all data — both structured and unstructured.
Cloudera Data Platform (CDP) scored among the top 10 vendors on all four Analytical Use Cases — Data Warehouse, Logical Data Warehouse, Data Lake, and Operational Intelligence — in the Critical Capabilities for Cloud Database Management Systems for Analytics Use Cases.
The table information (such as schema and partitioning) is stored separately as part of the metadata (manifest) file, making it easier for applications to quickly integrate with the tables and the storage formats of their choice.
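To see that metadata-first design from the client side, here is a small sketch using the pyiceberg library: it loads a table through a catalog and inspects the schema and partition spec without scanning any data files. The catalog name, URI, and table name are placeholders.

```python
from pyiceberg.catalog import load_catalog

# Placeholder catalog config; in practice this points at your REST/Hive/Glue catalog.
catalog = load_catalog("default", **{"uri": "http://localhost:8181"})

table = catalog.load_table("db.events")

# All of this comes from the table's metadata files; no data files are read.
print(table.schema())            # column names and types
print(table.spec())              # partition spec
print(table.current_snapshot())  # latest snapshot metadata
```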
Some examples of recent optimizations in Impala include: a new multithreading model (see dedicated blog post); remote read optimizations, such as IMPALA-8341 (data cache for remote reads) and IMPALA-8690 (add LIRS cache eviction algorithm); Impala's use of KRPC (see dedicated blog post); and Parquet page indexes (see dedicated blog post).
I took the free version of ChatGPT for a test drive (in March 2023) and asked some simple questions about the data lakehouse and its components. Hopefully this blog will give ChatGPT an opportunity to learn and correct itself, while counting toward my 2023 contribution to social good. I thought this was a fairly comprehensive list.
The domain also includes code that acts upon the data, including tools, pipelines, and other artifacts that drive analytics execution. The domain requires a team that creates, updates, and runs the domain, and we can't forget metadata: catalogs, lineage, test results, processing history, etc.